[
https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902897#action_12902897
]
Gregory Kanevsky commented on TIKA-100:
---------------------------------------
This issue seems to be partially fixed: PDF2XHTML now emits <div><p> at the
start of each page and </p></div> at the end.
A second part of this issue is the ordering of PDF content. PDF2XHTML
extends PDFBox's PDFTextStripper to extract text, and PDFTextStripper's
'sortByPosition' mode is off by default (for performance reasons). Text
therefore comes out in content-stream order, which need not match the
visual reading order of the page.
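To illustrate what position sorting changes, here is a minimal, simplified
sketch. It is not PDFBox's actual algorithm; the Fragment class, its
coordinates (y growing downward) and the sortByPosition helper are
hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {
    /** Hypothetical text fragment with page coordinates (y grows downward). */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy ordered top to bottom, then left to right. */
    static List<Fragment> sortByPosition(List<Fragment> streamOrder) {
        List<Fragment> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    public static void main(String[] args) {
        // Stream order: the footer happens to be written first in the file.
        List<Fragment> streamOrder = List.of(
                new Fragment("footer", 50f, 700f),
                new Fragment("Title", 50f, 50f),
                new Fragment("body", 50f, 100f));
        for (Fragment f : sortByPosition(streamOrder)) {
            System.out.println(f.text);
        }
    }
}
```

The extra sort pass over every page is presumably where the performance
cost of the mode comes from.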
I propose introducing an (input) metadata property that turns this mode on
when desired. I am not sure what the conventions are for defining such
metadata properties (if there are any). The mode would be set in the
PDF2XHTML constructor:
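    private PDF2XHTML(ContentHandler handler, Metadata metadata)
            throws IOException {
        // Constant-first comparison: metadata.get() returns null when the
        // property is not set, so calling equalsIgnoreCase() on its result
        // would throw a NullPointerException.
        if ("true".equalsIgnoreCase(metadata.get("setSortByPosition"))) {
            setSortByPosition(true);
        }
        ....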
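One detail worth care in a check like this: Metadata.get() returns null for
a property that was never set, so the comparison is safest with the string
constant on the left. Below is a minimal sketch of that check, using a plain
Map as a stand-in for Tika's Metadata so that it runs on its own. The
SortFlagSketch class and isSortRequested helper are hypothetical, and
"setSortByPosition" is only the key name proposed in this comment, not an
established convention.

```java
import java.util.HashMap;
import java.util.Map;

public class SortFlagSketch {
    // Key name as proposed in this comment; an assumption, not a Tika constant.
    static final String SORT_KEY = "setSortByPosition";

    /** Null-safe check: an absent key yields false instead of an exception. */
    static boolean isSortRequested(Map<String, String> metadata) {
        return "true".equalsIgnoreCase(metadata.get(SORT_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println(isSortRequested(metadata)); // key absent

        metadata.put(SORT_KEY, "TRUE");
        System.out.println(isSortRequested(metadata)); // case-insensitive match
    }
}
```

With this shape, callers that never set the property keep the current
default behaviour (sorting off).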
> Structured PDF parsing
> ----------------------
>
> Key: TIKA-100
> URL: https://issues.apache.org/jira/browse/TIKA-100
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Jukka Zitting
> Priority: Minor
>
> The PDF parser currently extracts and outputs document content as a single
> string. PDFBox could be used to support structuring at least down to page and
> paragraph (not sure how accurate) level.