[jira] Commented: (TIKA-100) Structured PDF parsing

Gregory Kanevsky (JIRA) Thu, 26 Aug 2010 08:31:35 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902897#action_12902897
 ]


Gregory Kanevsky commented on TIKA-100:
---------------------------------------

This issue seems to be partially fixed. PDF2XHTML now emits <div><p> at the 
start of each page and </p></div> at the end. 

Another issue that is part of this is the ordering of PDF content. PDF2XHTML 
extends PDFBox's PDFTextStripper to extract text. By default (for performance 
reasons), PDFTextStripper's 'sortByPosition' mode is turned off. 
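
To illustrate why the ordering matters: a PDF content stream is not required 
to emit text in reading order, so extraction that follows stream order can 
return text out of sequence. Below is a minimal, self-contained sketch of 
position-based ordering. The Fragment class and the sortByPosition helper are 
hypothetical names made up for this illustration; this is not PDFBox's actual 
implementation, which I have not reproduced here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical text fragment; origin at top-left, y grows downward. */
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins fragment texts in top-to-bottom, then left-to-right order. */
    static String sortByPosition(List<Fragment> streamOrder) {
        // Copy first so the caller's list keeps its original (stream) order.
        List<Fragment> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));

        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in an arbitrary "stream" order.
        List<Fragment> streamOrder = new ArrayList<>();
        streamOrder.add(new Fragment(10f, 200f, "footer"));
        streamOrder.add(new Fragment(60f, 20f, "title"));
        streamOrder.add(new Fragment(10f, 20f, "Page"));

        System.out.println(sortByPosition(streamOrder));
    }
}
```

The extra sort is the cost that the default avoids, which is why an opt-in 
switch seems preferable to enabling it unconditionally.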

I propose introducing an (input) metadata property that would turn it on when 
desired. I am not sure what the conventions are (if any) for defining such 
metadata properties, though. The mode would be set in the PDF2XHTML 
constructor:

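private PDF2XHTML(ContentHandler handler, Metadata metadata)
            throws IOException {
        
        // Constant-first comparison: metadata.get() returns null when the
        // property is unset, which would otherwise cause a NullPointerException.
        if ("true".equalsIgnoreCase(metadata.get("setSortByPosition"))) {
                setSortByPosition(true);
        }

        ....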
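
One caveat with a check of this kind: Metadata.get returns null for a property 
that was never set, so the lookup needs to be null-safe. Here is a small 
standalone sketch of such a check. It uses a plain Map<String, String> as a 
stand-in for Tika's Metadata so that it runs without Tika on the classpath; 
the helper name isSortByPositionRequested is hypothetical, and the property 
name "setSortByPosition" is only the one suggested above, not an established 
convention.

```java
import java.util.HashMap;
import java.util.Map;

public class SortFlagSketch {

    /** Property name proposed in this comment; not an agreed-upon key. */
    static final String SORT_KEY = "setSortByPosition";

    /**
     * Returns true only when the property is present and equals "true"
     * (ignoring case). Boolean.parseBoolean(null) is false, so a missing
     * property simply leaves sorting off.
     */
    static boolean isSortByPositionRequested(Map<String, String> metadata) {
        return Boolean.parseBoolean(metadata.get(SORT_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println(isSortByPositionRequested(metadata)); // property absent

        metadata.put(SORT_KEY, "TRUE");
        System.out.println(isSortByPositionRequested(metadata)); // property set
    }
}
```

In the constructor, the result of such a check would be passed to 
setSortByPosition(...), leaving the current default untouched for callers that 
do not set the property.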

> Structured PDF parsing
> ----------------------
>
>                 Key: TIKA-100
>                 URL: https://issues.apache.org/jira/browse/TIKA-100
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> The PDF parser currently extracts and outputs document content as a single 
> string. PDFBox could be used to support structuring at least down to page and 
> paragraph (not sure how accurate) level.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
