[ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1201:
------------------------------

    Attachment: TIKA-1201.patch

Trivial patch

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
>                 Key: TIKA-1201
>                 URL: https://issues.apache.org/jira/browse/TIKA-1201
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>         Environment: all
>            Reporter: Hong-Thai Nguyen
>            Assignee: Tim Allison
>            Priority: Critical
>         Attachments: TIKA-1201.patch
>
>
> As discussed, the new NonSequentialPDFParser can improve PDF extraction by 45%
> and conforms more closely to the PDF specification. This parser will be the
> default in PDFBox 2.0.
> References:
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or add a parameter to the current PDFParser, so that it calls:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}
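
One way to read the "parameter" option in the description is a boolean switch that selects the loader. The following is only a minimal sketch of that idea, not the attached patch: `PdfConfig`, its `useNonSequentialParser` flag, and the `load` helper are hypothetical names, and the two loaders are stand-ins so the example runs without PDFBox on the classpath. In a real parser the non-sequential loader would wrap the `PDDocument.loadNonSeq(file, scratchFile)` call quoted above.

```java
import java.util.function.Function;

public class ParserSwitch {

    /** Hypothetical config holder; the flag name is an assumption, not Tika's API. */
    public static final class PdfConfig {
        // Sequential parsing stays the default so existing behavior is unchanged.
        private boolean useNonSequentialParser = false;

        public boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        public void setUseNonSequentialParser(boolean value) {
            this.useNonSequentialParser = value;
        }
    }

    /**
     * Picks one of two loaders based on the flag. In a real parser, nonSequential
     * would wrap PDDocument.loadNonSeq(file, scratchFile) and sequential would
     * wrap whatever loading call is used today.
     */
    public static <S, D> D load(S source, PdfConfig config,
                                Function<S, D> sequential,
                                Function<S, D> nonSequential) {
        Function<S, D> loader =
                config.isUseNonSequentialParser() ? nonSequential : sequential;
        return loader.apply(source);
    }

    public static void main(String[] args) {
        PdfConfig config = new PdfConfig();
        // Stand-in loaders that only report which path was taken.
        Function<String, String> seq = name -> "sequential:" + name;
        Function<String, String> nonSeq = name -> "non-sequential:" + name;

        System.out.println(load("test.pdf", config, seq, nonSeq));
        config.setUseNonSequentialParser(true);
        System.out.println(load("test.pdf", config, seq, nonSeq));
    }
}
```

Keeping the flag off by default means callers opt in explicitly, which seems the safer choice while the non-sequential parser is not yet the PDFBox default.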



--
This message was sent by Atlassian JIRA
(v6.1#6144)
