[ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1201.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.5

Basic parameter-based capability added in r1547250.  Users should be aware that 
metadata processing may differ between the NonSequentialPDFParser and the 
traditional parser.  I will open a separate issue to track the failure to 
extract metadata from testAnnotations.pdf.
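
For readers unfamiliar with what a parameter-based switch looks like, here is a
minimal sketch. It is not the code committed in r1547250. The class
ParserSwitchSketch, the Config holder, its useNonSequentialParser flag, the
LoadStrategy enum and the select method are all hypothetical names chosen for
illustration. The only call taken from this issue is
PDDocument.loadNonSeq(file, scratchFile), and it appears only in a comment so
that the sketch runs without pdfbox on the classpath.

```java
public class ParserSwitchSketch {

    /** Hypothetical labels for the two loading paths discussed in this issue. */
    public enum LoadStrategy { SEQUENTIAL, NON_SEQUENTIAL }

    /** Hypothetical config holder; the flag name is an assumption. */
    public static final class Config {
        // Default to the traditional parser so existing behavior is unchanged.
        private boolean useNonSequentialParser = false;

        public boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        public void setUseNonSequentialParser(boolean value) {
            this.useNonSequentialParser = value;
        }
    }

    /** Picks the loading path from the flag. */
    public static LoadStrategy select(Config config) {
        if (config.isUseNonSequentialParser()) {
            // A real parser would call PDDocument.loadNonSeq(file, scratchFile) here.
            return LoadStrategy.NON_SEQUENTIAL;
        }
        // A real parser would keep its existing sequential load call here.
        return LoadStrategy.SEQUENTIAL;
    }

    public static void main(String[] args) {
        Config config = new Config();
        System.out.println(select(config));   // prints SEQUENTIAL
        config.setUseNonSequentialParser(true);
        System.out.println(select(config));   // prints NON_SEQUENTIAL
    }
}
```

Keeping the flag off by default means callers only see the metadata differences
mentioned above if they opt in.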

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
>                 Key: TIKA-1201
>                 URL: https://issues.apache.org/jira/browse/TIKA-1201
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>         Environment: all
>            Reporter: Hong-Thai Nguyen
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: TIKA-1201.patch
>
>
> As discussed, the new NonSequentialPDFParser can improve PDF extraction by 
> 45% and conforms more closely to the PDF specification. This parser will 
> become the default in pdfbox 2.0.
> ref.: 
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or a parameter on the current PDFParser, that calls:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)
