[
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-1201.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.5
Added a basic parameter-based switch in r1547250. Be aware that metadata
processing may differ between the NonSequentialPDFParser and the traditional
parser. I will open a separate issue to track the failure to extract metadata
from testAnnotations.pdf.
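
A minimal sketch of what a parameter-based switch of this kind could look like.
It uses hypothetical stand-in types (PdfLoadConfig, DocumentLoader,
LoaderSwitchSketch), not the actual Tika or PDFBox classes, and it assumes the
traditional parser stays the default unless the flag is set:

```java
import java.util.Objects;

// Hypothetical configuration holder; a stand-in, not the real Tika class.
final class PdfLoadConfig {
    private final boolean useNonSequentialParser;

    PdfLoadConfig(boolean useNonSequentialParser) {
        this.useNonSequentialParser = useNonSequentialParser;
    }

    boolean useNonSequentialParser() {
        return useNonSequentialParser;
    }
}

// Hypothetical abstraction over how a PDF document is opened.
interface DocumentLoader {
    String describe();
}

final class LoaderSwitchSketch {
    // Stand-in for the traditional loading path.
    static final DocumentLoader SEQUENTIAL = () -> "sequential";

    // Stand-in for the path that would call PDDocument.loadNonSeq(file, scratchFile).
    static final DocumentLoader NON_SEQUENTIAL = () -> "non-sequential";

    // Pick the loading path from the flag; false keeps the traditional parser.
    static DocumentLoader chooseLoader(PdfLoadConfig config) {
        Objects.requireNonNull(config, "config");
        return config.useNonSequentialParser() ? NON_SEQUENTIAL : SEQUENTIAL;
    }

    public static void main(String[] args) {
        System.out.println(chooseLoader(new PdfLoadConfig(false)).describe());
        System.out.println(chooseLoader(new PdfLoadConfig(true)).describe());
    }
}
```

Keeping the choice behind a single flag means callers who do nothing get the
traditional behavior, while anyone affected by the metadata differences can
opt in or out without code changes elsewhere.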
> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
> Key: TIKA-1201
> URL: https://issues.apache.org/jira/browse/TIKA-1201
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.4
> Environment: all
> Reporter: Hong-Thai Nguyen
> Assignee: Tim Allison
> Priority: Critical
> Fix For: 1.5
>
> Attachments: TIKA-1201.patch
>
>
> As discussed, the new NonSequentialPDFParser can improve PDF extraction by
> 45% and conforms more closely to the PDF specification. This parser will be
> the default in pdfbox 2.0.
> ref.:
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or add a parameter to the current
> PDFParser, that calls:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}