[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-1201.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5

Basic parameter-based capability added in r1547250. Users should be aware that metadata processing may differ between the NonSequentialPDFParser and the traditional parser. I will open an issue to track the failure to extract metadata from testAnnotations.pdf.

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
>                 Key: TIKA-1201
>                 URL: https://issues.apache.org/jira/browse/TIKA-1201
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>         Environment: all
>            Reporter: Hong-Thai Nguyen
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: TIKA-1201.patch
>
>
> As discussed, we can improve PDF extraction by 45% with the new
> NonSequentialPDFParser, which also conforms more closely to the PDF
> specification. This parser will be integrated by default in pdfbox 2.0.
> ref.:
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or add a parameter to the current
> PDFParser, that calls:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)