[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1201:
------------------------------
    Attachment: TIKA-1201.patch

Trivial patch

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
>                 Key: TIKA-1201
>                 URL: https://issues.apache.org/jira/browse/TIKA-1201
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>         Environment: all
>            Reporter: Hong-Thai Nguyen
>            Assignee: Tim Allison
>            Priority: Critical
>         Attachments: TIKA-1201.patch
>
>
> As discussed, the new NonSequentialPDFParser can improve PDF extraction by 45%
> and conforms more closely to the PDF specification. This parser will be
> the default in pdfbox 2.0.
> ref.:
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or a parameter on the current PDFParser, to call:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}
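
A minimal sketch of the "parameter on the current PDFParser" option described above. This is not the attached patch: the names PdfLoaderSwitch, Config, Loader and useNonSequentialParser are hypothetical, and the two loaders return plain strings in place of the PDFBox calls so that the example runs without PDFBox on the classpath. In a real parser the two branches would presumably call PDDocument.loadNonSeq(file, scratchFile) and the existing PDDocument.load path.

```java
/** Hypothetical illustration of a config switch; not Tika's or PDFBox's actual API. */
class PdfLoaderSwitch {

    /** Stand-in for a PDFBox entry point; a real loader would return a PDDocument. */
    interface Loader {
        String load(String path);
    }

    /** Hypothetical parser configuration carrying the proposed flag. */
    static final class Config {
        // Defaults to false so existing behaviour is unchanged unless the user opts in.
        private boolean useNonSequentialParser = false;

        boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        void setUseNonSequentialParser(boolean value) {
            this.useNonSequentialParser = value;
        }
    }

    /** Picks the loader according to the flag. */
    static Loader select(Config config) {
        if (config.isUseNonSequentialParser()) {
            // Assumption: real code would call PDDocument.loadNonSeq(file, scratchFile) here.
            return path -> "nonseq:" + path;
        }
        // Assumption: real code would keep the existing sequential PDDocument.load call here.
        return path -> "sequential:" + path;
    }

    public static void main(String[] args) {
        Config config = new Config();
        System.out.println(select(config).load("example.pdf"));

        config.setUseNonSequentialParser(true);
        System.out.println(select(config).load("example.pdf"));
    }
}
```

Keeping the flag off by default means users on 1.4 see no change in behaviour until they opt in, which seems the safer choice while the non-sequential parser is not yet the pdfbox default.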