[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1201:
----------------------------------
    Summary: Add possibility for switching to pdfbox NonSequentialPDFParser  (was: Add option for switching to pdfbox NonSequentialPDFParser)

Add possibility for switching to pdfbox NonSequentialPDFParser
--------------------------------------------------------------

                Key: TIKA-1201
                URL: https://issues.apache.org/jira/browse/TIKA-1201
            Project: Tika
         Issue Type: Improvement
         Components: parser
  Affects Versions: 1.4
        Environment: all
           Reporter: Hong-Thai Nguyen
           Priority: Critical

As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be integrated by default in pdfbox 2.0.

ref.:
https://issues.apache.org/jira/browse/PDFBOX-1104
http://pdfbox.apache.org/ideas.html

We should provide an extended parser, or add a parameter to the current PDFParser, that calls:
{code}
PDDocument.loadNonSeq(file, scratchFile);
{code}

--
This message was sent by Atlassian JIRA
(v6.1#6144)
[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1201:
-----------------------------
    Attachment: TIKA-1201.patch

Trivial patch

Add possibility for switching to pdfbox NonSequentialPDFParser
--------------------------------------------------------------

                Key: TIKA-1201
                URL: https://issues.apache.org/jira/browse/TIKA-1201
            Project: Tika
         Issue Type: Improvement
         Components: parser
  Affects Versions: 1.4
        Environment: all
           Reporter: Hong-Thai Nguyen
           Assignee: Tim Allison
           Priority: Critical
        Attachments: TIKA-1201.patch

As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be integrated by default in pdfbox 2.0.

ref.:
https://issues.apache.org/jira/browse/PDFBOX-1104
http://pdfbox.apache.org/ideas.html

We should provide an extended parser, or add a parameter to the current PDFParser, that calls:
{code}
PDDocument.loadNonSeq(file, scratchFile);
{code}

--
This message was sent by Atlassian JIRA
(v6.1#6144)
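
For illustration only, the "parameter on the current PDFParser" option described in the issue could be sketched as a boolean setting that picks between two loading strategies. This is not the attached patch, whose contents are not shown in this thread. The names PdfLoadSwitch, Config, useNonSequentialParser and load are hypothetical, and the loaders are plain functions so that the sketch runs without pdfbox on the classpath. In a real parser the two functions would presumably wrap PDDocument.load(...) and PDDocument.loadNonSeq(file, scratchFile) and would have to deal with IOException.

```java
import java.util.function.Function;

/** Hypothetical sketch: a flag that switches between two PDF loading strategies. */
final class PdfLoadSwitch {

    /** Hypothetical configuration holder; the flag name is an assumption. */
    static final class Config {
        // Default to the existing sequential behaviour so current users are unaffected.
        private boolean useNonSequentialParser = false;

        boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        void setUseNonSequentialParser(boolean value) {
            this.useNonSequentialParser = value;
        }
    }

    /**
     * Chooses a loader according to the flag and applies it to the given path.
     * D stands in for the document type (PDDocument in a real parser).
     */
    static <D> D load(Config config,
                      String path,
                      Function<String, D> sequentialLoader,
                      Function<String, D> nonSequentialLoader) {
        Function<String, D> chosen = config.isUseNonSequentialParser()
                ? nonSequentialLoader
                : sequentialLoader;
        return chosen.apply(path);
    }

    public static void main(String[] args) {
        Config config = new Config();

        // Stand-in loaders that only report which strategy was selected.
        Function<String, String> sequential = p -> "sequential:" + p;
        Function<String, String> nonSequential = p -> "nonSequential:" + p;

        String before = load(config, "report.pdf", sequential, nonSequential);
        System.out.println(before); // prints "sequential:report.pdf"

        config.setUseNonSequentialParser(true);
        String after = load(config, "report.pdf", sequential, nonSequential);
        System.out.println(after); // prints "nonSequential:report.pdf"
    }
}
```

Keeping the flag off by default leaves existing behaviour unchanged, so the non-sequential parser is strictly opt-in until pdfbox 2.0 makes it the default.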