[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1201:
---

Summary: Add possibility for switching to pdfbox NonSequentialPDFParser  
(was: Add option for switching to pdfbox NonSequentialPDFParser)

 Add possibility for switching to pdfbox NonSequentialPDFParser
 --

 Key: TIKA-1201
 URL: https://issues.apache.org/jira/browse/TIKA-1201
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Priority: Critical

 As discussed, the new NonSequentialPDFParser can improve PDF extraction by 
 45% and conforms more closely to the PDF specification. This parser will 
 be the default in pdfbox 2.0.
 References:
 https://issues.apache.org/jira/browse/PDFBOX-1104
 http://pdfbox.apache.org/ideas.html
 We should either provide an extended parser or add a parameter to the 
 current PDFParser so that it calls:
 {code}
 PDDocument.loadNonSeq(file, scratchFile);
 {code}
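
 One way the "parameter" variant could look is sketched below. This is only an illustration, not the attached patch: the class PdfLoadSwitch, the Config holder, the option name useNonSequentialParser and the helper chooseLoader are all hypothetical names. To stay runnable without pdfbox on the classpath, the sketch returns the name of the entry point it would use; the comments mark where the real PDDocument calls would go.

```java
/** Hypothetical sketch: a boolean option that selects the PDF loading entry point. */
public class PdfLoadSwitch {

    /** Hypothetical config holder; the option name and its default are assumptions. */
    static class Config {
        // Default to the existing sequential parser so current behaviour is unchanged.
        private boolean useNonSequentialParser = false;

        boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        void setUseNonSequentialParser(boolean value) {
            this.useNonSequentialParser = value;
        }
    }

    /**
     * Returns the name of the pdfbox entry point that would be called.
     * With pdfbox available, the two branches would instead call
     * PDDocument.loadNonSeq(file, scratchFile) and PDDocument.load(file).
     */
    static String chooseLoader(Config config) {
        return config.isUseNonSequentialParser() ? "loadNonSeq" : "load";
    }

    public static void main(String[] args) {
        Config config = new Config();
        System.out.println(chooseLoader(config)); // prints "load"

        config.setUseNonSequentialParser(true);
        System.out.println(chooseLoader(config)); // prints "loadNonSeq"
    }
}
```

 Keeping the sequential parser as the default means existing callers see no change unless they opt in, which seems the safer choice until pdfbox 2.0 makes the non-sequential parser standard.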



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1201:
--

Attachment: TIKA-1201.patch

Trivial patch

 Add possibility for switching to pdfbox NonSequentialPDFParser
 --

 Key: TIKA-1201
 URL: https://issues.apache.org/jira/browse/TIKA-1201
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Assignee: Tim Allison
Priority: Critical
 Attachments: TIKA-1201.patch


 As discussed, the new NonSequentialPDFParser can improve PDF extraction by 
 45% and conforms more closely to the PDF specification. This parser will 
 be the default in pdfbox 2.0.
 References:
 https://issues.apache.org/jira/browse/PDFBOX-1104
 http://pdfbox.apache.org/ideas.html
 We should either provide an extended parser or add a parameter to the 
 current PDFParser so that it calls:
 {code}
 PDDocument.loadNonSeq(file, scratchFile);
 {code}


