[
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837482#comment-13837482
]
Timo Boehme commented on TIKA-1201:
-----------------------------------
Hi,
I would just like to clarify what can be expected from using the
NonSequentialPDFParser. PDFBOX-1104 was only a starting point; the parser as it
is implemented now can be found in issue PDFBOX-1199. While in principle the
parser could be faster for extracting single pages, it currently parses the
whole document because the other classes working on the parser output expect all
objects to be available (on-demand parsing might be available in version 2).
Thus it is (only) faster if the document contains unused objects (e.g. after the
document was edited), which the 'old' parser still analyzes.
However, the real advantage of this parser is that it conforms much more
closely to the PDF specification and has no problems with unused content in PDF
files (where the 'old' one often failed).
Differences in behavior/results compared to the 'old' parser may arise if
- the document contains unused content (the 'old' parser may interpret/use it)
- the document is not a valid PDF (the 'new' parser needs correct XREF table
entries, while the 'old' one finds the objects during parsing)
- the document was edited (the 'new' parser should correctly parse the latest
version; the 'old' parser may not in every case)
Thus it is highly recommended to use the 'new' parser, not because of its
speed but because of its much better parsing capabilities.
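To make the three differences listed above concrete, here is a toy Java model.
It is not PDFBox code: every class and method name in it is made up for
illustration, and a "file" is just a list of objects whose offset is its list
index. It only sketches the idea that a sequential scan picks up every object
it meets, while an XREF-driven load reads only what the table points at and
fails when an entry is wrong.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these types are PDFBox classes. */
public class XrefVsSequentialSketch {

    /** A stored object; its "offset" is simply its index in the file list. */
    static final class StoredObject {
        final int number;
        final String payload;
        StoredObject(int number, String payload) {
            this.number = number;
            this.payload = payload;
        }
    }

    /** Sequential strategy: walk the file front to back, collect everything. */
    static List<String> scanSequentially(List<StoredObject> file) {
        List<String> found = new ArrayList<String>();
        for (StoredObject o : file) {
            found.add(o.number + ":" + o.payload);
        }
        return found;
    }

    /** XREF-driven strategy: load only the objects the table points at. */
    static List<String> loadViaXref(List<StoredObject> file, Map<Integer, Integer> xref) {
        List<String> found = new ArrayList<String>();
        for (Map.Entry<Integer, Integer> e : xref.entrySet()) {
            int offset = e.getValue();
            boolean valid = offset >= 0 && offset < file.size()
                    && file.get(offset).number == e.getKey();
            if (!valid) {
                throw new IllegalStateException("bad xref entry for object " + e.getKey());
            }
            found.add(e.getKey() + ":" + file.get(offset).payload);
        }
        return found;
    }

    /** An "edited" file: object 2 was rewritten, the old copy is now unused. */
    static List<StoredObject> editedFile() {
        return Arrays.asList(
                new StoredObject(1, "catalog"),
                new StoredObject(2, "page text v1"),  // stale, unreferenced
                new StoredObject(2, "page text v2")); // latest revision
    }

    /** XREF table of the latest revision: object number -> offset. */
    static Map<Integer, Integer> latestXref() {
        Map<Integer, Integer> xref = new LinkedHashMap<Integer, Integer>();
        xref.put(1, 0);
        xref.put(2, 2);
        return xref;
    }

    static List<String> sequentialDemo() {
        return scanSequentially(editedFile());
    }

    static List<String> xrefDemo() {
        return loadViaXref(editedFile(), latestXref());
    }

    /** Returns true if a wrong offset makes the XREF-driven load fail. */
    static boolean failsOnBrokenXref() {
        Map<Integer, Integer> broken = latestXref();
        broken.put(2, 0); // offset 0 actually holds object 1
        try {
            loadViaXref(editedFile(), broken);
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + sequentialDemo());
        System.out.println("xref:       " + xrefDemo());
        System.out.println("broken xref fails: " + failsOnBrokenXref());
    }
}
```

In this model the sequential scan also returns the stale "page text v1",
whereas the XREF-driven load returns only the latest revision and rejects the
file once an entry points at the wrong offset.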
> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
> Key: TIKA-1201
> URL: https://issues.apache.org/jira/browse/TIKA-1201
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.4
> Environment: all
> Reporter: Hong-Thai Nguyen
> Assignee: Tim Allison
> Priority: Critical
> Fix For: 1.5
>
> Attachments: TIKA-1201.patch
>
>
> As discussed, we can improve PDF extraction by 45% with this new
> NonSequentialPDFParser and conform better to the PDF specification. This
> parser will be integrated by default in pdfbox 2.0.
> ref.:
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or add a parameter to the current PDFParser, to call:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}