[jira] [Commented] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

Timo Boehme (JIRA) Tue, 03 Dec 2013 01:23:21 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837482#comment-13837482
 ]


Timo Boehme commented on TIKA-1201:
-----------------------------------

Hi,
I would just like to clarify what can be expected from using the 
NonSequentialPDFParser. PDFBOX-1104 was only a starting point; the parser as it 
is implemented now can be found in issue PDFBOX-1199. While in principle the 
parser could be faster at extracting single pages, it currently parses the 
whole document because the other classes working on the parser output expect 
all objects to be available (on-demand parsing might be available in version 
2). Thus it is faster only if the document contains unused objects (e.g. after 
the document was edited), which the 'old' parser still analyzes.
However, the real advantage of this parser is that it conforms much more 
closely to the PDF specification and has no problems with unused content in PDF 
files (where the 'old' one often failed).
Differences in behavior or results compared to the 'old' parser may arise if
- the document contains unused content (the 'old' parser may interpret/use it)
- the document is not valid PDF (the 'new' parser needs correct XREF table 
entries, while the 'old' one finds the objects during parsing)
- the document was edited (the 'new' parser should correctly parse the latest 
version; the 'old' parser may not in every case)
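
To make the first two cases concrete, here is a minimal toy model in Java. It 
is only a sketch and not PDFBox code: the class XrefToyModel, its StoredObject 
type and both parse methods are hypothetical names invented for illustration, 
and the "file" is just a list of objects plus an id-to-position map standing in 
for the XREF table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every name here is hypothetical, none of this is PDFBox code.
class XrefToyModel {

    /** One object as it is physically stored in the file body. */
    static final class StoredObject {
        final int id;
        final String content;

        StoredObject(int id, String content) {
            this.id = id;
            this.content = content;
        }
    }

    /** 'Old' style: walk the body front to back and keep every object found. */
    static Map<Integer, String> parseSequentially(List<StoredObject> body) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (StoredObject o : body) {
            result.put(o.id, o.content); // in this toy, a later copy replaces an earlier one
        }
        return result;
    }

    /** 'New' style: load only what the xref table (id -> body index) points at. */
    static Map<Integer, String> parseViaXref(List<StoredObject> body,
                                             Map<Integer, Integer> xref) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : xref.entrySet()) {
            int index = e.getValue();
            // A wrong entry cannot be repaired here, so the toy parser gives up.
            if (index < 0 || index >= body.size() || body.get(index).id != e.getKey()) {
                throw new IllegalStateException("bad xref entry for object " + e.getKey());
            }
            result.put(e.getKey(), body.get(index).content);
        }
        return result;
    }

    public static void main(String[] args) {
        List<StoredObject> body = new ArrayList<>();
        body.add(new StoredObject(1, "catalog"));
        body.add(new StoredObject(2, "page"));
        body.add(new StoredObject(3, "orphaned image")); // unused: no xref entry

        Map<Integer, Integer> xref = new LinkedHashMap<>();
        xref.put(1, 0);
        xref.put(2, 1);

        System.out.println(parseSequentially(body).keySet()); // [1, 2, 3]
        System.out.println(parseViaXref(body, xref).keySet()); // [1, 2]
    }
}
```

In this model the sequential scan also returns the orphaned object 3, while 
the xref-driven parse skips it. Conversely, an xref entry pointing at the wrong 
position makes parseViaXref throw, whereas parseSequentially still finds the 
object.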

Thus it is highly recommended to use the 'new' parser, not because of its 
speed but because of its much better parsing capabilities.

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
>                 Key: TIKA-1201
>                 URL: https://issues.apache.org/jira/browse/TIKA-1201
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>         Environment: all
>            Reporter: Hong-Thai Nguyen
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: TIKA-1201.patch
>
>
> As discussed, we can improve PDF extraction by 45% with this new 
> NonSequentialPDFParser and conform more closely to the PDF specification. 
> This parser will be integrated by default in pdfbox 2.0.
> ref.: 
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or add a parameter to the current 
> PDFParser, to call:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)
