[ 
https://issues.apache.org/jira/browse/PDFBOX-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836591#comment-13836591
 ] 

Maruan Sahyoun commented on PDFBOX-1787:
----------------------------------------

The NonSequentialPDFParser follows the PDF spec more closely, e.g. by first 
looking for the Xrefs and then processing the file accordingly. The old parser 
processes a PDF sequentially, which is not in line with the spec. So for all 
apps that do not depend on the false assumptions of the old parser, the newer 
parser produces better results. 
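
To illustrate what "first looking for the Xrefs" means: a PDF normally ends 
with a "startxref" keyword followed by the byte offset of the last xref 
section, so a reader can begin at the end of the file. The sketch below is a 
toy illustration in plain Java, not code from PDFBox; the class name 
StartXrefDemo and the method findStartXref are made up for this example, and 
the sample bytes are not a valid PDF.

```java
import java.nio.charset.StandardCharsets;

/** Toy illustration only; this is not PDFBox code. */
class StartXrefDemo {

    /**
     * Returns the number that follows the last "startxref" keyword,
     * or -1 if the keyword or the number after it is missing.
     */
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int pos = keyword + "startxref".length();
        // Skip the whitespace between the keyword and the offset.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        return pos > start ? Long.parseLong(text.substring(start, pos)) : -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the "xref" keyword starts at byte offset 23.
        String sample = "%PDF-1.4\n...objects...\nxref\n0 1\n0000000000 65535 f \n"
                + "trailer\n<< /Size 1 >>\nstartxref\n23\n%%EOF\n";
        System.out.println(findStartXref(sample.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A real parser would read only the tail of the file and would also have to 
cope with xref streams and damaged offsets; this sketch ignores all of that.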

Does it always produce the same extraction results? Probably not, as the nonSeq 
parser ignores objects which are no longer referenced. In addition, parsing a 
file with incremental updates might produce different results because nonSeq 
handles the updates correctly.
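
A rough way to picture the difference is the toy model below. It is plain 
Java and does not reflect the internals of either PDFBox parser; the class 
ParserOrderDemo, the Obj type and the sample data are all invented for this 
sketch. A naive sequential scan returns the text of every object it meets, 
while an xref-driven walk uses the newest entry per object number and only 
follows what is reachable from the root.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; none of these types exist in PDFBox. */
class ParserOrderDemo {

    /** One indirect object: its number, some text, and the objects it references. */
    static final class Obj {
        final int number;
        final String text;
        final List<Integer> refs;

        Obj(int number, String text, Integer... refs) {
            this.number = number;
            this.text = text;
            this.refs = Arrays.asList(refs);
        }
    }

    /** Naive sequential scan: every object in file order, stale or not. */
    static List<String> sequentialText(List<Obj> file) {
        List<String> out = new ArrayList<>();
        for (Obj o : file) {
            if (!o.text.isEmpty()) out.add(o.text);
        }
        return out;
    }

    /** Xref-driven walk: newest entry per object number, reachable objects only. */
    static List<String> xrefText(List<Obj> file, List<Map<Integer, Integer>> sections, int root) {
        Map<Integer, Integer> merged = new HashMap<>();
        for (Map<Integer, Integer> section : sections) {
            merged.putAll(section); // later sections override earlier ones
        }
        List<String> out = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        collect(root, file, merged, seen, out);
        return out;
    }

    private static void collect(int number, List<Obj> file, Map<Integer, Integer> xref,
                                Set<Integer> seen, List<String> out) {
        Integer position = xref.get(number);
        if (position == null || !seen.add(number)) return;
        Obj o = file.get(position);
        if (!o.text.isEmpty()) out.add(o.text);
        for (int ref : o.refs) collect(ref, file, xref, seen, out);
    }

    /** Hypothetical file: positions 0-2 are the original, 3-4 an incremental update. */
    static List<Obj> sampleFile() {
        return Arrays.asList(
                new Obj(1, "", 2, 3),          // original root, references 2 and 3
                new Obj(2, "v1 text"),
                new Obj(3, "removed note"),
                new Obj(2, "v2 text"),         // update replaces object 2
                new Obj(1, "", 2));            // updated root no longer references 3
    }

    /** Xref sections for sampleFile(), oldest first: object number -> position. */
    static List<Map<Integer, Integer>> sampleXref() {
        Map<Integer, Integer> original = new HashMap<>();
        original.put(1, 0);
        original.put(2, 1);
        original.put(3, 2);
        Map<Integer, Integer> update = new HashMap<>();
        update.put(2, 3);
        update.put(1, 4);
        return Arrays.asList(original, update);
    }

    public static void main(String[] args) {
        System.out.println(sequentialText(sampleFile()));             // [v1 text, removed note, v2 text]
        System.out.println(xrefText(sampleFile(), sampleXref(), 1));  // [v2 text]
    }
}
```

In this model the sequential scan still reports the superseded text and the 
orphaned note, which is the kind of difference in extraction results to 
expect when switching parsers.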

The NonSequentialPDFParser was developed to provide better stability and 
processing in line with the spec. It is an addition rather than the new 
default because it was introduced in a minor release, because we wanted a 
testing phase, and because of some missing capabilities, which should not be 
relevant to your type of application.

One of the current ideas for PDFBox 2.0 (no defined release date as of today!) 
is to change the default parser. http://pdfbox.apache.org/ideas.html

BR
Maruan

> pdfbox hangs on a corrupt PDF file
> ----------------------------------
>
>                 Key: PDFBOX-1787
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1787
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>         Environment: windows
>            Reporter: Hong-Thai Nguyen
>         Attachments: corrupt_file.pdf
>
>
> pdfbox hangs on command line on attached file.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
