[
https://issues.apache.org/jira/browse/PDFBOX-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836591#comment-13836591
]
Maruan Sahyoun commented on PDFBOX-1787:
----------------------------------------
The NonSequentialPDFParser follows the PDF spec more closely, e.g. by first
looking for the Xrefs and then processing the file accordingly. The old parser
processes a PDF sequentially, which is not in line with the spec. So for all apps
that do not depend on the false assumptions of the old parser, the newer
parser produces better results.
Does it always produce the same extraction results? Probably not, as the nonSeq
parser ignores objects which are no longer referenced. In addition, parsing a
file with incremental updates might produce different results because nonSeq
handles the updates correctly.
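To illustrate what "looking for the Xrefs first" means, here is a minimal sketch in plain Java. It is not the PDFBox implementation; the class and method names (StartXrefLocator, findLastStartXref) are made up for this example. It only relies on the rule from the PDF spec that a file ends with a "startxref" keyword followed by the byte offset of the most recent cross-reference section, and that every incremental update appends a new one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
final class StartXrefLocator {

    private static final byte[] KEYWORD =
            "startxref".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Scans backwards from the end of the data and returns the number that
     * follows the LAST "startxref" keyword, or -1 if none is found.
     * With incremental updates, the last keyword belongs to the newest update.
     */
    static long findLastStartXref(byte[] pdf) {
        for (int i = pdf.length - KEYWORD.length; i >= 0; i--) {
            if (!matchesAt(pdf, i)) {
                continue;
            }
            int pos = i + KEYWORD.length;
            // Skip the whitespace / end-of-line bytes after the keyword.
            while (pos < pdf.length && isPdfWhitespace(pdf[pos])) {
                pos++;
            }
            // Read the decimal byte offset of the xref section.
            long offset = 0;
            boolean sawDigit = false;
            while (pos < pdf.length && pdf[pos] >= '0' && pdf[pos] <= '9') {
                offset = offset * 10 + (pdf[pos] - '0');
                sawDigit = true;
                pos++;
            }
            if (sawDigit) {
                return offset;
            }
            // Keyword without a number: keep searching further back.
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int start) {
        for (int j = 0; j < KEYWORD.length; j++) {
            if (data[start + j] != KEYWORD[j]) {
                return false;
            }
        }
        return true;
    }

    // Whitespace characters as defined by the PDF spec.
    private static boolean isPdfWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }
}
```

For a file containing two "startxref" entries (the original one and one from an incremental update), this returns the offset from the update. A real parser would go on from there, reading the xref section at that offset and following its /Prev entries back to older sections, and would normally look only at the tail of the file instead of holding the whole file in memory. A sequential parser, in contrast, meets the older objects first.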
The NonSequentialPDFParser was developed to provide better stability and
processing in line with the spec. The reason it's an addition and not the new
standard is that it was introduced in a minor release, that we wanted to have
a testing phase, and that it still lacks some capabilities, which should not be
relevant to your type of application.
One of the current ideas for PDFBox 2.0 (no defined release date as of today!)
is to change the default parser. http://pdfbox.apache.org/ideas.html
BR
Maruan
> pdfbox hangs on a corrupt PDF file
> ----------------------------------
>
> Key: PDFBOX-1787
> URL: https://issues.apache.org/jira/browse/PDFBOX-1787
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Environment: windows
> Reporter: Hong-Thai Nguyen
> Attachments: corrupt_file.pdf
>
>
> pdfbox hangs on command line on attached file.