Re: Problem to parse a PDF document

Timo Boehme Thu, 14 Jun 2012 01:51:05 -0700

Dear Pierre Huttin,

On 14.06.2012 at 10:07, [email protected] wrote:

Many thanks, I have attached the file to the issue.


Thanks.

Now it works fine for this kind of document, but I see a side effect
on other documents that worked fine in the past.

I receive the following error message.

Caused by: java.io.IOException: Error: Expected an integer type,
actual='xref'
org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1541)
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:354)
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:266)
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)

If I use the PDDocument.load() method, I get this warning message:

14 juin 2012 09:58:30 org.apache.pdfbox.pdfparser.XrefTrailerResolver
setStartxref
ATTENTION: Did not found XRef object at specified startxref position
173

but the document is correctly loaded by PDFBox.

As I see it, the document is broken because the offset specified in
startxref does not point to the start of the xref section. Since
NonSequentialPDFParser currently has only a few options to recover from
parsing problems, it stops and throws an exception. With PDDocument.load
you use the standard PDFParser, which copes better with a corrupt xref
definition (it ignores it and detects the start of objects by itself)
but has other problems because in some cases it does not use the xref
definitions. Thus, to get the best of both, you should first use
PDDocument.loadNonSeq() and, if this fails with an exception, try again
(fallback) with PDDocument.load().
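
The fallback described above could be sketched roughly as follows. This
is only an illustration of the pattern using plain JDK types: the class
FallbackLoader, the interface Loader and the method loadWithFallback are
hypothetical names, not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;

/** Sketch of "strict parser first, lenient parser as fallback" (hypothetical helper). */
final class FallbackLoader {

    /** Hypothetical helper type: anything that turns a file into a document of type T. */
    interface Loader<T> {
        T load(File file) throws IOException;
    }

    /**
     * Tries the primary loader first. If it throws an IOException, the
     * fallback loader is tried. If the fallback fails too, its exception is
     * thrown with the primary failure attached as a suppressed exception.
     */
    static <T> T loadWithFallback(File file, Loader<T> primary, Loader<T> fallback)
            throws IOException {
        try {
            return primary.load(file);
        } catch (IOException primaryFailure) {
            try {
                return fallback.load(file);
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

With PDFBox on the classpath, the call might look like
FallbackLoader.loadWithFallback(file, f -> PDDocument.loadNonSeq(f, null),
f -> PDDocument.load(f)). This assumes the loadNonSeq(File, RandomAccess)
overload of the 1.7 line, so please check the Javadoc of your version.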

I have a problem with providing the sample file, because it contains
some confidential data.

It is quite clear to me that startxref is wrong. However, you could send
only the tail (which contains the 'startxref' keyword and the following
lines) and the first 220 bytes of the file (according to the exception,
xref is supposed to start at 173). With this information, which
shouldn't contain any confidential data, I could verify the diagnosis.
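
One possible way to cut out just those two pieces, and to check locally
whether the startxref value points at an "xref" keyword, is sketched
below. It uses only the JDK; the class PdfHeadTail and its method names
are hypothetical, and the check only covers a classic xref table (an
xref stream would start with an object header instead).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Sketch: extract head and tail of a PDF and inspect startxref (hypothetical helper). */
final class PdfHeadTail {

    /** Reads up to length bytes starting at offset, clamped to the file bounds. */
    static byte[] readRange(RandomAccessFile raf, long offset, int length) throws IOException {
        long start = Math.max(0, Math.min(offset, raf.length()));
        int n = (int) Math.min(length, raf.length() - start);
        byte[] buf = new byte[n];
        raf.seek(start);
        raf.readFully(buf);
        return buf;
    }

    /** First length bytes of the file (fewer if the file is shorter). */
    static byte[] head(String path, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            return readRange(raf, 0, length);
        }
    }

    /** Last length bytes of the file (fewer if the file is shorter). */
    static byte[] tail(String path, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            return readRange(raf, raf.length() - length, length);
        }
    }

    /** Number following the last "startxref" keyword in the tail, or -1 if absent. */
    static long startxrefOffset(byte[] tail) {
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int idx = text.lastIndexOf("startxref");
        if (idx < 0) {
            return -1;
        }
        int i = idx + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int j = i;
        while (j < text.length() && Character.isDigit(text.charAt(j))) {
            j++;
        }
        return j > i ? Long.parseLong(text.substring(i, j)) : -1;
    }

    /** True if the bytes at offset within head spell the keyword "xref". */
    static boolean startsWithXref(byte[] head, long offset) {
        byte[] keyword = "xref".getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset + keyword.length > head.length) {
            return false;
        }
        for (int k = 0; k < keyword.length; k++) {
            if (head[(int) offset + k] != keyword[k]) {
                return false;
            }
        }
        return true;
    }
}
```

For the file in question one would call head(path, 220) and, say,
tail(path, 64), then compare startxrefOffset(tail) with the position at
which "xref" really appears in the head bytes.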



Best regards,
Timo

On Thu, 14 Jun 2012 00:23:49 +0200, Timo Boehme
<[email protected]>  wrote:

On 13.06.2012 at 14:02, [email protected] wrote:

Sorry,

apparently the PDF was not attached correctly to the previous mail. I
have zipped it and re-attached it.

Pierre Huttin


With PDFBOX-1099 (https://issues.apache.org/jira/browse/PDFBOX-1099)
resolved, the page count is correct with both parsers
(NonSequentialPDFParser and PDFParser).

For testing purposes it would be helpful to have your example PDF
associated with PDFBOX-1099. Could you upload it to this issue (and
tick the 'Grant license to ASF for inclusion in ASF works (as per the
Apache License §5)' box), or give me permission to attach the file from
your previous email to the issue with the license grant?


Best regards,
Timo


On Wed, 13 Jun 2012 13:56:50 +0200, <[email protected]> wrote:

Hello,

I have trouble with some documents: the library is not able to
retrieve the number of pages and load them into the list using the
PDDocument.getDocumentCatalog().getAllPages() method.

The PDF file and the Java code to retrieve the number of pages are
attached to this mail. Apparently the PDFParser does not read the
/Pages object correctly; the page references are "8 0" and "19 0".

Both Adobe Reader and iText RUPS open the document correctly and
report the correct number of pages: 2.

I ran my code with version 1.7.0 of PDFBox.

Thanks in advance for your help.

Best regards

Pierre Huttin



--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 [email protected]

_____________________________________________________________________

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_____________________________________________________________________
