Dear Pierre Huttin,
On 14.06.2012 10:07, [email protected] wrote:
Many thanks, I have attached the file to the issue.
Thanks.
Now it works fine for this kind of document, but I see a side effect
on other documents that worked fine in the past.
I receive the following error message:
Caused by: java.io.IOException: Error: Expected an integer type, actual='xref'
org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1541)
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:354)
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:266)
org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)
If I use the PDDocument.load() method, I receive this warning message:
14 juin 2012 09:58:30 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
ATTENTION: Did not found XRef object at specified startxref position 173
but the document is correctly loaded by PDFBox.
As I see it, the document is broken because the offset specified in
startxref does not point to the start of the xref section. Since
NonSequentialPDFParser currently has only a few options to recover from
parsing problems, it stops and throws an exception. With
PDDocument.load() you use the standard PDFParser, which copes better
with a corrupt xref definition (it ignores it and detects the start of
objects by itself) but has other problems because it does not use the
xref definitions in some cases. Thus, to get the best of both, you
should first use PDDocument.loadNonSeq() and, if this fails with an
exception, fall back to PDDocument.load().
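
A minimal sketch of that fallback, written without any PDFBox dependency
so that it stands on its own. The class ParserFallback, the Loader
interface and the method loadWithFallback are hypothetical names
introduced for illustration only; they are not part of PDFBox.

```java
import java.io.IOException;

public class ParserFallback {

    /** A loading step that may fail with an IOException (hypothetical helper type). */
    public interface Loader<T> {
        T load() throws IOException;
    }

    /**
     * Tries the primary loader first. If it throws an IOException,
     * the fallback loader is tried instead. If both fail, the fallback's
     * exception is thrown with the primary failure attached as suppressed.
     */
    public static <T> T loadWithFallback(Loader<T> primary, Loader<T> fallback)
            throws IOException {
        try {
            return primary.load();
        } catch (IOException primaryFailure) {
            try {
                return fallback.load();
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(primaryFailure);
                throw fallbackFailure;
            }
        }
    }
}
```

With PDFBox 1.7 on the classpath, and assuming a null scratch file is
acceptable for your documents, the call could look like this:

    PDDocument doc = ParserFallback.loadWithFallback(
            () -> PDDocument.loadNonSeq(file, null),
            () -> PDDocument.load(file));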
I have a problem providing the sample file, because it contains some
confidential data.
It is quite clear to me that startxref is wrong. However, you could send
only the tail (which contains the 'startxref' and following lines) and
the first 220 bytes of the file (according to the warning message, xref
is supposed to start at 173). With this information, which shouldn't
contain any confidential data, I could verify the diagnosis.
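
The check described above can also be done locally. The following is
only a rough sketch using plain Java I/O, not PDFBox code: the class
StartxrefCheck and its methods are hypothetical names, the 1024-byte
tail window is an arbitrary assumption, and it only recognises a classic
xref table (a file whose startxref points to a cross-reference stream
object would be reported as a mismatch).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class StartxrefCheck {

    /** Reads up to 'length' bytes starting at 'offset'; fewer near the end of the file. */
    public static byte[] readRange(RandomAccessFile file, long offset, int length)
            throws IOException {
        long available = Math.max(0, file.length() - offset);
        byte[] buffer = new byte[(int) Math.min(length, available)];
        file.seek(offset);
        file.readFully(buffer);
        return buffer;
    }

    /** Returns the number following the last 'startxref' keyword in the tail, or -1. */
    public static long findStartxref(String tail) {
        int keyword = tail.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        String rest = tail.substring(keyword + "startxref".length()).trim();
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        return end == 0 ? -1 : Long.parseLong(rest.substring(0, end));
    }

    /** True if the keyword 'xref' starts exactly at the offset given by startxref. */
    public static boolean startxrefPointsToXref(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // Assumption: the startxref entry lies within the last 1024 bytes.
            long tailStart = Math.max(0, file.length() - 1024);
            String tail = new String(readRange(file, tailStart, 1024),
                    StandardCharsets.ISO_8859_1);
            long offset = findStartxref(tail);
            if (offset < 0) {
                return false;
            }
            String atOffset = new String(readRange(file, offset, 4),
                    StandardCharsets.ISO_8859_1);
            return atOffset.equals("xref");
        }
    }
}
```

For the file discussed here, readRange(file, 0, 220) would give the
first 220 bytes, and findStartxref on the tail would be expected to
return the 173 from the warning message.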
Best regards,
Timo
On Thu, 14 Jun 2012 00:23:49 +0200, Timo Boehme
<[email protected]> wrote:
On 13.06.2012 14:02, [email protected] wrote:
Sorry,
apparently the PDF was not correctly attached to the previous mail. I
have just zipped it and re-attached it.
Pierre Huttin
With PDFBOX-1099
(https://issues.apache.org/jira/browse/PDFBOX-1099) resolved, the page
count is correct with both parsers (NonSequentialPDFParser and PDFParser).
For testing purposes it would be helpful to have your example PDF
associated with PDFBOX-1099. Could you upload it to this issue (and
tick the 'Grant license to ASF for inclusion in ASF works (as per the
Apache License §5)' box), or give permission to do so with the file
attached to your previous email, including the license grant?
Best regards,
Timo
On Wed, 13 Jun 2012 13:56:50 +0200, <[email protected]> wrote:
Hello,
I have some trouble with documents for which the library is not able to
retrieve the number of pages and load them into the list using the
PDDocument.getDocumentCatalog().getAllPages() method.
The PDF file and the Java code to retrieve the number of pages are
attached to this mail. Apparently the PDFParser does not read the
/Pages object correctly; the page references are "8 0" and "19 0".
Adobe Reader and iText RUPS both open the document correctly and
report the correct number of pages: 2.
I ran my code using version 1.7.0 of PDFBox.
Thanks in advance for your help.
Best regards
Pierre Huttin
--
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
[email protected]