Thanks, that's very useful!

My only concern would be:

> According to the PDF spec, this should not be needed,
> but makes the parsing more robust, since there might be files, which
> store information in the table, but not in the stream.

We generally don't implement hypothetical aspects of non-spec PDFs without an 
example file.

-- John

> On 7 Aug 2014, at 02:44, Martin Tappler <[email protected]> wrote:
> 
> Hi,
> 
> I am looking at PDF files at COS level and I found that the current
> implementation of the non-sequential parser does not provide support for
> hybrid cross references (which was discussed before in this mailing list).
> 
> Look at the PDF file at [1] for instance. It contains a structure tree,
> which is hidden in the hybrid-reference file (such an example is also
> described in the PDF Reference, section 3.4.7, under
> "Compatibility with applications that do not support PDF 1.5"). The
> root of the structure tree is the object with object number 28 and
> generation number 0 and is contained in an object stream, which is only
> referenced in the cross reference stream, which is not parsed by the
> current implementation.
> 
> I used version 1.8.6 from the Maven repository and also the latest
> source version from the trunk to reproduce this behavior.
> 
> However, I came up with a fix which works for me and which should not
> break anything. After parsing the cross reference table and the trailer,
> the trailer should be checked for an "XrefStm" entry. If this entry is
> present, the stream at the given offset should be parsed using
> parseXrefObjStream, but with the offset of the cross reference table as
> argument (this is done to ensure that the resolving process works as
> expected). This replaces the recently parsed information (table and
> trailer) in the XrefTrailerResolver, which should be stored in temporary
> variables. After this is done, the information contained in the cross
> reference stream is updated with the old trailer and the cross reference
> table information. According to the PDF spec, this should not be needed,
> but makes the parsing more robust, since there might be files, which
> store information in the table, but not in the stream. So this ensures
> that no information is lost.
> 
> Please find patches for the fix attached. I hope they are useful.
> 
> Best regards,
> Martin Tappler
> 
> [1] http://bewerbung.fh-kaernten.at/fileadmin/Anleitung-PDF-erstellen.pdf
> <NonSequentialPDFParser.patch>
> <XrefTrailerResolver.patch>
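
For readers without the attached patches, the merge order described in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself and not PDFBox code: `HybridXrefSketch`, `XrefSection` and `mergeHybrid` are hypothetical names, object keys are plain "objNum genNum" strings, and a negative offset is used here as a made-up marker for "stored in object stream N". The sketch reads "updated with the old trailer and the cross reference table information" as meaning that table entries win on conflict.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, self-contained model of one hybrid-reference section.
// None of these types exist in PDFBox; they only illustrate the merge order.
final class HybridXrefSketch {

    /** One parsed cross-reference section: object offsets plus its trailer entries. */
    static final class XrefSection {
        // Key: "objNum genNum". Value: byte offset, or (assumption of this sketch)
        // a negative number -N meaning "compressed inside object stream N".
        final Map<String, Long> offsets = new LinkedHashMap<>();
        final Map<String, String> trailer = new LinkedHashMap<>();
    }

    /**
     * Start from the cross-reference stream named by the trailer's XRefStm entry,
     * then re-apply the table and its trailer that were parsed first and kept in
     * temporary variables, so nothing found only in the table is lost.
     */
    static XrefSection mergeHybrid(XrefSection table, XrefSection stream) {
        XrefSection merged = new XrefSection();
        merged.offsets.putAll(stream.offsets);   // stream entries first
        merged.trailer.putAll(stream.trailer);
        merged.offsets.putAll(table.offsets);    // table entries override on conflict
        merged.trailer.putAll(table.trailer);
        return merged;
    }

    public static void main(String[] args) {
        XrefSection table = new XrefSection();
        table.offsets.put("1 0", 15L);
        table.offsets.put("2 0", 120L);
        table.trailer.put("Root", "1 0 R");
        table.trailer.put("XRefStm", "900");

        XrefSection stream = new XrefSection();
        stream.offsets.put("2 0", 999L);   // conflicting entry, table should win
        stream.offsets.put("28 0", -30L);  // structure tree root, inside object stream 30
        stream.offsets.put("30 0", 700L);  // the object stream itself
        stream.trailer.put("Type", "XRef");

        XrefSection merged = mergeHybrid(table, stream);
        System.out.println(merged.offsets);
        System.out.println(merged.trailer);
    }
}
```

With this ordering, an object such as 28 0 that appears only in the stream becomes visible, while anything listed in the table keeps its table entry. Whether the last step is needed for real files is exactly the point questioned in the reply above.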
