Thanks, that's very useful! My only concern would be:
> According to the PDF spec, this should not be needed,
> but makes the parsing more robust, since there might be files, which
> store information in the table, but not in the stream.

We generally don't implement hypothetical aspects of non-spec PDFs without an example file.

-- John

> On 7 Aug 2014, at 02:44, Martin Tappler <[email protected]> wrote:
>
> Hi,
>
> I am looking at PDF files at COS level and I found that the current
> implementation of the non-sequential parser does not provide support for
> hybrid cross references (which was discussed before in this mailing list).
>
> Look at the PDF file at [1] for instance. It contains a structure tree,
> which is hidden in the hybrid-reference file (actually such an example
> is also described in the PDF reference section 3.4.7. under
> "Compatibility with applications that do not support PDF 1.5."). The
> root of the structure tree is the object with object number 28 and
> generation number 0 and is contained in an object stream, which is only
> referenced in the cross reference stream, which is not parsed by the
> current implementation.
>
> I used version 1.8.6. from the maven repository and also the latest
> source version from the trunk to reproduce this behavior.
>
> However, I came up with a fix which works for me and which should not
> break anything. After parsing the cross reference table and the trailer,
> the trailer should be checked for an "XrefStm" entry. If this entry is
> present, the stream at the given offset should be parsed using
> parseXrefObjStream, but with the offset of the cross reference table as
> argument (this is done to ensure that the resolving process works as
> expected). This replaces the recently parsed information (table and
> trailer) in the XrefTrailerResolver, which should be stored in temporary
> variables. After this is done, the information contained in the cross
> reference stream is updated with the old trailer and the cross reference
> table information. According to the PDF spec, this should not be needed,
> but makes the parsing more robust, since there might be files, which
> store information in the table, but not in the stream. So this ensures
> that no information is lost.
>
> Please find patches for the fix attached. I hope they are useful.
>
> Best regards,
> Martin Tappler
>
> [1] http://bewerbung.fh-kaernten.at/fileadmin/Anleitung-PDF-erstellen.pdf
> <NonSequentialPDFParser.patch>
> <XrefTrailerResolver.patch>
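
For anyone following the thread, the merge order described in the quoted proposal could be sketched roughly as below. This is only a toy model under stated assumptions, not PDFBox code: the class name HybridXrefSketch and the method merge are made up for illustration, generation numbers and trailer dictionaries are ignored, and the offsets are invented. It shows only the ordering: the cross reference stream entries are loaded first, and the cross reference table entries are then applied on top of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model; none of these names exist in PDFBox.
class HybridXrefSketch {

    /**
     * Combines the two sources of a hybrid-reference section.
     * Keys are object numbers, values are byte offsets (both simplified).
     * Stream entries are taken first; table entries then overwrite or add
     * to them, following the order described in the proposal.
     */
    static Map<Integer, Long> merge(Map<Integer, Long> streamEntries,
                                    Map<Integer, Long> tableEntries) {
        Map<Integer, Long> merged = new LinkedHashMap<>(streamEntries);
        merged.putAll(tableEntries);
        return merged;
    }

    public static void main(String[] args) {
        // Invented offsets: object 28 is only known to the stream.
        Map<Integer, Long> streamEntries = new LinkedHashMap<>();
        streamEntries.put(28, 4000L);
        streamEntries.put(5, 900L);

        // Object 5 also has a table entry, which wins in this sketch.
        Map<Integer, Long> tableEntries = new LinkedHashMap<>();
        tableEntries.put(5, 1200L);

        System.out.println(merge(streamEntries, tableEntries));
    }
}
```

With this ordering, an object that appears only in the stream (such as object 28 in the example file) stays resolvable, while anything recorded only in the table is not lost either. Whether table entries should really take precedence over stream entries is exactly the point that would need an example file to settle.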
