On 08/07/2014 06:57 PM, John Hewson wrote:
> Thanks, that's very useful!
Great!
> 
> My only concern would be:
> 
>> According to the PDF spec, this should not be needed,
>> but makes the parsing more robust, since there might be files, which
>> store information in the table, but not in the stream.
> 
> We generally don't implement hypothetical aspects of non-spec PDFs without an 
> example file, 

Using my test files, I just checked whether it is really unnecessary, i.e.
whether parsing without adding the cross reference table information leads
to the same result. It is actually necessary, see for instance the PDF file
at [1]. Since it seems to be from Adobe, I have most probably
misinterpreted the PDF spec, sorry. The PDF file is linearized and
contains an XrefStm entry in the first trailer. The corresponding
cross reference stream contains only references to objects inside an
object stream, while the catalog reference is contained only in the
cross reference table.
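
To illustrate the merge order I mean, here is a minimal sketch. It is not
the attached patch and not PDFBox API: the class and method names, the
object numbers and the offsets are all made up for this mail. It also
assumes that a negative value marks an object stored inside an object
stream, and that the table entry wins if both sources list the same object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox code: combines the entries of a
// cross reference stream with those of the cross reference table.
public class HybridXrefSketch {

    // Start from the stream entries, then add the table entries on top,
    // so that nothing stored only in the table (e.g. the catalog) is lost.
    // Assumption: on a conflict the table entry replaces the stream entry.
    static Map<Integer, Long> merge(Map<Integer, Long> streamEntries,
                                    Map<Integer, Long> tableEntries) {
        Map<Integer, Long> merged = new HashMap<>(streamEntries);
        merged.putAll(tableEntries);
        return merged;
    }

    public static void main(String[] args) {
        // Made-up example: object 5 lives in object stream 9
        // (assumed convention: negative value = object stream number).
        Map<Integer, Long> streamEntries = new HashMap<>();
        streamEntries.put(5, -9L);

        // Made-up example: object 1 (the catalog) at byte offset 15,
        // listed only in the cross reference table.
        Map<Integer, Long> tableEntries = new HashMap<>();
        tableEntries.put(1, 15L);

        Map<Integer, Long> merged = merge(streamEntries, tableEntries);
        System.out.println("object 1 -> " + merged.get(1));
        System.out.println("object 5 -> " + merged.get(5));
    }
}
```

With only the stream entries, object 1 could not be resolved in this
example; after the merge both objects can be looked up.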

Best regards,
Martin

[1]
http://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf

> 
> -- John
> 
>> On 7 Aug 2014, at 02:44, Martin Tappler <[email protected]> wrote:
>>
>> Hi,
>>
>> I am looking at PDF files at COS level and I found that the current
>> implementation of the non-sequential parser does not provide support for
>> hybrid cross references (which was discussed before in this mailing list).
>>
>> Look at the PDF file at [1] for instance. It contains a structure tree,
>> which is hidden in the hybrid-reference file (actually such an example
>> is also described in the PDF reference section 3.4.7. under
>> "Compatibility with applications that do not support PDF 1.5."). The
>> root of the structure tree is the object with object number 28 and
>> generation number 0 and is contained in an object stream, which is only
>> referenced in the cross reference stream, which is not parsed by the
>> current implementation.
>>
>> I used version 1.8.6 from the Maven repository and also the latest
>> source version from the trunk to reproduce this behavior.
>>
>> However, I came up with a fix which works for me and which should not
>> break anything. After parsing the cross reference table and the trailer,
>> the trailer should be checked for an "XrefStm" entry. If this entry is
>> present, the stream at the given offset should be parsed using
>> parseXrefObjStream, but with the offset of the cross reference table as
>> argument (this is done to ensure that the resolving process works as
>> expected). This replaces the recently parsed information (table and
>> trailer) in the XrefTrailerResolver, which should be stored in temporary
>> variables. After this is done, the information contained in the cross
>> reference stream is updated with the old trailer and the cross reference
>> table information. According to the PDF spec, this should not be needed,
>> but makes the parsing more robust, since there might be files, which
>> store information in the table, but not in the stream. So this ensures
>> that no information is lost.
>>
>> Please find patches for the fix attached. I hope they are useful.
>>
>> Best regards,
>> Martin Tappler
>>
>> [1] http://bewerbung.fh-kaernten.at/fileadmin/Anleitung-PDF-erstellen.pdf
>> <NonSequentialPDFParser.patch>
>> <XrefTrailerResolver.patch>
