Re: Support for hybrid-references (also discussed in: PDFBox and implementation, state reg. PDF Spec)

Andreas Lehmkuehler Tue, 12 Aug 2014 10:23:25 -0700

Hi,

On 08.08.2014 08:57, Martin Tappler wrote:

On 08/07/2014 06:57 PM, John Hewson wrote:

Thanks, that's very useful!

Great!


My only concern would be:

According to the PDF spec, this should not be needed,
but it makes the parsing more robust, since there might be files which
store information in the table but not in the stream.


We generally don't implement hypothetical aspects of non-spec PDFs without an 
example file,


Using my test files, I just checked if it is really unnecessary, i.e. if
parsing without adding cross reference table information leads to the
same result. It is actually necessary, see for instance the PDF file at
[1]. Since it seems to be from Adobe, I most probably have
misinterpreted the PDF spec, sorry. The PDF file is linearized and
contains an XRefStm entry in the first trailer, and the corresponding
cross reference stream contains only references to objects inside an
object stream, while the catalog reference is only contained in the
cross reference table.

You were right in the first place. The cross reference stream may contain some
data which is already present in the xref table, but there may also be data
which is only available through the stream, e.g. the location of objects
within object streams.
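
A rough sketch of why the stream can carry more than the table. It assumes the three entry types of a cross reference stream as laid out in the PDF specification; the class and method names are made up for illustration and are not PDFBox code.

```java
// Hypothetical sketch: illustrative names, not PDFBox classes.
final class XrefStreamEntrySketch {

    // Describes one cross reference stream entry from its three fields.
    static String describe(int type, long field2, int field3) {
        switch (type) {
            case 0:
                // Free entry: field2 is the next free object number.
                return "free, next free object " + field2;
            case 1:
                // Plain object: the same data an xref table row can carry.
                return "byte offset " + field2 + ", generation " + field3;
            case 2:
                // Compressed object: an xref table row cannot express this.
                return "object stream " + field2 + ", index " + field3;
            default:
                // Other types are simply not handled in this sketch.
                throw new IllegalArgumentException("unhandled type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(1, 1024L, 0)); // made-up values
        System.out.println(describe(2, 12L, 3));   // made-up values
    }
}
```

Only entries of type 2 point into an object stream, which is presumably why such objects cannot be found through the table alone.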

I've added support for hybrid-references and we'll see if there is some room
for improvement regarding the redundant reading of object ids.

@Martin: Thanks for the sample and the input.

Best regards,
Martin


BR
Andreas Lehmkühler


[1]
http://partners.adobe.com/public/developer/en/xml/AdobeXMLFormsSamples.pdf


-- John

On 7 Aug 2014, at 02:44, Martin Tappler <[email protected]> wrote:

Hi,

I am looking at PDF files at COS level and I found that the current
implementation of the non-sequential parser does not provide support for
hybrid cross references (which was discussed before in this mailing list).

Look at the PDF file at [1] for instance. It contains a structure tree,
which is hidden in the hybrid-reference file (actually such an example
is also described in the PDF reference, section 3.4.7, under
"Compatibility with applications that do not support PDF 1.5."). The
root of the structure tree is the object with object number 28 and
generation number 0 and is contained in an object stream, which is only
referenced in the cross reference stream, which is not parsed by the
current implementation.

I used version 1.8.6 from the Maven repository and also the latest
source version from the trunk to reproduce this behavior.

However, I came up with a fix which works for me and which should not
break anything. After parsing the cross reference table and the trailer,
the trailer should be checked for an "XRefStm" entry. If this entry is
present, the stream at the given offset should be parsed using
parseXrefObjStream, but with the offset of the cross reference table as
argument (this is done to ensure that the resolving process works as
expected). This replaces the recently parsed information (table and
trailer) in the XrefTrailerResolver, so that information should first be
saved in temporary variables. After this is done, the information
contained in the cross reference stream is updated with the old trailer
and the cross reference table information. According to the PDF spec,
this should not be needed, but it makes the parsing more robust, since
there might be files which store information in the table but not in the
stream. So this ensures that no information is lost.
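
The merge step described above could be sketched roughly as follows. This is a simplified illustration, not the attached patch: the class and method names are hypothetical, generation numbers and free entries are left out, and each object is reduced to a single location value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: illustrative names, not PDFBox classes.
final class HybridXrefSketch {

    // Combines the entries of one hybrid cross reference section.
    // Keys are object numbers; values stand for the object's location.
    static Map<Integer, Long> mergeSection(Map<Integer, Long> streamEntries,
                                           Map<Integer, Long> tableEntries) {
        // Start from what the stream named by XRefStm provides ...
        Map<Integer, Long> merged = new HashMap<>(streamEntries);
        // ... then apply the saved table entries on top, so that objects
        // listed only in the table are not lost.
        merged.putAll(tableEntries);
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, Long> stream = new HashMap<>();
        stream.put(28, 4711L); // made-up location
        Map<Integer, Long> table = new HashMap<>();
        table.put(1, 15L);     // made-up location
        System.out.println(mergeSection(stream, table));
    }
}
```

Applying the table entries last mirrors the order given in the description; a different precedence would only matter for objects that appear in both places with different values.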

Please find patches for the fix attached. I hope they are useful.

Best regards,
Martin Tappler

[1] http://bewerbung.fh-kaernten.at/fileadmin/Anleitung-PDF-erstellen.pdf
<NonSequentialPDFParser.patch>
<XrefTrailerResolver.patch>
