Hi,

I am looking at PDF files at the COS level and found that the current
implementation of the non-sequential parser does not support hybrid
cross references (which were discussed before on this mailing list).

Look at the PDF file at [1], for instance. It contains a structure tree
that is hidden in the hybrid-reference file (such an example is also
described in the PDF reference, section 3.4.7, under "Compatibility with
applications that do not support PDF 1.5"). The root of the structure
tree is object 28, generation 0. It is contained in an object stream
that is referenced only from the cross reference stream, and the current
implementation does not parse that stream.

I reproduced this behavior with version 1.8.6 from the Maven repository
and with the latest source from trunk.

However, I came up with a fix that works for me and should not break
anything. After parsing the cross reference table and the trailer, the
trailer is checked for an "XRefStm" entry. If this entry is present, the
stream at the given offset is parsed using parseXrefObjStream, but with
the offset of the cross reference table as argument (this ensures that
the resolving process works as expected). Parsing the stream replaces
the just-parsed table and trailer in the XrefTrailerResolver, so both
are saved in temporary variables beforehand. Afterwards, the old trailer
and the cross reference table entries are merged back into the
information from the cross reference stream. According to the PDF spec
this should not be needed, but it makes the parsing more robust, since
there might be files that store information in the table but not in the
stream. This ensures that no information is lost.
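
To make the merge order concrete, here is a minimal standalone sketch.
It uses plain Java maps instead of the PDFBox classes, and the class
name, method name, keys and offsets are all made up for illustration.
It only mirrors the putAll step of the patch: the stream entries are
taken first and the table entries are copied over them, so the table
wins if both describe the same object.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration; none of these names are PDFBox API.
class HybridXrefMergeSketch {

    /**
     * Combines the entries of a cross reference table with those of the
     * cross reference stream named by the trailer's XRefStm entry.
     * Stream entries are taken first, then table entries are copied over
     * them, so the table wins whenever both describe the same object.
     */
    static Map<String, Long> merge(Map<String, Long> tableEntries,
                                   Map<String, Long> streamEntries) {
        Map<String, Long> merged = new HashMap<>(streamEntries);
        merged.putAll(tableEntries);
        return merged;
    }

    public static void main(String[] args) {
        // Keys are "objectNumber generation"; values are made-up positions.
        Map<String, Long> table = new HashMap<>();
        table.put("1 0", 17L);
        table.put("2 0", 230L);

        Map<String, Long> stream = new HashMap<>();
        stream.put("2 0", 999L);   // also present in the table
        stream.put("28 0", 4096L); // only reachable through the stream

        Map<String, Long> merged = merge(table, stream);
        System.out.println("28 0 -> " + merged.get("28 0"));
        System.out.println("2 0 -> " + merged.get("2 0"));
    }
}
```

In this sketch an object such as "28 0", which appears only in the
stream, survives the merge, while an object listed in both places keeps
the value from the table.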

Please find patches for the fix attached. I hope they are useful.

Best regards,
Martin Tappler

[1] http://bewerbung.fh-kaernten.at/fileadmin/Anleitung-PDF-erstellen.pdf
--- PDFBox_source/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/NonSequentialPDFParser.java	2014-08-06 09:42:16.667300644 +0200
+++ PDFAnalysis/src/main/java/org/apache/pdfbox/pdfparser/NonSequentialPDFParser.java	2014-08-07 11:02:23.592823474 +0200
@@ -341,6 +341,7 @@
             // -- parse xref
             if (pdfSource.peek() == X)
             {
+                long startOfXref = prev;
                 // xref table and trailer
                 // use existing parser to parse xref table
                 parseXrefTable(prev);
@@ -363,6 +364,16 @@
                             + pdfSource.getOffset());
                 }
                 COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
+                Map<COSObjectKey, Long> currentXrefs = xrefTrailerResolver.getCurrentXrefTable();
+                if(trailer.containsKey("XRefStm")){
+                    int streamOffset = trailer.getInt("XRefStm");
+                    setPdfSource(streamOffset);
+                    skipSpaces();
+                    parseXrefObjStream(startOfXref); 
+                    // we want to remember information from the trailer
+                    xrefTrailerResolver.getCurrentTrailer().addAll(trailer);
+                    xrefTrailerResolver.getCurrentXrefTable().putAll(currentXrefs);
+                }
                 prev = trailer.getInt(COSName.PREV);
                 if (isLenient && prev > -1)
                 {
--- PDFBox_source/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/XrefTrailerResolver.java	2014-07-25 09:19:20.596046163 +0200
+++ PDFAnalysis/src/main/java/org/apache/pdfbox/pdfparser/XrefTrailerResolver.java	2014-08-07 11:06:21.390002645 +0200
@@ -211,6 +211,16 @@
     }
 
     /**
+     * Returns the current xref table.
+     * 
+     * @return the current xref table.
+     * 
+     */
+    public Map<COSObjectKey, Long> getCurrentXrefTable()
+    {
+        return curXrefTrailerObj.xrefTable; 
+    }
+    /**
      * Sets the byte position of the first XRef
      * (has to be called after very last startxref was read).
      * This is used to resolve chain of active XRef/trailer.
@@ -234,6 +244,7 @@
 
         resolvedXrefTrailer = new XrefTrailerObj();
         resolvedXrefTrailer.trailer = new COSDictionary();
+        resolvedXrefTrailer.xrefType = curXrefTrailerObj.xrefType;
 
         XrefTrailerObj curObj = bytePosToXrefMap.get( startxrefBytePosValue );
         List<Long>  xrefSeqBytePos = new ArrayList<Long>();
