[jira] [Commented] (PDFBOX-1014) Unused XRef object streams cause parser to fail + FIX

Timo Boehme (JIRA) Wed, 18 May 2011 08:58:32 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035460#comment-13035460
 ]


Timo Boehme commented on PDFBOX-1014:
-------------------------------------

The provided fix needs an addition: the list of objects returned by 
getObjectsByType is not ordered by object id; its order is arbitrary (it 
depends on hashing). Therefore we have to sort the xrefs by object id 
(assuming that objects with higher numbers follow objects with lower numbers 
in the file).

Thus the fix is now:

    public void parseXrefStreams() throws IOException
    {
        COSDictionary trailerDict = new COSDictionary();
        
        // tb: use only last XRef and XRefs which are referenced by a used XRef via 'Prev'
        // we assume that 'Prev' will reference next preceding xref object
        // (otherwise we would have to use object byte positions)
        List<COSObject> xrefStreams  = getObjectsByType( "XRef" );
        
        // sort objects by number / this is only a workaround assuming that
        // objects with higher number follow objects with lower number;
        // (again we would need byte positions here) 
        Collections.sort( xrefStreams, new Comparator<COSObject>() {
            @Override public int compare( COSObject o1, COSObject o2 ) {
                int cmp = o1.getObjectNumber().intValue() - o2.getObjectNumber().intValue();
                return ( cmp == 0 )
                    ? o1.getGenerationNumber().intValue() - o2.getGenerationNumber().intValue()
                    : cmp;
            }
        });
        
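        int firstXRefIdx = xrefStreams.size() - 1;
        while ( firstXRefIdx > 0 ) {
            COSStream stream = (COSStream)xrefStreams.get( firstXRefIdx ).getObject();
            if ( stream.getInt( COSName.PREV, -1 ) == -1 ) {
                // no 'Prev' key; current xref object will be first one we use
                break;
            }
            // 'Prev' key present: step back to the preceding xref object
            // (without this decrement the loop would never terminate)
            firstXRefIdx--;
        }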
        
        for ( int xrefIdx = firstXRefIdx, len = xrefStreams.size(); xrefIdx < len; xrefIdx++ )
        {
            COSStream stream = (COSStream)xrefStreams.get( xrefIdx ).getObject();
            trailerDict.addAll(stream);
            PDFXrefStreamParser parser =
                new PDFXrefStreamParser(stream, this, forceParsing);
            parser.parse();
        }
        setTrailer( trailerDict );
    }
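
To make the selection logic easier to check without PDFBox on the classpath, 
here is a minimal standalone sketch of the same idea. XrefStub and selectUsed 
are hypothetical names introduced only for this illustration (they are not 
PDFBox classes), and hasPrev merely stands in for "the stream dictionary has a 
'Prev' key". It keeps the same assumption as the fix above: object number 
order reflects file order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class XrefSelectionSketch {

    /** Hypothetical stand-in for an XRef stream object; not a PDFBox class. */
    static final class XrefStub {
        final int objectNumber;
        final int generation;
        final boolean hasPrev; // true if the stream dictionary has a 'Prev' key

        XrefStub(int objectNumber, int generation, boolean hasPrev) {
            this.objectNumber = objectNumber;
            this.generation = generation;
            this.hasPrev = hasPrev;
        }
    }

    /** Returns the object numbers of the XRef streams that would be parsed, in parse order. */
    static List<Integer> selectUsed(List<XrefStub> unordered) {
        List<XrefStub> sorted = new ArrayList<>(unordered);
        // Sort by object number, then generation; comparingInt avoids the
        // overflow risk of comparing by subtraction.
        sorted.sort(Comparator.comparingInt((XrefStub x) -> x.objectNumber)
                              .thenComparingInt(x -> x.generation));

        List<Integer> used = new ArrayList<>();
        if (sorted.isEmpty()) {
            return used;
        }

        // Walk back from the last stream for as long as a 'Prev' key is present.
        int first = sorted.size() - 1;
        while (first > 0 && sorted.get(first).hasPrev) {
            first--;
        }

        for (XrefStub x : sorted.subList(first, sorted.size())) {
            used.add(x.objectNumber);
        }
        return used;
    }
}
```

For example, with streams numbered 40 (has 'Prev'), 12 (no 'Prev') and 25 
(no 'Prev') handed over in that order, selectUsed returns [25, 40]; stream 12 
is treated as unused.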


> Unused XRef object streams cause parser to fail + FIX
> -----------------------------------------------------
>
>                 Key: PDFBOX-1014
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1014
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>
> I have a PDF document with 3 XRef streams (no xref table; PDF version 1.6). 
> Currently PDFBox reads and parses all 3 streams in the order they appear and 
> combines the data in a dictionary (thus attributes specified in a later XRef 
> stream overwrite attributes in earlier streams). The problem with my document 
> is that the first 2 XRef streams declare document encryption while the last 
> one does not. Furthermore, the last one uses another document id, thus trying 
> to decrypt the document would fail because of the different IDs (however, 
> the parsing of the stream in the first XRef object already fails).
> The solution I came up with is to first get all XRef streams, then start at 
> the last one, check whether it contains a 'Prev' key, and go up the list as 
> long as we find this key. This should work in most cases assuming that multiple 
> active XRef sections appear in order without an unused XRef section in 
> between. A really correct solution would have to test for object byte 
> positions (therefore it would be necessary to store byte positions for each 
> object). 
> The fix in COSDocument.parseXrefStreams():
>     public void parseXrefStreams() throws IOException
>     {
>         COSDictionary trailerDict = new COSDictionary();
>         
>         // use only last XRef and XRef which are referenced by a used XRef via 'Prev'
>         // we assume that 'Prev' will reference next preceding xref object
>         // (otherwise we would have to use object byte positions)
>         List<COSObject> xrefStreams  = getObjectsByType( "XRef" );
>         int             firstXRefIdx = xrefStreams.size() - 1;
>         while ( firstXRefIdx > 0 ) {
>             COSStream stream = (COSStream)xrefStreams.get( firstXRefIdx ).getObject();
>             if ( stream.getInt( COSName.PREV, -1 ) == -1 ) {
>                 // no 'Prev' key; current xref object will be first one we use
>                 break;
>             }
>             // 'Prev' key present: step back to the preceding xref object
>             firstXRefIdx--;
>         }
>         
> //        for( COSObject xrefStream : getObjectsByType( "XRef" ) )
>         for ( int xrefIdx = firstXRefIdx, len = xrefStreams.size(); xrefIdx < len; xrefIdx++ )
>         {
>             COSStream stream = (COSStream)xrefStreams.get( xrefIdx ).getObject();
>             trailerDict.addAll(stream);
>             PDFXrefStreamParser parser =
>                 new PDFXrefStreamParser(stream, this, forceParsing);
>             parser.parse();
>         }
>         setTrailer( trailerDict );
>     }
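
The "really correct" variant mentioned in the description, which follows 
byte positions instead of relying on object numbers, could look roughly like 
the sketch below. It rests on the fact that in a PDF the 'Prev' entry holds 
the byte offset of the previous cross-reference section and that startxref 
points at the last one. Everything else is an assumption for illustration: 
PrevChainSketch, usedOffsets and the prevByOffset map are hypothetical, and 
per the description above byte positions would first have to be stored for 
each object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PrevChainSketch {

    /**
     * Hypothetical helper, not part of PDFBox.
     *
     * @param prevByOffset maps the byte offset of each XRef stream to the value
     *                     of its 'Prev' entry, or to -1 if it has no 'Prev' key
     * @param startOffset  byte offset of the last XRef stream (the startxref value)
     * @return byte offsets of the XRef streams actually in use, oldest first
     */
    static List<Long> usedOffsets(Map<Long, Long> prevByOffset, long startOffset) {
        List<Long> chain = new ArrayList<>();
        Set<Long> seen = new HashSet<>(); // guards against cyclic 'Prev' chains
        long offset = startOffset;

        // Follow 'Prev' from the newest XRef stream back to the oldest one.
        while (offset >= 0 && prevByOffset.containsKey(offset) && seen.add(offset)) {
            chain.add(offset);
            offset = prevByOffset.get(offset);
        }

        // Parse order is oldest first so that newer entries overwrite older ones.
        Collections.reverse(chain);
        return chain;
    }
}
```

XRef streams whose offsets are never reached through the chain (such as an 
orphaned stream sitting between two used ones) are simply never visited, so 
no ordering assumption on object numbers is needed.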

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
