[jira] Commented: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

David Hedley (JIRA) Thu, 17 Jun 2010 01:49:58 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879709#action_12879709
 ]


David Hedley commented on PDFBOX-720:
-------------------------------------


To fix it properly is probably going to be a big job - I would suggest that the 
whole parsing module needs to be rewritten in accordance with the PDF spec, 
i.e. start from the end of the PDF file, read the startxref offset to locate 
the last cross-reference section, and proceed from there to build up the xref 
tables etc.
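
As a rough illustration of the first step of that approach, here is a minimal 
sketch that scans backwards from the end of the file for the "startxref" 
keyword and parses the byte offset that follows it. It assumes the whole file 
is already in a byte array; the class and method names are hypothetical and 
are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: finds the offset that the last
// "startxref" keyword in a PDF file points to.
final class StartXrefLocator
{
    private static final byte[] KEYWORD =
        "startxref".getBytes(StandardCharsets.ISO_8859_1);

    // Returns the decimal offset following the last "startxref" keyword,
    // or -1 if the keyword or the number after it cannot be found.
    static long findStartXref(byte[] pdf)
    {
        // Search from the end so that the most recent trailer wins
        for (int i = pdf.length - KEYWORD.length; i >= 0; i--)
        {
            if (!matchesAt(pdf, i))
            {
                continue;
            }
            int pos = i + KEYWORD.length;
            // Skip the whitespace between the keyword and the number
            while (pos < pdf.length && isWhitespace(pdf[pos]))
            {
                pos++;
            }
            long offset = 0;
            boolean sawDigit = false;
            while (pos < pdf.length && pdf[pos] >= '0' && pdf[pos] <= '9')
            {
                offset = offset * 10 + (pdf[pos] - '0');
                sawDigit = true;
                pos++;
            }
            return sawDigit ? offset : -1;
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int start)
    {
        for (int k = 0; k < KEYWORD.length; k++)
        {
            if (data[start + k] != KEYWORD[k])
            {
                return false;
            }
        }
        return true;
    }

    private static boolean isWhitespace(byte b)
    {
        return b == ' ' || b == '\r' || b == '\n' || b == '\t' || b == '\f' || b == 0;
    }
}
```

For a file like the one attached to this issue, where two PDFs appear to have 
been concatenated, this picks the offset from the final trailer rather than 
the first one encountered.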

However you can do a quick fix based on the assumption that the "correct" xref 
table to use will generally be towards the end of the file:

1) Change objectPool in COSDocument to be a LinkedHashMap
2) Change parseXrefStreams to the following:
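    public void parseXrefStreams() throws IOException
    {
        COSDictionary trailerDict = new COSDictionary();
        COSObject lastObject = null;
        // objectPool is a LinkedHashMap (step 1), so this iterates in insertion order
        for( COSObject xrefStream : getObjectsByType( "XRef" ) )
        {
            lastObject = xrefStream;
            COSStream stream = (COSStream)xrefStream.getObject();
            PDFXrefStreamParser parser = new PDFXrefStreamParser(stream, this);
            parser.parse();
        }
        // Take the trailer entries from the last xref stream only;
        // lastObject stays null if the document has no xref streams
        if( lastObject != null )
        {
            trailerDict.addAll((COSStream)lastObject.getObject());
        }
        setTrailer( trailerDict );
    }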
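
To illustrate why step 1 matters, here is a small standalone sketch (not 
PDFBox code; the class name, keys and values are made up for illustration). 
A LinkedHashMap iterates in insertion order, so "the last value seen" is 
well defined, whereas a plain HashMap makes no guarantee about iteration order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo, not part of PDFBox
final class InsertionOrderDemo
{
    // Returns the value that iteration visits last
    static <K, V> V lastValue(Map<K, V> pool)
    {
        V last = null;
        for (V value : pool.values())
        {
            last = value;
        }
        return last;
    }

    // Builds a pool in the order the objects would be met while reading a file.
    // The keys stand in for "object number, generation" pairs.
    static Map<String, String> buildPool()
    {
        Map<String, String> pool = new LinkedHashMap<String, String>();
        pool.put("1700 0", "xref stream of the first part");
        pool.put("12 0", "some other object");
        pool.put("45 0", "xref stream at the end of the file");
        return pool;
    }
}
```

One caveat: re-inserting a key that is already present in a LinkedHashMap 
does not move it to the end, so this only helps when the xref streams are 
stored under distinct keys.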

This will, at least, consistently choose the last xref table (which should in 
general be the right one). However, it then exposes another problem, this 
time in PDFXrefStreamParser. 
With some simple debugging output added, it appears that 
PDFXrefStreamParser.parse() is reading garbage in the while loop (lines 104 
onwards). This has the knock-on effect that objects which have been replaced 
are not handled properly.
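
For comparison, here is a sketch of how the entries of a cross-reference 
stream are laid out according to the PDF spec: each entry consists of three 
big-endian fields whose byte widths come from the /W array, and a zero-width 
first field means the default type 1. This is not the PDFBox implementation; 
the class and method names are hypothetical, the data is assumed to be already 
decoded (filters applied), and /Index handling is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder, not part of PDFBox
final class XrefEntryDecoder
{
    // Decodes every complete entry in the decoded stream data.
    // w holds the three field widths from the /W array.
    // Each result is { type, field2, field3 }.
    static List<long[]> decode(byte[] data, int[] w)
    {
        List<long[]> entries = new ArrayList<long[]>();
        int entrySize = w[0] + w[1] + w[2];
        if (entrySize <= 0)
        {
            return entries;
        }
        for (int pos = 0; pos + entrySize <= data.length; pos += entrySize)
        {
            int p = pos;
            // A zero-width type field means the entry is of type 1
            long type = (w[0] == 0) ? 1 : readField(data, p, w[0]);
            p += w[0];
            long second = readField(data, p, w[1]);
            p += w[1];
            long third = readField(data, p, w[2]);
            entries.add(new long[] { type, second, third });
        }
        return entries;
    }

    // Reads an unsigned big-endian integer of the given width (0 yields 0)
    static long readField(byte[] data, int offset, int width)
    {
        long value = 0;
        for (int i = 0; i < width; i++)
        {
            value = (value << 8) | (data[offset + i] & 0xFF);
        }
        return value;
    }
}
```

With /W [1 2 1], the bytes 01 01 00 00 decode to a type 1 entry (object at 
byte offset 256, generation 0), and 02 00 05 03 to a type 2 entry (index 3 in 
object stream 5). Reading with the wrong widths or from the wrong position 
would produce the kind of garbage values described above.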

I will continue to investigate when I have the time.


> Inconsistency in parsing PDFs between Windows and Linux
> -------------------------------------------------------
>
>                 Key: PDFBOX-720
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-720
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: Windows Vista 32-bit, Sun JDK 1.5.0_06, PDFBox HEAD tag 
> (revision 941073)
> vs.
> Red Hat Linux, 2.6.9-67.ELsmp kernel, Java 1.5.0_06, PDFBox HEAD tag 
> (revision 941073)
>            Reporter: Adam Nichols
>             Fix For: 1.2.0
>
>         Attachments: 238_Page_Report.pdf
>
>
> Run this same code using the same PDF and you'll get different results on 
> Linux than on Windows.  Regardless of which one you consider "correct", it 
> should be consistent.
> doc = PDDocument.load(inputFile);
> PDDocumentOutline outline = doc.getDocumentCatalog().getDocumentOutline();
> if(outline == null)
>     System.out.println("Document outline was null");
> else
>     System.out.println("Document outline was not null");
> Some interesting notes about this PDF: Seems that Acrobat Distiller 8.1.0 
> basically just concatenated two PDFs into one.  There are two trailers, they 
> both refer to object "1600 0" as the root.  1600 0 appears multiple times, 
> one time it doesn't have "Outlines" in the dictionary, the other time it has 
> "Outlines 1667 0".  Windows picks up the latter and shows the outline 
> correctly.  Linux picks up the former and thus returns null for the outline.  
> I tried debugging through PDFParser and BaseParser, but I'm not really sure 
> how that code works and I quickly got lost.

