[ 
https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878913#action_12878913
 ] 

David Hedley commented on PDFBOX-720:
-------------------------------------


Having just looked at the code for handling cross-reference streams I'm not 
surprised it's failing at random - the logic in PDFParser for handling these 
objects is somewhat broken.

Rather than following the PDF spec, it simply merges together all the XRef 
objects in the PDF file (of which there will be several for files which have 
been incrementally updated). The ordering of the merging is dependent on the 
implementation of HashMap on the host system, so I'm not surprised you're 
seeing different results on Linux and Windows.
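
For illustration, here is a rough sketch of the kind of resolution the spec 
describes: start at the section named by startxref, follow the /Prev links 
back through the older sections, and let the first (newest) entry seen for 
each object number win. This is not PDFBox code. The class XrefChainSketch, 
the XrefSection type and the resolve method are hypothetical names, and the 
model ignores generation numbers, free entries and hybrid-reference files.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class XrefChainSketch {

    /** Hypothetical model of one cross-reference section (table or stream). */
    public static final class XrefSection {
        final Map<Integer, Long> offsets; // object number -> byte offset of the object
        final Long prev;                  // offset of the previous section (/Prev), or null

        public XrefSection(Map<Integer, Long> offsets, Long prev) {
            this.offsets = offsets;
            this.prev = prev;
        }
    }

    /**
     * Walks the chain from the newest section (the one startxref points at)
     * to the oldest. An entry is recorded only if no newer section has
     * already supplied one, so the result does not depend on map iteration order.
     */
    public static Map<Integer, Long> resolve(Map<Long, XrefSection> sectionsByOffset,
                                             long startXref) {
        Map<Integer, Long> resolved = new HashMap<>();
        Set<Long> visited = new HashSet<>(); // guards against a cyclic /Prev chain
        Long current = startXref;
        while (current != null && visited.add(current)) {
            XrefSection section = sectionsByOffset.get(current);
            if (section == null) {
                break; // dangling /Prev: stop rather than guess
            }
            for (Map.Entry<Integer, Long> entry : section.offsets.entrySet()) {
                resolved.putIfAbsent(entry.getKey(), entry.getValue());
            }
            current = section.prev;
        }
        return resolved;
    }
}
```

With two sections that both list object 1600, as in the attached file, a 
walk like this always returns the offset from the newer section, whatever 
order the sections happen to be stored in.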


> Inconsistency in parsing PDFs between Windows and Linux
> -------------------------------------------------------
>
>                 Key: PDFBOX-720
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-720
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: Windows Vista 32-bit, Sun JDK 1.5.0_06, PDFBox HEAD tag 
> (revision 941073)
> vs.
> Red Hat Linux, 2.6.9-67.ELsmp kernel, Java 1.5.0_06, PDFBox HEAD tag 
> (revision 941073)
>            Reporter: Adam Nichols
>             Fix For: 1.2.0
>
>         Attachments: 238_Page_Report.pdf
>
>
> Run this same code using the same PDF and you'll get different results on 
> Linux than on Windows.  Regardless of which one you consider "correct", it 
> should be consistent.
>
> PDDocument doc = PDDocument.load(inputFile);
> PDDocumentOutline outline = doc.getDocumentCatalog().getDocumentOutline();
> if (outline == null)
>     System.out.println("Document outline was null");
> else
>     System.out.println("Document outline was not null");
>
> Some interesting notes about this PDF: Seems that Acrobat Distiller 8.1.0 
> basically just concatenated two PDFs into one.  There are two trailers, they 
> both refer to object "1600 0" as the root.  1600 0 appears multiple times, 
> one time it doesn't have "Outlines" in the dictionary, the other time it has 
> "Outlines 1667 0".  Windows picks up the latter and shows the outline 
> correctly.  Linux picks up the former and thus returns null for the outline.  
> I tried debugging through PDFParser and BaseParser, but I'm not really sure 
> how that code works and I quickly got lost.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.