[ 
https://issues.apache.org/jira/browse/PDFBOX-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052539#comment-13052539
 ] 

Thomas Chojecki commented on PDFBOX-1016:
-----------------------------------------

---- PDF document with 84 objects ----
HEADER
Object
XRefStream (2) [66-84] -> point to XRefStream (1)
startxref -> point to offset 0
%%EOF 
Objects
XRefStream (1) [0-65] 
TRAILER
startxref -> point to XRefStream (2)
%%EOF

Right order: jump & read 2 -> jump & read 1 -> parse 1 -> parse 2
Your order:  read & parse 2 -> read & parse 1

This document was signed 3 times (incrementally), with 100 objects.

So the document now looks like this:

HEADER
Object
XRefStream (2) [66-84] -> point to XRefStream (1)
startxref -> point to offset 0
%%EOF 
Objects
XRefStream (1) [0-65] 
TRAILER
startxref -> point to XRefStream (2)
%%EOF
Objects
XRefStream (3) -> point to (2)
TRAILER
startxref -> point to XRefStream (3)
%%EOF
Objects
XRefStream (4) -> point to (3)
TRAILER
startxref -> point to XRefStream (4)
%%EOF
Objects
XRefStream (5) -> point to (4)
TRAILER
startxref -> point to XRefStream (5)
%%EOF

Conforming order: jump & read 5 -> jump & read 4 -> jump & read 3 ->
  jump & read 2 -> jump & read 1 -> parse 1 -> parse 2 -> parse 3 ->
  parse 4 -> parse 5
Your order: read & parse 2 -> read & parse 1 -> read & parse 3 ->
  read & parse 4 -> read & parse 5

The first two xref sections do not conflict (they cover the disjoint object
ranges 0-65 and 66-84), so your order is not conforming but will work without
problems.
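
To make the conforming order concrete, here is a minimal sketch in Java. It is
not the code from the attached patch: the class XrefChainSketch, the Section
type and the use of -1 for "no Prev" are all hypothetical names and
conventions chosen for illustration. It assumes every xref section has
already been read and is keyed by its byte offset. It then follows the Prev
links starting from the last startxref and applies the sections oldest first,
so that newer entries override older ones.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustration only: hypothetical types, not part of PDFBox. */
final class XrefChainSketch {

    /** One xref section as found in the file. */
    static final class Section {
        final long offset;                 // byte offset of this section
        final long prev;                   // offset of the previous section, -1 if none (assumption)
        final Map<Integer, Long> entries;  // object number -> byte offset

        Section(long offset, long prev, Map<Integer, Long> entries) {
            this.offset = offset;
            this.prev = prev;
            this.entries = new HashMap<>(entries);
        }
    }

    /**
     * "jump & read": walk from the last startxref along the Prev links.
     * Returns the section offsets in the order they must be parsed,
     * oldest first.
     */
    static List<Long> applyOrder(Map<Long, Section> sectionsByOffset, long startxref) {
        Deque<Long> chain = new ArrayDeque<>();
        Set<Long> seen = new HashSet<>();  // guards against a cyclic Prev chain
        long current = startxref;
        while (current >= 0
                && sectionsByOffset.containsKey(current)
                && seen.add(current)) {
            chain.push(current);           // push to the front: the newest ends up last
            current = sectionsByOffset.get(current).prev;
        }
        return new ArrayList<>(chain);
    }

    /** "parse": apply the sections oldest first so newer entries win. */
    static Map<Integer, Long> resolve(Map<Long, Section> sectionsByOffset, long startxref) {
        Map<Integer, Long> table = new HashMap<>();
        for (long offset : applyOrder(sectionsByOffset, startxref)) {
            table.putAll(sectionsByOffset.get(offset).entries);
        }
        return table;
    }
}
```

For the signed document above, calling applyOrder with the offset of
XRefStream (5) would yield the offsets of sections 1, 2, 3, 4, 5 in that
order, regardless of where each section physically sits in the file.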

> Specification conform xref/trailer parsing + Fix
> ------------------------------------------------
>
>                 Key: PDFBOX-1016
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1016
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>         Attachments: COSDocument.diff, PDFParser.diff, 
> PDFXrefStreamParser.diff, XrefTrailerResolver.java, XrefTrailerResolver.java
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> PDFBOX currently reads xref table/trailer and XRef objects without using 
> startxref or 'Prev' information which results in applying not active data 
> resulting in using wrong objects or resulting in parsing exceptions because 
> old trailer settings do not apply anymore. This happens especially with 
> updated PDF documents where changes are simply appended  and old objects/xref 
> entries remain but are not referenced. My last patch (PDFBOX-1014) tried to 
> solve this for a specific case but it was based on assumptions which do not 
> hold in every case.
> The specification compliant way is to read the last startxref which points to 
> the last xref object which itself may reference further xref objects using 
> 'Prev' attribute.
> I have written a fix which works the standard way and can fall back to the 
> old behavior in case startxref is wrong or missing. The fix tries to be as 
> unobtrusive as possible. A new class (o.a.p.pdfparser.XrefTrailerResolver) is 
> filled with all xref table/trailer and XRef object data. After document is 
> parsed (and last startxref is read) this class creates xref table and trailer 
> using startxref and 'Prev' information. Beside this new class there are small 
> changes to PDFParser and COSDocument.
> This bugfix/improvement should bring PDFBOX a good step closer to be PDF 
> specification conform - especially as long as the new specification conform 
> parser project is not finished.
> This bugfix supersedes the fix from PDFBOX-1014.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
