Specification conform xref/trailer parsing + Fix
------------------------------------------------

                 Key: PDFBOX-1016
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1016
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.6.0
            Reporter: Timo Boehme


PDFBox currently reads xref tables/trailers and XRef objects without using the 
startxref or 'Prev' information. As a result, data that is no longer active is 
applied, which leads either to wrong objects being used or to parsing 
exceptions because old trailer settings no longer apply. This happens 
especially with updated PDF documents, where changes are simply appended and 
old objects/xref entries remain in the file but are no longer referenced. My 
last patch (PDFBOX-1014) tried to solve this for a specific case, but it was 
based on assumptions that do not hold in every case.

The specification-compliant way is to read the last startxref, which points to 
the most recent xref object (an xref table or an XRef stream). That object may 
in turn reference earlier xref objects through its 'Prev' entry.
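
As a rough illustration of the first step, the following sketch locates the 
last startxref keyword in a file held in memory and parses the offset that 
follows it. The class and method names (StartXrefLocator, findLastStartXref) 
are hypothetical and are not part of PDFBox or of the attached patch; a real 
parser would scan only the tail of the file instead of decoding all of it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not PDFBox API.
class StartXrefLocator {
    private static final String KEYWORD = "startxref";

    /**
     * Returns the byte offset written after the last "startxref" keyword,
     * or -1 if the keyword is missing or not followed by a number.
     */
    static long findLastStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so indices match byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int idx = text.lastIndexOf(KEYWORD);
        if (idx < 0) {
            return -1; // no startxref at all: caller should fall back
        }
        int pos = idx + KEYWORD.length();
        // Skip the whitespace/EOL between the keyword and the number.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        if (start == pos) {
            return -1; // keyword present but no digits after it
        }
        try {
            return Long.parseLong(text.substring(start, pos));
        } catch (NumberFormatException e) {
            return -1; // digit run too long to be a valid offset
        }
    }
}
```

A return value of -1 corresponds to the "startxref is wrong or missing" case, 
in which a caller would have to use some other strategy.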

I have written a fix that works the standard way and can fall back to the old 
behavior if startxref is wrong or missing. The fix tries to be as unobtrusive 
as possible. A new class (o.a.p.pdfparser.XrefTrailerResolver) is filled with 
the data of every xref table/trailer and XRef object encountered. After the 
document has been parsed (and the last startxref has been read), this class 
builds the effective xref table and trailer using the startxref and 'Prev' 
information. Besides this new class, there are small changes to PDFParser and 
COSDocument.
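
The resolution step described above can be sketched roughly as follows. This 
is a simplified illustration under stated assumptions, not the code of 
XrefTrailerResolver: the class XrefChainSketch, its nested types and all 
method names are hypothetical, trailer values are reduced to strings, and 
merging older trailer entries under newer ones is a simplification chosen for 
the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and structure are assumptions, not PDFBox API.
class XrefChainSketch {

    /** One xref table/trailer pair or XRef stream found at some byte offset. */
    static final class Section {
        final Map<Integer, Long> objectOffsets = new HashMap<>(); // object number -> byte offset
        final Map<String, String> trailer = new HashMap<>();      // trailer entries, simplified
        long prev = -1;                                           // 'Prev' offset, -1 if absent
    }

    /** Effective xref table and trailer after resolution. */
    static final class Result {
        final Map<Integer, Long> objectOffsets = new HashMap<>();
        final Map<String, String> trailer = new HashMap<>();
    }

    private final Map<Long, Section> sectionsByOffset = new HashMap<>();

    /** Called during parsing for every section encountered, active or not. */
    void addSection(long offset, Section section) {
        sectionsByOffset.put(offset, section);
    }

    /**
     * Follows the 'Prev' chain starting at startXref and merges the sections
     * from oldest to newest. Returns null if startXref does not point at a
     * known section, so the caller can fall back to another strategy.
     */
    Result resolve(long startXref) {
        Deque<Section> oldestFirst = new ArrayDeque<>();
        Set<Long> visited = new HashSet<>();
        long offset = startXref;
        while (offset >= 0) {
            Section section = sectionsByOffset.get(offset);
            // Stop on a dangling offset or on a cycle in the 'Prev' chain.
            if (section == null || !visited.add(offset)) {
                break;
            }
            oldestFirst.push(section); // push() adds at the head, so older sections end up first
            offset = section.prev;
        }
        if (oldestFirst.isEmpty()) {
            return null;
        }
        Result result = new Result();
        for (Section section : oldestFirst) {
            // Newer sections are applied later and override older entries.
            result.objectOffsets.putAll(section.objectOffsets);
            result.trailer.putAll(section.trailer);
        }
        return result;
    }
}
```

Sections that are never reached from startxref through 'Prev' are simply 
ignored in this sketch, which is the effect the fix aims for with stale xref 
data in updated documents.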

This bugfix/improvement should bring PDFBox a good step closer to conforming 
to the PDF specification, which matters especially as long as the new 
specification-conformant parser project is not finished.

This bugfix supersedes the fix from PDFBOX-1014.

