[ 
https://issues.apache.org/jira/browse/PDFBOX-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512219#comment-17512219
 ] 

Andreas Lehmkühler commented on PDFBOX-5400:
--------------------------------------------

[~Schmidor] As you already wrote, the "1 7" at the beginning of the first 
version of the xref table leads to an off-by-one, as "0 6" has to be used 
instead. This results in the following object numbers:
{code}
xref                                    (at 672274)
1 7
0000000000 65535 f                     1 free
0000000009 00000 n                      2
0000671785 00000 n                      3 (old)
0000671882 00000 n                      4 
0000672069 00000 n                      5 (old)
0000672127 00000 n                      6 (old)
0000672178 00000 n                      7 (old)
trailer
{code}
The objects 3 and 5-7 are overwritten by the new version of the xref table, so 
that the objects 2 and 4 are still broken. Before PDFBOX-5238, PDFBox detected 
that glitch and triggered a rebuild of the xref table using a brute-force 
search.
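
For illustration, here is a minimal Java sketch of how the start number in the subsection header decides which object number each entry is filed under. The class and method names are hypothetical and this is not PDFBox code; the offsets are the ones from the listing above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class XrefNumberingSketch {
    // Hypothetical helper: entry i of a subsection is filed under
    // object number (firstObjectNumber + i).
    static Map<Integer, Long> number(int firstObjectNumber, long[] offsets) {
        Map<Integer, Long> table = new LinkedHashMap<>();
        for (int i = 0; i < offsets.length; i++) {
            table.put(firstObjectNumber + i, offsets[i]);
        }
        return table;
    }

    public static void main(String[] args) {
        // Offsets of the seven entries in the listing; the first one is the free entry.
        long[] offsets = {0L, 9L, 671785L, 671882L, 672069L, 672127L, 672178L};
        System.out.println("start 1: " + number(1, offsets));
        System.out.println("start 0: " + number(0, offsets));
    }
}
```

With a start number of 1 the entry at offset 9 is filed under object 2; with a start number of 0 the same entry is filed under object 1.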

The recent changes repair the broken 2 0 reference by replacing it with the 1 0 
object. The broken 4 0 reference points to the old 3 0 object and is simply 
removed as the new repair mechanism assumes the 3 0 object is due to a valid 
reference. But the broken 4 0 key points to the old 3 0 obj and the 3 0 key 
within the xref table is the new repaired one. Those are not related and the 
repair attempt is invalid. In the end only the objects 1, 3 and 5-7 are left 
in the xref table. Object 4 is missing and leads to the exception.
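
The outcome described above can be reproduced with a toy model. It is a sketch only: the class, the method and the "re-key or drop" rule are assumptions reconstructed from this comment, not the PDFBox implementation.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class XrefRepairSketch {
    /**
     * validKeys:  object numbers whose xref entries are taken as correct.
     * brokenKeys: xref key -> object number actually found at the entry's offset.
     * Returns the object numbers left in the table after the (assumed) repair.
     */
    static SortedSet<Integer> repair(Set<Integer> validKeys, Map<Integer, Integer> brokenKeys) {
        SortedSet<Integer> result = new TreeSet<>(validKeys);
        for (Map.Entry<Integer, Integer> entry : brokenKeys.entrySet()) {
            int found = entry.getValue();
            if (!result.contains(found)) {
                // Re-key: file the entry under the number found in the file.
                result.add(found);
            }
            // Otherwise a key with that number already exists and the broken
            // entry is dropped, even if the existing key is unrelated.
        }
        return result;
    }

    public static void main(String[] args) {
        // Keys 3 and 5-7 come from the new version of the xref table.
        Set<Integer> fromNewTable = Set.of(3, 5, 6, 7);
        // Broken keys 2 and 4 point at the objects 1 0 and (old) 3 0.
        Map<Integer, Integer> broken = new TreeMap<>(Map.of(2, 1, 4, 3));
        System.out.println(repair(fromNewTable, broken));
    }
}
```

In this model the broken key 2 becomes 1, the broken key 4 is dropped because a key 3 already exists, and the resulting set contains no key 4.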

To sum it up, in this case the repair attempt is incomplete due to wrong 
assumptions.


> Page tree root must be a dictionary
> -----------------------------------
>
>                 Key: PDFBOX-5400
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5400
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.26
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: regression
>         Attachments: 4ECBGZDM5GUZG7UT75RV5GTUFWF5TSXK.pdf
>
>
> worked in 2.0.25
> {noformat}
> Caused by: java.io.IOException: Page tree root must be a dictionary
>     org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
>     org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>     org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
