[
https://issues.apache.org/jira/browse/PDFBOX-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512219#comment-17512219
]
Andreas Lehmkühler commented on PDFBOX-5400:
--------------------------------------------
[~Schmidor] As you already wrote, the "1 7" at the beginning of the first
version of the xref table leads to an off-by-one error, as "0 6" has to be used
instead. With "1 7", the object numbers are as follows:
{code}
xref (at 672274)
1 7
0000000000 65535 f 1 free
0000000009 00000 n 2
0000671785 00000 n 3 (old)
0000671882 00000 n 4
0000672069 00000 n 5 (old)
0000672127 00000 n 6 (old)
0000672178 00000 n 7 (old)
trailer
{code}
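The effect of the starting number on the listing above can be sketched in a few lines of Java. This is only an illustration, not PDFBox code: the class XrefNumbering and the method assign are hypothetical names, and the only data used are the byte offsets from the listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: not PDFBox code. XrefNumbering and assign are hypothetical names.
class XrefNumbering {
    // Assigns consecutive object numbers to the entries of one xref subsection,
    // starting at the first object number taken from the subsection header.
    static Map<Integer, Long> assign(int firstObjectNumber, long[] offsets) {
        Map<Integer, Long> table = new LinkedHashMap<>();
        for (int i = 0; i < offsets.length; i++) {
            table.put(firstObjectNumber + i, offsets[i]);
        }
        return table;
    }

    public static void main(String[] args) {
        // Byte offsets copied from the xref listing above.
        long[] offsets = {0L, 9L, 671785L, 671882L, 672069L, 672127L, 672178L};
        System.out.println("start 1: " + assign(1, offsets));
        System.out.println("start 0: " + assign(0, offsets));
    }
}
```

Starting at 1 files the entry at offset 9 under object 2 and the entry at offset 671882 under object 4. Starting at 0 files the same entries under objects 1 and 3.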
Objects 3 and 5-7 are overwritten by the new version of the xref table, so
that objects 2 and 4 are still broken. Before PDFBOX-5238, PDFBox detected
that glitch and triggered a rebuild of the xref table using a brute-force
search.
The recent changes repair the broken 2 0 reference by replacing it with the 1 0
object. The broken 4 0 reference points to the old 3 0 object and is simply
removed as the new repair mechanism assumes the 3 0 object is due to a valid
reference. But the broken 4 0 key points to the old 3 0 obj and the 3 0 key
within the xref table is the new repaired one. Those are not related and the
repair attempt is invalid. In the end only the objects 1, 3 and 5-7 are left
in the xref table. Object 4 is missing and leads to the exception.
To sum it up, in this case the repair attempt is incomplete due to wrong
assumptions.
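The outcome described above can be modelled on the object numbers alone. The following is a deliberately simplified sketch and not PDFBox's actual repair code: RepairModel and repair are hypothetical names, and the rule "a broken key n is treated as object n - 1" is an assumption derived from the description in this comment.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of the outcome described above; not PDFBox's actual repair code.
class RepairModel {
    // merged: object numbers present after both xref versions were read.
    // broken: object numbers whose entries are off by one.
    static Set<Integer> repair(Set<Integer> merged, Set<Integer> broken) {
        Set<Integer> result = new TreeSet<>(merged);
        for (int key : broken) {
            result.remove(key);
            // Assumption: the broken key n is treated as object n - 1. If n - 1
            // already exists, the add is a no-op and the broken entry is just dropped.
            result.add(key - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> merged = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
        Set<Integer> broken = new TreeSet<>(Arrays.asList(2, 4));
        System.out.println(repair(merged, broken));
    }
}
```

In this model, keys 1, 3 and 5-7 remain and key 4 is gone, even though the existing 3 0 entry comes from the new xref table and is unrelated to the old object that the broken 4 0 key pointed to.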
> Page tree root must be a dictionary
> -----------------------------------
>
> Key: PDFBOX-5400
> URL: https://issues.apache.org/jira/browse/PDFBOX-5400
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.26
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: regression
> Attachments: 4ECBGZDM5GUZG7UT75RV5GTUFWF5TSXK.pdf
>
>
> worked in 2.0.25
> {noformat}
> Caused by: java.io.IOException: Page tree root must be a dictionary
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> {noformat}