[
https://issues.apache.org/jira/browse/PDFBOX-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512219#comment-17512219
]
Andreas Lehmkühler edited comment on PDFBOX-5400 at 3/25/22, 7:17 AM:
----------------------------------------------------------------------
[~Schmidor] As you already wrote, the "1 7" at the beginning of the first
version of the xref table leads to an off-by-one, as "0 6" has to be used
instead. The object numbers read from that table are therefore as follows:
{code}
xref (at 672274)
1 7
0000000000 65535 f 1 free
0000000009 00000 n 2
0000671785 00000 n 3 (old)
0000671882 00000 n 4
0000672069 00000 n 5 (old)
0000672127 00000 n 6 (old)
0000672178 00000 n 7 (old)
trailer
{code}
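To illustrate the off-by-one, here is a minimal sketch of how the first number of an xref subsection header determines the object number of every entry that follows. The class and method names are made up for this example and are not PDFBox API; the offsets are copied from the listing above, and only the start number is varied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
class XrefSubsectionDemo {
    // Offsets of the seven entries from the xref listing, in file order.
    static final long[] OFFSETS = {0L, 9L, 671785L, 671882L, 672069L, 672127L, 672178L};

    /** Assigns consecutive object numbers to the entries, starting at firstObjectNumber. */
    static Map<Integer, Long> numberEntries(int firstObjectNumber, long[] offsets) {
        Map<Integer, Long> table = new LinkedHashMap<>();
        for (int i = 0; i < offsets.length; i++) {
            table.put(firstObjectNumber + i, offsets[i]);
        }
        return table;
    }

    public static void main(String[] args) {
        // Start number 1, as written in the broken file.
        System.out.println("start 1: " + numberEntries(1, OFFSETS));
        // Start number 0, which shifts every entry down by one.
        System.out.println("start 0: " + numberEntries(0, OFFSETS));
    }
}
```

With a start number of 1 the entry at offset 9 is registered as object 2, while a start number of 0 registers it as object 1, which matches the 1 0 object the repair later finds at that position.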
Objects 3 and 5-7 are overwritten by the new version of the xref table, so
that objects 2 and 4 remain broken. Before PDFBOX-5238, PDFBox detected
that glitch and triggered a rebuild of the xref table using a brute force
search.
The recent changes repair the broken 2 0 reference by replacing it with the 1 0
object. The broken 4 0 reference points to the old 3 0 object and is simply
removed, as the new repair mechanism assumes that the 3 0 object is already
covered by a valid reference. But the broken 4 0 key points to the old 3 0 obj,
whereas the 3 0 key within the xref table is the new, repaired one. Those are
not related, so the repair attempt is invalid. In the end only objects 1, 3 and
5-7 are left in the xref table. Object 4 is missing, which leads to the
exception.
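The outcome described above can be reproduced with a strongly simplified model. This is a sketch under assumptions, not the actual PDFBox repair code: every name is hypothetical, and each xref entry is reduced to its key plus the object number that is actually found at its offset.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical model of the repair step, not PDFBox code.
class XrefRepairDemo {
    /** Merged table for this file: xref key -> object number found at the entry's offset. */
    static Map<Integer, Integer> mergedTable() {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        table.put(2, 1); // broken: key 2 points to "1 0 obj"
        table.put(3, 3); // overwritten by the new xref table, valid
        table.put(4, 3); // broken: key 4 points to the old "3 0 obj"
        table.put(5, 5); // overwritten by the new xref table, valid
        table.put(6, 6);
        table.put(7, 7);
        return table;
    }

    /** Returns the object numbers that remain in the table after the repair. */
    static SortedSet<Integer> repair(Map<Integer, Integer> table) {
        SortedSet<Integer> kept = new TreeSet<>();
        // Entries whose key matches the object found at the offset are kept as-is.
        for (Map.Entry<Integer, Integer> e : table.entrySet()) {
            if (e.getKey().equals(e.getValue())) {
                kept.add(e.getKey());
            }
        }
        // Mismatched entries are re-keyed to the object number that was found,
        // or dropped if that number is already present in the table.
        for (Map.Entry<Integer, Integer> e : table.entrySet()) {
            if (!e.getKey().equals(e.getValue()) && !kept.contains(e.getValue())) {
                kept.add(e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(repair(mergedTable()));
    }
}
```

In this model key 2 is re-keyed to 1, while key 4 is dropped because key 3 already exists, so no entry for object 4 survives.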
To sum it up, in this case the repair attempt is incomplete due to wrong
assumptions. The pdf is totally broken and we can't know if there are any parts
we can rely on, so that only a brute force search may lead to a happy end.
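For readers unfamiliar with the idea, a brute force search ignores the xref table and scans the file for "N G obj" markers instead. The following is a minimal sketch of that idea, assuming a plain regex scan is good enough; it is not PDFBox's implementation, the names are hypothetical, and a real parser has to cope with false matches inside streams and with object streams.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a brute force object search, not PDFBox code.
class BruteForceXrefDemo {
    // Matches "<object number> <generation> obj".
    private static final Pattern OBJ_MARKER = Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps each object number to the byte offset of its last "N G obj" marker. */
    static Map<Integer, Integer> scan(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so char index == byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<Integer, Integer> offsets = new TreeMap<>();
        Matcher m = OBJ_MARKER.matcher(text);
        while (m.find()) {
            // Assumption: a later occurrence belongs to a newer revision and wins.
            offsets.put(Integer.parseInt(m.group(1)), m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n3 0 obj\n<< >>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scan(sample));
    }
}
```

Because the object numbers are taken from the markers themselves, a wrong subsection header in the xref table has no influence on the result.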
> Page tree root must be a dictionary
> -----------------------------------
>
> Key: PDFBOX-5400
> URL: https://issues.apache.org/jira/browse/PDFBOX-5400
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.26
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: regression
> Attachments: 4ECBGZDM5GUZG7UT75RV5GTUFWF5TSXK.pdf
>
>
> worked in 2.0.25
> {noformat}
> Caused by: java.io.IOException: Page tree root must be a dictionary
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> {noformat}