[jira] [Comment Edited] (PDFBOX-5400) Page tree root must be a dictionary

Jira Fri, 25 Mar 2022 00:18:05 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512219#comment-17512219
 ]


Andreas Lehmkühler edited comment on PDFBOX-5400 at 3/25/22, 7:17 AM:
----------------------------------------------------------------------

[~Schmidor] As you already wrote, the "1 7" at the beginning of the first 
version of the xref table leads to an off-by-one error, as "0 6" has to be used 
instead, so the correct object numbers are as follows:
{code}
xref                                    (at 672274)
1 7
0000000000 65535 f                     1 free
0000000009 00000 n                      2
0000671785 00000 n                      3 (old)
0000671882 00000 n                      4 
0000672069 00000 n                      5 (old)
0000672127 00000 n                      6 (old)
0000672178 00000 n                      7 (old)
trailer
{code}
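
To illustrate the off-by-one, here is a minimal Java sketch. It is not PDFBox code, and the class and method names are made up for this comment. It numbers the entries of an xref subsection from the first object number given in the subsection header, so the same seven entries get different object numbers depending on whether numbering starts at 1 or at 0:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of PDFBox.
final class XrefSubsectionNumbering {

    // A subsection header "first count" assigns the object numbers
    // first, first + 1, ... to the entries in the order they appear.
    static List<Integer> objectNumbers(int first, int count) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            numbers.add(first + i);
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Header as found in the file: the free entry ends up as object 1.
        System.out.println(objectNumbers(1, 7));
        // The same seven entries numbered from 0: every number shifts down by one.
        System.out.println(objectNumbers(0, 7));
    }
}
```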
Objects 3 and 5-7 are overwritten by the new version of the xref table, so 
that objects 2 and 4 are still broken. Before PDFBOX-5238, PDFBox detected 
that glitch and triggered a rebuild of the xref table using a brute force 
search.
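
The overwrite can be sketched as follows. This is a simplified model and not the PDFBox implementation: the class name is hypothetical, the old offsets are the ones from the listing above, and the offsets of the newer table are placeholders because they are not shown in this comment. Every key that the newer table does not redefine keeps its old, wrongly numbered offset:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch for illustration only, not PDFBox code.
final class XrefMergeSketch {

    // Returns the object numbers whose offsets still come from the old table
    // after the newer table has been applied on top of it.
    static Set<Integer> staleKeys(Map<Integer, Long> oldTable, Map<Integer, Long> newTable) {
        Set<Integer> stale = new TreeSet<>(oldTable.keySet());
        stale.removeAll(newTable.keySet());
        return stale;
    }

    public static void main(String[] args) {
        // In-use entries of the first xref table, numbered as the "1 7" header dictates.
        Map<Integer, Long> oldTable = new TreeMap<>();
        oldTable.put(2, 9L);
        oldTable.put(3, 671785L);
        oldTable.put(4, 671882L);
        oldTable.put(5, 672069L);
        oldTable.put(6, 672127L);
        oldTable.put(7, 672178L);

        // Keys redefined by the newer table; 0L is a placeholder offset (assumption).
        Map<Integer, Long> newTable = new TreeMap<>();
        for (int key : new int[] {3, 5, 6, 7}) {
            newTable.put(key, 0L);
        }

        System.out.println(staleKeys(oldTable, newTable)); // prints [2, 4]
    }
}
```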

The recent changes repair the broken 2 0 reference by replacing it with the 1 0 
object. The broken 4 0 reference points to the old 3 0 object and is simply 
removed, as the new repair mechanism assumes that the 3 0 object belongs to a 
valid reference. But the broken 4 0 key points to the old 3 0 obj, while the 
3 0 key within the xref table is the new repaired one. Those two are not 
related, so the repair attempt is invalid. In the end only objects 1, 3 and 5-7 
are left in the xref table. Object 4 is missing, which leads to the exception.
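
The mismatch between an xref key and the object found at its offset can be checked with something like the following. It is a rough sketch under simplifying assumptions, not the check PDFBox performs: the names are hypothetical, and it only compares the object number, ignoring the generation number:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only, not PDFBox code.
final class XrefEntryCheckSketch {

    // Object header such as "3 0 obj"; the digit limits keep parseInt from overflowing.
    private static final Pattern HEADER =
            Pattern.compile("(\\d{1,9})\\s+(\\d{1,5})\\s+obj\\b");

    // Returns the object number of the header starting at the offset,
    // or -1 if no object header starts there.
    static int objectNumberAt(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return -1;
        }
        // ISO-8859-1 maps each byte to one char, so char index equals byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        m.region((int) offset, text.length());
        return m.lookingAt() ? Integer.parseInt(m.group(1)) : -1;
    }

    // True only if the xref key and the object number found at its offset agree.
    static boolean entryMatches(byte[] pdf, int key, long offset) {
        return objectNumberAt(pdf, offset) == key;
    }
}
```

With such a check, a 4 0 key whose offset leads to a "3 0 obj" header is reported as a mismatch instead of being treated as a duplicate of the 3 0 key.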

To sum it up, in this case the repair attempt is incomplete due to wrong 
assumptions. The PDF is totally broken and we can't know whether there are any 
parts we can rely on, so only a brute force search may lead to a happy end.
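
For readers unfamiliar with the idea, a brute force search boils down to ignoring the xref table and scanning the file for object headers. The following is a much simplified sketch of that idea, not the PDFBox implementation, which does considerably more. The class name is hypothetical, and letting a later occurrence of the same object number win is an assumption made here:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only, not PDFBox code.
final class BruteForceXrefSketch {

    // Object header such as "4 0 obj" that does not start in the middle of a number.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?<![0-9])(\\d{1,9})\\s+(\\d{1,5})\\s+obj\\b");

    // Maps each object number to the byte offset of its header.
    // Generation numbers are ignored; a later header with the same
    // object number replaces an earlier one (assumption).
    static Map<Integer, Long> scan(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so match positions are byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<Integer, Long> offsets = new TreeMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            offsets.put(Integer.parseInt(m.group(1)), (long) m.start());
        }
        return offsets;
    }
}
```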




> Page tree root must be a dictionary
> -----------------------------------
>
>                 Key: PDFBOX-5400
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5400
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.26
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: regression
>         Attachments: 4ECBGZDM5GUZG7UT75RV5GTUFWF5TSXK.pdf
>
>
> worked in 2.0.25
> {noformat}
> Caused by: java.io.IOException: Page tree root must be a dictionary
>     org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
>     org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>     org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> {noformat}


