[jira] [Commented] (PDFBOX-6009) Splitter does not include structure tree in documents past the first split

Tilman Hausherr (Jira) Fri, 16 May 2025 06:53:05 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952129#comment-17952129
 ]


Tilman Hausherr commented on PDFBOX-6009:
-----------------------------------------

Currently, when cloning, the Splitter ignores any structure element whose /Pg 
entry refers to a page that does not belong to the target document. In the 
attached PDF, the /Pg entry of the top /K element is page 1. Thus that 
element, and with it everything below it, is ignored when the destination 
doesn't contain page 1. Maybe this algorithm will have to be refined, e.g. 
only remove the /Pg entry but not the element itself if the element is not a 
"leaf". The description of /Pg is "a page object representing a page on which 
some or all of the content items designated by the K entry shall be rendered". 
PAC complains if there is a structure element with MCIDs but no /Pg entry, so 
for now I'll remove elements without a /Pg entry if they have at least one 
MCID (i.e. an integer in /K).
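
The refinement discussed above can be sketched roughly as follows. This is 
only a toy model to illustrate the idea, not the actual Splitter / KCloner 
code: the Elem class, the prune method and the use of plain page numbers in 
place of page objects are all hypothetical, and marked-content references 
that carry their own /Pg are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are PDFBox classes.
final class StructTreePruneSketch {

    /** A structure element with an optional /Pg (null = absent) and /K kids. */
    static final class Elem {
        final String type;
        final Integer pg; // a page number standing in for the /Pg page object
        final List<Object> kids = new ArrayList<>(); // each kid: Elem or Integer MCID

        Elem(String type, Integer pg, Object... kids) {
            this.type = type;
            this.pg = pg;
            this.kids.addAll(Arrays.asList(kids));
        }
    }

    /** Returns a pruned copy of src for the target pages, or null if it is dropped. */
    static Elem prune(Elem src, Set<Integer> targetPages) {
        boolean pageKept = src.pg != null && targetPages.contains(src.pg);
        boolean hasMcid = src.kids.stream().anyMatch(k -> k instanceof Integer);
        if (!pageKept && hasMcid) {
            // MCIDs without a usable /Pg would be flagged by PAC: drop the element.
            return null;
        }
        // Keep the element; /Pg survives only if its page is in the target.
        Elem dst = new Elem(src.type, pageKept ? src.pg : null);
        for (Object kid : src.kids) {
            if (kid instanceof Integer) {
                dst.kids.add(kid);
            } else {
                Elem clone = prune((Elem) kid, targetPages);
                if (clone != null) {
                    dst.kids.add(clone);
                }
            }
        }
        // An element left without any kids is dropped as well (a choice of this sketch).
        return dst.kids.isEmpty() ? null : dst;
    }

    public static void main(String[] args) {
        // Top element has /Pg = page 1, similar to the attached PDF.
        Elem root = new Elem("Document", 1,
                new Elem("P", 1, 0),
                new Elem("P", 2, 0),
                new Elem("P", 3, 0));
        Elem split2 = prune(root, Set.of(2));
        System.out.println("root kept: " + (split2 != null)
                + ", kids: " + (split2 == null ? 0 : split2.kids.size()));
    }
}
```

With this approach the top element survives in the split for page 2 (without 
its /Pg entry) because it is not a leaf, while the leaf elements for pages 1 
and 3 are dropped.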

> Splitter does not include structure tree in documents past the first split
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-6009
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6009
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.34, 3.0.5 PDFBox
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: StructureTree
>         Attachments: pdfbox-split-missing-tags_mail 15.5.2025-p1.pdf, 
> pdfbox-split-missing-tags_mail 15.5.2025-p2.pdf, 
> pdfbox-split-missing-tags_mail 15.5.2025-p3.pdf, 
> pdfbox-split-missing-tags_mail 15.5.2025.pdf
>
>
> As submitted by Alastair Porter in the users mailing list
> java -jar pdfbox/app/target/pdfbox-app-4.0.0-SNAPSHOT.jar split -i input.pdf 
> -outputPrefix output-split
> Only the first split file has the appropriate structure tree (in the others, /K is missing)
> === from the post in the mailing list ===
> In the first file, I correctly see the /K element. What's more, this element 
> has correctly been pruned and doesn't include any items from the input 
> document which point to pages that are not in this split.
> In subsequent split files, I see no /K element in the StructTreeRoot at all.
> I attached a PDF which I've been using for simple testing, which exhibits 
> this behaviour.
> I had a bit of a look through the existing code, and I see that in 
> Splitter.java, in cloneStructureTree
> {code:java}
> COSBase k1 = srcStructureTreeRoot.getK();
> COSBase k2 = new KCloner(dstPageTree).createClone(k1, 
> dstStructureTreeRoot.getCOSObject(), null);
> dstStructureTreeRoot.setK(k2);
> {code}
> k2 is always null after the first split; it seems like it may not be created 
> correctly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
