[jira] [Comment Edited] (PDFBOX-4007) Merged documents don't retain tags

Dave Hill (JIRA) Tue, 09 Jan 2018 12:22:37 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267644#comment-16267644
 ]


Dave Hill edited comment on PDFBOX-4007 at 1/9/18 8:21 PM:
-----------------------------------------------------------

Today, when I tried to demonstrate Adobe crashing, I used the development 
trunk. The tags were not right, but Adobe DC did not crash when I viewed them.

We think we understand what your test is trying to demonstrate: there are 
duplicate /Type /Pages nodes pointing to duplicate /Type /Page objects. The 
appended document's tags point to these orphan pages instead of to the correct 
pages, which only makes the tags appear to work correctly.

I can get your test back to greenbar with a variety of test files by making 
this change to the patched PDFMergerUtility:504

            
{code:java}
for (int i = 0; i < srcNumbersArray.size() / 2; i++) {
    destNumbersArray.add(COSInteger.get(destParentTreeNextKey + i));
    // in my patch it was:
    // destNumbersArray.add(cloner.cloneForNewDocument(srcNumbersArray.getObject(i * 2 + 1)));
    destNumbersArray.add(srcNumbersArray.getObject(i * 2 + 1));
}
{code}
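
To make the index arithmetic in that loop easier to follow, here is a minimal 
plain-Java sketch of the same re-keying idea. It does not use PDFBox; the class 
NumsRekeySketch, the method appendRekeyed and the string values are 
hypothetical stand-ins. It only assumes that a number tree's /Nums array is a 
flat list of key, value, key, value, ... so entry i has its value at index 
i * 2 + 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not the PDFBox API: a /Nums array as a flat List<Object>.
final class NumsRekeySketch {

    /**
     * Appends every value of src (laid out as key, value, key, value, ...)
     * to dest under fresh consecutive keys starting at nextKey.
     * Returns the next unused key.
     */
    static int appendRekeyed(List<Object> dest, List<Object> src, int nextKey) {
        if (src.size() % 2 != 0) {
            throw new IllegalArgumentException("Nums array must hold key/value pairs");
        }
        int pairs = src.size() / 2;
        for (int i = 0; i < pairs; i++) {
            dest.add(Integer.valueOf(nextKey + i)); // new key; the source key is ignored
            dest.add(src.get(i * 2 + 1));           // value of the i-th source pair
        }
        return nextKey + pairs;
    }

    public static void main(String[] args) {
        List<Object> dest = new ArrayList<Object>(Arrays.<Object>asList(0, "destEntry0"));
        List<Object> src = Arrays.<Object>asList(0, "srcEntry0", 1, "srcEntry1");
        int next = appendRekeyed(dest, src, 1);
        System.out.println(dest + " nextKey=" + next);
    }
}
```

Note that, like the loop above, this ignores the source keys entirely, so it 
presumably only lines up with the pages' references when the source keys run 
consecutively from 0.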

But the test is green for all the wrong reasons. When we dig into the output 
we still find orphaned pages. They just appear to be more successfully 
orphaned, because the NumbersArray now points to incorrect objects. This code 
change is obviously going in the wrong direction despite being "green".

Could we get a red test with the patch and the above changes in place?
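
As a rough illustration of what such a red test could assert, here is a 
plain-Java sketch of an orphan check. It is a model only and does not use 
PDFBox: Page, findOrphans and the labels are hypothetical names. Pages are 
compared by identity, on the assumption that a cloned page is a distinct 
object even when its content matches the original.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical model, not the PDFBox API.
final class OrphanPageCheckSketch {

    /** Stand-in for a /Type /Page object; identity matters, not the label. */
    static final class Page {
        final String label;
        Page(String label) { this.label = label; }
    }

    /**
     * Returns each page that a tag points to but that is not reachable
     * from the page tree, reported once, in first-seen order.
     */
    static List<Page> findOrphans(List<Page> pageTree, List<Page> tagTargets) {
        Set<Page> reachable =
                Collections.newSetFromMap(new IdentityHashMap<Page, Boolean>());
        reachable.addAll(pageTree);
        Set<Page> reported =
                Collections.newSetFromMap(new IdentityHashMap<Page, Boolean>());
        List<Page> orphans = new ArrayList<Page>();
        for (Page target : tagTargets) {
            if (!reachable.contains(target) && reported.add(target)) {
                orphans.add(target);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        Page real = new Page("page 1");
        Page duplicate = new Page("page 1"); // same content, different object
        List<Page> pageTree = Collections.singletonList(real);
        List<Page> tagTargets = Collections.singletonList(duplicate);
        // A merge test would go red whenever this list is non-empty.
        System.out.println("orphans: " + findOrphans(pageTree, tagTargets).size());
    }
}
```

A test built on this idea fails whenever any tag target sits outside the page 
tree, regardless of whether the viewer happens to display the tags correctly.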

Also uploading a "HelloWorldTagged.pdf" that was created by hand and is very 
human readable but which also reproduces the red test that you had created. 
This file is much easier to debug through than the government file.





> Merged documents don't retain tags
> ----------------------------------
>
>                 Key: PDFBOX-4007
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4007
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.8
>            Reporter: Dave Hill
>            Priority: Minor
>              Labels: StructureTree, merge
>         Attachments: HelloWorldTagged.pdf, PDFMergeUtility-2.patch, 
> PDFMergeUtility.patch, Tagged+GeneralForbearance-Merged.pdf, Tagged.pdf
>
>
> Certain combinations of documents don't retain tags when merged. The document 
> [^Tagged.pdf] is just a basic one-word PDF created and tagged with Pro DC. If 
> you try to merge this with the government [General Forbearance 
> form|https://studentloans.gov/myDirectLoan/downloadForm.action?searchType=library&shortName=general&localeCode=en-us]
>  the output crashes DC when you try to view the tags. If you use a flattened 
> version of the General Forbearance form then the tags are just munged.
> {code}
>     public static void main(String[] args) throws Exception {
>         PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
>         PDDocument src = PDDocument.load(new File("Tagged.pdf"));
>         PDDocument dest = PDDocument.load(new File("GeneralForbearance.pdf"));
>         pdfMergerUtility.appendDocument(dest, src);
>         src.close();
>         dest.save(new File("BrokenTags.pdf"));
>         dest.close();
>     }
> {code}
> The included patch appears to make tagging more reliable, but I'm still 
> relying heavily on cloning, which can apparently cause other issues. The 
> documents I get out with this code seem to present correctly in Adobe readers 
> for all combinations of documents that I tested against.
> My patch is made and tested against yesterday's production head and it 
> includes my changes from 
> [PDFBOX-3999|https://issues.apache.org/jira/browse/PDFBOX-3999] since it is 
> in the exact same place in the code.
> The priority of this is a blocker for 508 compliance of merged documents, but 
> I guessed it to be more of a minor issue in the overall scheme of things. 
> Please correct me if I am mistaken.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
