[jira] [Commented] (PDFBOX-5602) Consider adding support for PDF files Concatenation in addition to the full Merge

Zbigniew Minciel (Jira) Mon, 15 May 2023 15:08:20 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722939#comment-17722939
 ]


Zbigniew Minciel commented on PDFBOX-5602:
------------------------------------------

 

Did more testing, summary below.
h1. *Failure when merging all 5 large files, each 500-600MB*

I managed to resolve the failure by increasing the maximum size of the JVM heap 
to 24GB:

     java -Xms24G -Xmx24G

However, it took *369* minutes to complete.
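
As a quick sanity check that heap flags like the ones above actually reach the JVM that runs the merge, a small sketch along these lines could be run with the same options. {{HeapCheck}} and {{maxHeapMiB}} are hypothetical names used only for illustration; they are not part of PDFBox.

```java
public class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in mebibytes. */
    static long maxHeapMiB() {
        // Runtime.maxMemory() reflects the effective -Xmx setting
        // (or Long.MAX_VALUE if there is no inherent limit).
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as {{java -Xms24G -Xmx24G HeapCheck}} should report a value close to the requested maximum; the exact number depends on the JVM and garbage collector.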
h1. *Running without the structure tree*

Per your suggestion, I made the following change to drop the structure tree:

{{//destCatalog.setStructureTreeRoot(destStructTree);}}
{{destCatalog.setStructureTreeRoot(null);}}

 

The Maven build of the PDFBox source then failed because of failures in 
PDFMergerUtilityTest.java; see details at the bottom.

I imported the PDFBox code into Eclipse instead, where I managed to run 
PDFMergerUtility and export it as a jar file.

I ran the large merge with both the default and the enlarged JVM heap. Both runs 
were successful, and the running time was reduced significantly:

 

with -Xms24G -Xmx24G:   24 minutes

with default heap size:   29 minutes

 

See the hot spot statistics (at the bottom) for 3.0.0-alpha3 and 3.0.0-SNAPSHOT 
that I collected with VisualVM. They clearly differ. Also, note that I no longer 
see heavy logging of the message below. These log messages could also have 
contributed to the problem. I think the log messages are always constructed, 
and whether they are actually logged depends on the log level.

 May 15, 2023 8:00:02 AM org.apache.pdfbox.multipdf.PDFMergerUtility mergeIDTree
WARNING: key node00001371 already exists in destination IDTree
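
To illustrate the concern about messages always being created: the following is a minimal sketch using plain {{java.util.logging}}, not PDFBox's actual logging code. It counts how often a message string is built when the logger level suppresses WARNING. The class {{LazyLogDemo}}, the logger name {{demo.merge}} and the helper methods are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogDemo {
    // Counts how many times the message string was actually built.
    static final AtomicInteger BUILT = new AtomicInteger();

    static String buildMessage(String key) {
        BUILT.incrementAndGet();
        return "key " + key + " already exists in destination IDTree";
    }

    /** Logs n warnings and returns how many message strings were built. */
    static int countBuilds(Level loggerLevel, boolean lazy, int n) {
        Logger log = Logger.getLogger("demo.merge"); // hypothetical logger name
        log.setUseParentHandlers(false);             // keep the console quiet
        log.setLevel(loggerLevel);
        BUILT.set(0);
        for (int i = 0; i < n; i++) {
            String key = "node" + i;
            if (lazy) {
                // Supplier overload: evaluated only if WARNING is loggable.
                log.warning(() -> buildMessage(key));
            } else {
                // Eager: the string is built before the level check happens.
                log.warning(buildMessage(key));
            }
        }
        return BUILT.get();
    }

    public static void main(String[] args) {
        System.out.println("eager, level SEVERE: " + countBuilds(Level.SEVERE, false, 1000));
        System.out.println("lazy,  level SEVERE: " + countBuilds(Level.SEVERE, true, 1000));
    }
}
```

With eager construction the cost of building each message is paid even when nothing is logged. A level check or a supplier avoids that cost. Whether this matters for the merge times reported here is not established by this sketch.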
h3. +I hope you will consider adding an option to the official release to help 
deal with cases like the one described.+

 
h1. *MVN Build Failure*

[ERROR] testStructureTreeMerge7  Time elapsed: 0.02 s  <<< ERROR!
java.lang.NullPointerException: Cannot invoke 
"org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot.getParentTree()"
 because the return value of 
"org.apache.pdfbox.pdmodel.PDDocumentCatalog.getStructureTreeRoot()" is null
    at 
org.apache.pdfbox.multipdf.PDFMergerUtilityTest.checkWithNumberTree(PDFMergerUtilityTest.java:642)
    at 
org.apache.pdfbox.multipdf.PDFMergerUtilityTest.testStructureTreeMerge7(PDFMergerUtilityTest.java:471)

[INFO] Running org.apache.pdfbox.pdmodel.graphics.image.LosslessFactoryTest
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.353 s 
- in org.apache.pdfbox.pdmodel.graphics.image.CCITTFactoryTest
[INFO] Running org.apache.pdfbox.pdmodel.graphics.image.PDImageXObjectTest
[ERROR] Tests run: 33, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.231 
s <<< FAILURE! - in org.apache.pdfbox.pdmodel.graphics.image.PDImageXObjectTest
[ERROR] testMergeBogusStructParents1  Time elapsed: 0 s  <<< ERROR!
java.lang.NullPointerException: Cannot invoke 
"org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot.getParentTree()"
 because the return value of 
"org.apache.pdfbox.pdmodel.PDDocumentCatalog.getStructureTreeRoot()" is null
    at 
org.apache.pdfbox.multipdf.PDFMergerUtilityTest.checkWithNumberTree(PDFMergerUtilityTest.java:642)
    at 
org.apache.pdfbox.multipdf.PDFMergerUtilityTest.testMergeBogusStructParents1(PDFMergerUtilityTest.java:572)
h1. *Hot Spots*

!cpu-hot-spots-3.0.0-alpha3.PNG!

 

!cpu-hot-spots-3.0.0-SNAPSHOT.PNG!

> Consider adding support for PDF files Concatenation in addition to the  full 
> Merge
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5602
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5602
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Zbigniew Minciel
>            Priority: Major
>         Attachments: CapturePdfDebugger.PNG, Large527MbytesPDF.PNG, 
> cpu-hot-spots-3.0.0-SNAPSHOT.PNG, cpu-hot-spots-3.0.0-alpha3.PNG
>
>
> I decided to evaluate the limits of pdfbox 3.0.0-alpha3 when merging a large 
> number of PDF files.
> I attempted to merge 7500 emails stored as separate PDF files on Windows. 
> Given the limit on the maximum size of the command line arguments, I merged 
> subsets of files. I ended up with 5 large PDF files, each around 
> 500-600MB. I tried to merge these 5 files, but the merge eventually failed 
> after running for more than 6 hours. See the error log at the bottom. I have 
> 48GB of RAM. PDFBox used at most 13GB of memory; usage fluctuated 
> between 600MB and 13GB. 
> I am wondering whether PDFBox could support a Concatenation mode in addition 
> to the full Merge mode. No need to create an index table, etc. Given my total 
> lack of understanding of how PDF works, I suppose it could work as follows:
>  # Read the first file, process it and append it to the target PDF file. 
> Delete the PDF data and related metadata for this file, except perhaps the 
> last page number.
>  # Read the second file and process it in the same fashion as in step 1.
>  # etc.
> If Concatenation is possible, it would greatly reduce the CPU and memory 
> overhead and reduce processing time.
> I admit that merging such a large number of PDF files is not typical, but the 
> issue is valid.
> ^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
>     at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
>     at java.base/java.util.Hashtable.put(Hashtable.java:493)
>     at 
> org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
>     at 
> org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
>     at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
>     at 
> org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
>     at 
> org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
>     at 
> org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>     at picocli.CommandLine.access$1300(CommandLine.java:145)
>     at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
>     at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
>     at picocli.CommandLine.execute(CommandLine.java:2078)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> Respectfully,
> Zbigniew
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
