[jira] [Commented] (PDFBOX-5602) Consider adding support for PDF files Concatenation in addition to the full Merge

Tilman Hausherr (Jira) Sat, 13 May 2023 10:36:05 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722383#comment-17722383
 ]


Tilman Hausherr commented on PDFBOX-5602:
-----------------------------------------

Ideally PDFs are compressed, so loading them might use much more memory than 
the file size suggests.
A table of contents page or outlines are something different from the structure 
tree. The structure tree divides pages into different segments, which helps screen 
readers (and much more, e.g. tables). I don't know if you have a structure tree 
at all. Use PDFDebugger and switch to "Show internal structure" in the menu.
1) Yes, if all of the PDF is needed. 3.0.0 parses on demand, so deleting 
the structure tree before merging might help. (2.0.28 always loads everything.)
2) It may happen.
3) This is very long. Could it be that you have some limit on memory usage? 
It would be interesting to get such files for testing.
4) To the target PDDocument. You'd need to save it yourself.
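
On point 3, one quick way to see whether a heap limit is in play is to ask the 
JVM for its configured maximum heap. The sketch below is only an illustration 
and uses nothing but the standard java.lang.Runtime API; the class name 
HeapLimit and its helper methods are made up for this example and are not part 
of PDFBox. Many JVMs pick a default maximum heap that is a fraction of physical 
RAM, so the reported value may be well below the installed memory.

```java
// Hypothetical helper, not part of PDFBox: reports the heap limits of the running JVM.
class HeapLimit {
    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use (allocated minus free), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

If the maximum turns out to be lower than expected, the usual thing to try is 
the -Xmx option of the java launcher (for example -Xmx32g) when starting the 
merge.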


> Consider adding support for PDF files Concatenation in addition to the full 
> Merge
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5602
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5602
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Utilities
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Zbigniew Minciel
>            Priority: Major
>
> I decided to evaluate pdfbox 3.0.0-alpha3 limits on merging a large number of 
> PDF files.
> I attempted to merge 7500 mails in separate PDF files on Windows. Given the 
> limitation on the max size of the command line arguments, I was merging 
> subsets of files. I ended up with 5 large PDF files, each around 
> 500-600 MB. I tried to merge these 5 files, but the merge eventually failed 
> after running for more than 6 hours. See the error log at the bottom. I have 
> 48 GB of RAM. PDFBox was using up to 13 GB of memory; usage varied 
> between 600 MB and 13 GB. 
> I am wondering whether PDFBox could support a Concatenation mode in addition to 
> the full Merge mode. No need to create an index table, etc. It could work as 
> follows, I suppose, given my total lack of understanding of how PDF works:
>  # Read the first file, process it and append it to the target PDF file. Delete 
> the PDF data and related metadata for this file, except perhaps the last page 
> number.
>  # Read the second file and process it in a similar fashion as in step 1
>  # etc
> If Concatenation is possible, it would greatly reduce the CPU and memory 
> overhead and reduce processing time.
> I admit merging such a large number of PDF files is not typical, but the 
> issue is valid.
> ^CException in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Hashtable.rehash(Hashtable.java:419)
>     at java.base/java.util.Hashtable.addEntry(Hashtable.java:441)
>     at java.base/java.util.Hashtable.put(Hashtable.java:493)
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBodyCompressed(COSWriter.java:481)
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1260)
>     at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:402)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1542)
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1418)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1018)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:963)
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:982)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.legacyMergeDocuments(PDFMergerUtility.java:476)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:355)
>     at org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:339)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:76)
>     at org.apache.pdfbox.tools.PDFMerger.call(PDFMerger.java:37)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>     at picocli.CommandLine.access$1300(CommandLine.java:145)
>     at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2358)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2352)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2314)
>     at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2316)
>     at picocli.CommandLine.execute(CommandLine.java:2078)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> Respectfully,
> Zbigniew
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
