Maruan Sahyoun commented on PDFBOX-4182:

I like the patch's idea of using a strategy to select the point at which 
documents are closed. What I would really like to do, though, is introduce a 
new merge implementation behind the scenes. Initially it would not support 
merging all of the elements that are currently supported, but it would reuse or 
rewrite how we handle the different elements, so that we can gradually resolve 
the open issues and, in general, close a document right after it has been 
merged. So instead of naming the merge strategies after how we close documents, 
I'd rather go for names that do not reflect the inner workings. As you've 
written above, the patch improves the situation for documents that we know can 
be handled by closing them directly after the merge, but it doesn't resolve the 
issues for the ones where that doesn't work.

My proposal would be to have basically two mergeDocuments methods (although 
they might be called differently): one for the legacy merge, i.e. the current 
mode of operation, and one with a new implementation to which we add 
capabilities over time.
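
To make the proposal concrete, here is a rough sketch of a single entry point 
that dispatches to two implementations. It does not use PDFBox: MergeMode, its 
constants LEGACY and INCREMENTAL, SourceDoc and peakOpen() are all hypothetical 
names made up for illustration, and the "merge" only copies document names. The 
point is just that the mode names say nothing about when documents get closed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeModeSketch {

    /** Hypothetical mode names; neither one describes when sources are closed. */
    enum MergeMode { LEGACY, INCREMENTAL }

    private int open;      // stand-in documents currently open
    private int peakOpen;  // highest number open at the same time

    /** Stand-in for an opened source document; not a PDFBox class. */
    private final class SourceDoc {
        final String name;

        SourceDoc(String name) {
            this.name = name;
            open++;
            peakOpen = Math.max(peakOpen, open);
        }

        void close() {
            open--;
        }
    }

    /** Single entry point; the mode selects the implementation. */
    List<String> mergeDocuments(List<String> names, MergeMode mode) {
        open = 0;
        peakOpen = 0;
        return mode == MergeMode.LEGACY ? legacyMerge(names) : incrementalMerge(names);
    }

    int peakOpen() {
        return peakOpen;
    }

    /** Current behaviour as described in the issue: keep every source open until the end. */
    private List<String> legacyMerge(List<String> names) {
        List<String> merged = new ArrayList<String>();
        List<SourceDoc> toClose = new ArrayList<SourceDoc>();
        for (String name : names) {
            SourceDoc doc = new SourceDoc(name);
            toClose.add(doc);
            merged.add(doc.name);
        }
        for (SourceDoc doc : toClose) {
            doc.close();
        }
        return merged;
    }

    /** New implementation: each source is closed as soon as it has been merged. */
    private List<String> incrementalMerge(List<String> names) {
        List<String> merged = new ArrayList<String>();
        for (String name : names) {
            SourceDoc doc = new SourceDoc(name);
            try {
                merged.add(doc.name);
            } finally {
                doc.close();
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        MergeModeSketch sketch = new MergeModeSketch();
        List<String> names = Arrays.asList("a.pdf", "b.pdf", "c.pdf");
        sketch.mergeDocuments(names, MergeMode.LEGACY);
        System.out.println("legacy peak open: " + sketch.peakOpen());
        sketch.mergeDocuments(names, MergeMode.INCREMENTAL);
        System.out.println("incremental peak open: " + sketch.peakOpen());
    }
}
```

In a real implementation the incremental path would start with only the 
elements it can handle safely and grow from there, while the legacy path stays 
untouched.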

> Improve memory usage of PDFMergerUtility
> ----------------------------------------
>                 Key: PDFBOX-4182
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4182
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9
>            Reporter: Pas Filip
>            Priority: Major
>         Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, 
> Suppliers.java, 
> failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png, 
> merge-pdf-stats.xlsx, merge-utility.patch, 
> oom-2gb-heap-after-refactoring-leak-suspect-1.png, 
> oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful - 
> refactored-merge-utility-4gb-heap-2618-files-merged.png, successful 
> -merge-utility-6gb-heap-2618-files-merged.png, 
> successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png, 
> successful-merge-utility-8gb-heap-2618-files-merged.png, 
> successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
> I have been running some tests trying to merge a large number (2618) of small 
> PDF documents, between 100kb and 130kb each, into a single large PDF 
> (288.433kb).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to account for the majority of the memory usage.
> (see the attached screenshot from MAT)
> (I would attach the hprof so you can analyze it yourselves, but 
> it's rather large)
> Note that it seems impossible to generate a large pdf using a small memory 
> footprint.
> I personally thought that using MemorySettings with temporary file only would 
> allow me to generate arbitrarily large pdf files but it doesn't seem to help.
> I've run mergeDocuments with the following memory settings:
>  * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L)
>  * MemoryUsageSetting.setupTempFileOnly()
> Refactored version completes with *4GB* heap:
> with temp file only completes 2618 documents in 1.760 min
> *VS*
> *8GB* heap:
> with temp file only completes 2618 documents in 2.0 min
> Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB 
> and 8GB)
>  It looks like the loop in mergeDocuments accumulates PDDocument objects 
> in a list and closes them only after the merge is completed.
> Refactoring the code to close these as they are used, instead of accumulating 
> them and closing all at the end, improves memory usage considerably (although, 
> based on the MAT analysis, the problem doesn't seem to be eliminated completely).
> Another change I've implemented is to create the input stream only when the 
> file needs to be read, and to close it alongside the PDDocument.
> (Some input streams contain buffers, and depending on the size of the buffers 
> and/or the stream type, accumulating all the streams is a potential 
> memory hog.)
> These changes seem to be beneficial, in the sense that I can 
> process the same number of PDFs with about half the memory.
>  I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected Java 6 compatibility.)
> I've attached the Java files of the new implementation:
>  * Suppliers
>  * Supplier
>  * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. There are no 
> signature changes, only internal code changes. (Just rename the class to 
> PDFMergerUtility if you decide to implement the changes.)
>  The attachments also include some screenshots from VisualVM showing the 
> memory usage of the original version and the refactored version, as well as 
> some info produced by MAT after analysing the heap.
> If you know of any other means, without running into memory issues, to merge 
> large sets of PDF files into a single large PDF, I'd love to hear about it!
> I'd also suggest further improvements to memory usage overall, as PDFBox 
> seems to consume a lot of memory in general.
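
For reference, the lazy-stream idea from the description (open each input 
stream only when it is about to be read, and close it before moving on) can be 
sketched roughly as below. This is not the attached Supplier.java, which may 
look different: StreamSupplier, countingSupplier and consumeAll are 
hypothetical names, and the sources are in-memory byte arrays rather than PDF 
files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

class LazySourceSketch {

    /** Hypothetical Java 6 style supplier; the attached Supplier.java may differ. */
    interface StreamSupplier {
        InputStream get() throws IOException;
    }

    /** Returns a supplier that opens a stream only when asked and counts each open. */
    static StreamSupplier countingSupplier(final byte[] data, final int[] openedCount) {
        return new StreamSupplier() {
            @Override
            public InputStream get() {
                openedCount[0]++;
                return new ByteArrayInputStream(data);
            }
        };
    }

    /** Reads the sources one at a time, so at most one stream is open at any moment. */
    static long consumeAll(List<StreamSupplier> sources) {
        long total = 0;
        byte[] buffer = new byte[4096];
        for (StreamSupplier source : sources) {
            try {
                InputStream in = source.get(); // the stream is created only here
                try {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        total += n;
                    }
                } finally {
                    in.close(); // closed before the next source is opened
                }
            } catch (IOException e) {
                throw new IllegalStateException("could not read source", e);
            }
        }
        return total;
    }
}
```

Building the list of suppliers opens nothing; streams are created one by one 
inside consumeAll, which is what keeps buffered streams from piling up.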
