Pas Filip commented on PDFBOX-4182:

[~tilman] I think introducing the parameter would be a useful way to reduce 
memory usage in the short term.

Ideally, reworking the scratch file handling would probably yield the largest 
gains in memory consumption, but that is not as easy...



> Improve memory usage of PDFMergerUtility
> ----------------------------------------
>                 Key: PDFBOX-4182
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4182
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9
>            Reporter: Pas Filip
>            Priority: Major
>         Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, 
> Suppliers.java, 
> failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png, 
> merge-pdf-stats.xlsx, oom-2gb-heap-after-refactoring-leak-suspect-1.png, 
> oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful - 
> refactored-merge-utility-4gb-heap-2618-files-merged.png, successful 
> -merge-utility-6gb-heap-2618-files-merged.png, 
> successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png, 
> successful-merge-utility-8gb-heap-2618-files-merged.png, 
> successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
> I have been running some tests trying to merge a large number (2618) of small 
> PDF documents, between 100 KB and 130 KB each, into a single large PDF 
> (288,433 KB).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to account for the majority of the memory usage
> (see the attached screenshot from MAT).
> (I would attach the hprof heap dump so you can analyze it yourselves, but 
> it's rather large.)
> Note that it seems impossible to generate a large pdf using a small memory 
> footprint.
> I personally thought that using MemoryUsageSetting with temp file only would 
> allow me to generate arbitrarily large PDF files, but it doesn't seem to help.
> I've run mergeDocuments with the following memory settings:
>  * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L 
> * 1024L)
>  * MemoryUsageSetting.setupTempFileOnly()
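
For reference, the two size arguments in the setupMixed call above work out to 1 MiB for the main-memory limit and 1 PiB for the storage limit. The sketch below only evaluates that arithmetic; it does not call PDFBox, and the class and constant names are illustrative, not part of any library.

```java
// Illustrative only: evaluates the size arguments quoted above; no PDFBox calls.
final class MemorySettingSizes {
    // 1024 * 1024 bytes = 2^20 bytes = 1 MiB (first argument: main-memory limit).
    static final long MAX_MAIN_MEMORY_BYTES = 1024L * 1024L;

    // 1024^5 bytes = 2^50 bytes = 1 PiB (second argument: storage limit).
    // The L suffix keeps the multiplication in 64-bit long arithmetic.
    static final long MAX_STORAGE_BYTES = 1024L * 1024L * 1024L * 1024L * 1024L;

    public static void main(String[] args) {
        System.out.println("main memory limit: " + MAX_MAIN_MEMORY_BYTES + " bytes");
        System.out.println("storage limit:     " + MAX_STORAGE_BYTES + " bytes");
    }
}
```

In other words, this setting allows only a small in-memory buffer and places effectively no cap on the scratch file.
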
> The refactored version completes with a *4GB* heap:
> with temp file only, it merges 2618 documents in 1.760 min
> *VS*
> the original version with an *8GB* heap:
> with temp file only, it merges 2618 documents in 2.0 min
> Heaps of 6GB or less result in OOM. (I didn't try sizes between 6GB 
> and 8GB.)
> It looks like the loop in mergeDocuments accumulates PDDocument objects 
> in a list, and they are only closed after the merge is completed.
> Refactoring the code to close these as they are used, instead of accumulating 
> them and closing them all at the end, improves memory usage considerably 
> (although the problem doesn't seem to be eliminated completely, based on the 
> MAT analysis).
> Another change I've implemented is to create the input stream only when the 
> file needs to be read, and to close it alongside the PDDocument.
> (Some input streams contain buffers and, depending on the size of the buffers 
> and/or the stream type, accumulating all the streams is a potential 
> memory hog.)
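
The two changes described above (open each input only when it is needed, and close it before moving on to the next one) can be sketched without PDFBox as follows. This is a minimal illustration under stated assumptions, not the attached implementation: `StreamSource` and `LazyMergeSketch` are hypothetical names, plain byte concatenation stands in for the real PDF merge, and the counters exist only to show how many streams are open at once. It uses try-with-resources for brevity; a Java 6 version would use try/finally instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: defers creating the stream until open() is called.
interface StreamSource {
    InputStream open() throws IOException;
}

final class LazyMergeSketch {
    // Counters used only to illustrate how many streams are open at once.
    static int openNow = 0;
    static int maxOpen = 0;

    // Wraps a byte array as a source whose stream is created on demand.
    static StreamSource sourceOf(final byte[] data) {
        return new StreamSource() {
            public InputStream open() {
                openNow++;
                maxOpen = Math.max(maxOpen, openNow);
                return new ByteArrayInputStream(data) {
                    @Override
                    public void close() throws IOException {
                        openNow--;
                        super.close();
                    }
                };
            }
        };
    }

    // Appends each source to the output, opening and closing one stream at a time.
    // Concatenation is a stand-in for merging; real PDFs cannot be merged this way.
    static byte[] concatenate(List<StreamSource> sources) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        for (StreamSource source : sources) {
            try (InputStream in = source.open()) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } // the stream is closed here, before the next source is opened
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<StreamSource> sources = new ArrayList<StreamSource>();
        for (int i = 0; i < 5; i++) {
            sources.add(sourceOf(new byte[] {(byte) i}));
        }
        byte[] merged = concatenate(sources);
        System.out.println(merged.length + " bytes, max open streams: " + maxOpen);
    }
}
```

Holding a list of sources rather than a list of open streams means the per-input buffers exist only while that input is being processed.
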
> These changes seem to be beneficial, in the sense that I can 
> process the same number of PDFs with about half the memory.
> I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected Java 6 compatibility.)
> I've attached the Java files of the new implementation:
>  * Suppliers
>  * Supplier
>  * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. There are no 
> signature changes, only internal code changes. (Just rename the class to 
> PDFMergerUtility if you decide to implement the changes.)
> The attachments also include some screenshots from VisualVM showing the 
> memory usage of the original version and the refactored version, as well as 
> some info produced by MAT after analysing the heap.
> If you know of any other means of merging large sets of PDF files into a 
> single large PDF without running into memory issues, I'd love to hear about it!
> I'd also suggest that further improvements be made to memory usage in 
> general, as PDFBox seems to consume a lot of memory.
