Gary Potagal updated PDFBOX-4188:
    Affects Version/s: 3.0.0 PDFBox

>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --------------------------------------------------------------------------------------------------
>                 Key: PDFBOX-4188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9, 3.0.0 PDFBox
>            Reporter: Gary Potagal
>            Priority: Major
> I have been running some tests trying to merge large amounts (2618) of small 
> pdf documents, between 100kb and 130kb, into a single large pdf (288.433kb)
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to consume the majority of the memory usage.
> (see screenshot from mat in attachment)
> (I would include the hprof in attachment so you can analyze yourselves but 
> it's rather large)
> Note that it seems impossible to generate a large pdf using a small memory 
> footprint.
> I personally thought that using MemorySettings with temporary file only would 
> allow me to generate arbitrarily large pdf files but it doesn't seem to help.
> I've run the mergeDocuments with  memory settings:
>  * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L 
> * 1024L)
>  * MemoryUsageSetting.setupTempFileOnly()
> Refactored version completes with *4GB* heap:
> with temp file only completes 2618 documents in 1.760 min
> *VS*
> *8GB* heap:
> with temp file only completes 2618 documents in 2.0 min
> Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB 
> and 8GB)
>  It looks like the loop in the mergeDocuments accumulates PDDocument objects 
> in a list which are closed after the merge is completed.
> Refactoring the code to close these as they are used, instead of accumulating 
> them and closing all at the end, improves memory usage considerably.(although 
> doesn't seem to be eliminated completed based on mat analysis.)
> Another change I've implemented is to only create the inputstream when the 
> file needs to be read and to close it alongside the PDDocument.
> (Some inputstreams contain buffers and depending on the size of the buffers 
> and or the stream type accumulating all the streams is a potential 
> memory-hog.)
> These changes seems to have a beneficial improvement in the sense that I can 
> process the same amount of pdfs with about half the memory.
>  I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected java 6 compatibility.)
> I've included in attachment the java files of the new implementation:
>  * Suppliers
>  * Supplier
>  * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. No signature 
> changes only internal code changes. (just rename the class to 
> PDFMergerUtility if you decide to implemented the changes.)
>  In attachment you can also find some screenshots from visualvm showing the 
> memory usage of the original version and the refactored version as well as 
> some info produced by mat after analysing the heap.
> If you know of any other means, without running into memory issues, to merge 
> large sets of pdf files into a large single pdf I'd love to hear about it!
> I'd also suggest that there should be further improvements made in memory 
> usage in general as pdfbox seems to consumer a lot of memory in general.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to