[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Potagal updated PDFBOX-4188:
---------------------------------
    Description: 
 

On 06.04.2018 at 23:10, Gary Potagal wrote:

 

We wanted to address one more merge issue in 
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).

We need to merge a large number of small files.  We use mixed mode (memory and 
disk) for the cache.  Initially, we would often get "Maximum allowed scratch file 
memory exceeded." unless we turned off the check by passing "-1" to 
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this is 
what the users who opened

https://issues.apache.org/jira/browse/PDFBOX-3721 

were running into.
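
For context, a minimal sketch of how the merge is invoked with a mixed-mode MemoryUsageSetting; the byte limits shown are illustrative only, not the values from our application:

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

final class MergeInvocationSketch
{
    static void merge(List<File> inputFiles) throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (File source : inputFiles)
        {
            merger.addSource(source);
        }
        merger.setDestinationFileName("merged.pdf");

        // Mixed mode: e.g. up to 64 MB of main memory plus up to 2 GB of scratch file.
        merger.mergeDocuments(MemoryUsageSetting.setupMixed(64L * 1024 * 1024,
                                                            2L * 1024 * 1024 * 1024));

        // Passing -1 as maxStorageBytes is how we turned the check off, so the
        // "Maximum allowed scratch file memory exceeded." exception is never thrown:
        // merger.mergeDocuments(MemoryUsageSetting.setupMixed(64L * 1024 * 1024, -1));
    }
}
{code}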

Our research indicates that the core issue with the memory model is that, 
instead of sharing a single cache, it breaks the cache up into equal-sized fixed 
partitions based on the number of input + output files being merged.  This 
means that each partition must be big enough to hold the final output file.  
When 400 1-page files are merged, this creates 401 partitions, each of 
which needs to be big enough to hold the final 400 pages.  Even worse, the 
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge, we're actually caching 400 x 1-page 
input files and 1 x 400-page output file, or 800 pages.

However, with the partitioned cache, we need to declare room for 401 x 
400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
is a very high number, usually in the gigabytes.
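
To make the gap concrete, here is a back-of-the-envelope comparison; the 4 KB of scratch space per cached page is only an assumed illustration, not a measured PDFBox figure:

{code:java}
public class PartitionSizingSketch
{
    public static void main(String[] args)
    {
        long pageBytes  = 4 * 1024L;                    // assumed scratch space per cached page
        int  inputs     = 400;                          // 400 one-page input files
        int  partitions = inputs + 1;                   // one partition per input plus the output

        long pagesActuallyCached = inputs + inputs;             // 400 input pages + 400 output pages = 800
        long pagesToBudgetFor    = (long) partitions * inputs;  // 401 * 400 = 160,400

        System.out.printf("actual need: ~%d MB, partitioned budget: ~%d MB%n",
                pagesActuallyCached * pageBytes / (1024 * 1024),   // ~3 MB
                pagesToBudgetFor * pageBytes / (1024 * 1024));     // ~626 MB
    }
}
{code}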

 

Given the current limitation that we need to keep all the input files open 
until the output file is written (HUGE), we came up with two options.  See 
https://issues.apache.org/jira/browse/PDFBOX-4182.

 

1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
other ½ across the input files. (Still keeping them open until the end.)

2.  Better: Dynamically allocate in 16-page (64 KB) chunks from memory or disk on 
demand, and release the cache as documents are closed after the merge (see the 
sketch below).  This is our current implementation until PDFBOX-3999 is addressed.
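
A minimal sketch of the idea behind option 2; the class and method names (SharedScratchPool, acquireChunk, releaseChunk) are purely illustrative and are neither the submitted patch nor PDFBox API:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: one pool shared by all documents in the merge, handing out
// fixed 16-page (64 KB) chunks on demand and reusing them as soon as a document
// is closed, instead of pre-partitioning the cache per file.
final class SharedScratchPool
{
    static final int PAGES_PER_CHUNK = 16;
    static final int CHUNK_BYTES = 64 * 1024;

    private final Deque<byte[]> freeChunks = new ArrayDeque<byte[]>();
    private final long maxBytes;
    private long allocatedBytes;

    SharedScratchPool(long maxBytes)
    {
        this.maxBytes = maxBytes;
    }

    synchronized byte[] acquireChunk()
    {
        if (!freeChunks.isEmpty())
        {
            return freeChunks.pop();          // reuse a chunk released by a closed document
        }
        if (allocatedBytes + CHUNK_BYTES > maxBytes)
        {
            // a real implementation would spill to a temp file here instead of failing
            throw new IllegalStateException("Maximum allowed scratch file memory exceeded.");
        }
        allocatedBytes += CHUNK_BYTES;
        return new byte[CHUNK_BYTES];
    }

    synchronized void releaseChunk(byte[] chunk)
    {
        freeChunks.push(chunk);               // immediately available to any other document
    }
}
{code}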

 

We would like to submit our current implementation as a Patch to 2.0.10 and 
3.0.0, unless this is already addressed.

 

 Thank you

  was:
I have been running some tests trying to merge a large number (2,618) of small 
PDF documents, between 100 KB and 130 KB each, into a single large PDF (288,433 KB).

Memory consumption seems to be the main limitation.

ScratchFileBuffer seems to consume the majority of the memory usage.

(See the MAT screenshot in the attachments.)

(I would attach the hprof so you can analyze it yourselves, but it's rather large.)

Note that it seems impossible to generate a large PDF using a small memory 
footprint.

I personally thought that using MemoryUsageSetting with temporary file only would 
allow me to generate arbitrarily large PDF files, but it doesn't seem to help.

I've run mergeDocuments with these memory settings (passed in as sketched below):
 * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L)
 * MemoryUsageSetting.setupTempFileOnly()
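
For clarity, a sketch of how either setting is handed to the merge; the merger setup itself is just generic PDFMergerUtility usage, not the exact test code:

{code:java}
PDFMergerUtility merger = new PDFMergerUtility();
// ... addSource(...) calls and destination setup as usual ...

MemoryUsageSetting mixed = MemoryUsageSetting.setupMixed(
        1024L * 1024L,                                  // 1 MB of main-memory cache
        1024L * 1024L * 1024L * 1024L * 1024L);         // ~1 PB scratch limit, effectively unlimited
MemoryUsageSetting tempFileOnly = MemoryUsageSetting.setupTempFileOnly();

merger.mergeDocuments(tempFileOnly);                    // or merger.mergeDocuments(mixed);
{code}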

Refactored version completes with a *4GB* heap:

with temp file only, it completes 2,618 documents in 1.760 min

*VS*

the original version with an *8GB* heap:

with temp file only, it completes 2,618 documents in 2.0 min

With the original version, heaps of 6GB or less result in OOM. (I didn't try 
sizes between 6GB and 8GB.)

It looks like the loop in mergeDocuments accumulates PDDocument objects in 
a list, which are only closed after the merge is completed.

Refactoring the code to close these as they are used, instead of accumulating 
them and closing them all at the end, improves memory usage considerably 
(although the problem doesn't seem to be eliminated completely, based on MAT analysis).

Another change I've implemented is to only create the InputStream when the file 
needs to be read, and to close it alongside the PDDocument.

(Some InputStreams contain buffers, and depending on the size of the buffers 
and/or the stream type, accumulating all the streams is a potential memory hog.)
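
A minimal, Java 6-compatible sketch of the two changes combined (open each source only when needed, close it as soon as it has been appended); this is not the attached PDFMergerUtilityUsingSupplier, just an illustration of the idea:

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;

// Illustrative only: append each source and close it right away instead of
// keeping every PDDocument (and its stream) alive until the end of the merge.
final class EagerCloseMergeSketch
{
    static void merge(List<File> sources, File target, MemoryUsageSetting memSetting)
            throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        PDDocument destination = new PDDocument(memSetting);
        try
        {
            for (File sourceFile : sources)
            {
                // The source is opened only when it is actually needed...
                PDDocument source = PDDocument.load(sourceFile, memSetting);
                try
                {
                    merger.appendDocument(destination, source);
                }
                finally
                {
                    // ...and closed (with its underlying stream) right after its
                    // pages have been appended to the destination.
                    source.close();
                }
            }
            destination.save(target);
        }
        finally
        {
            destination.close();
        }
    }
}
{code}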

These changes seem to be a clear improvement, in the sense that I can process 
the same number of PDFs with about half the memory.

 I'd appreciate it if you could roll these changes into the main codebase.

(I've respected Java 6 compatibility.)

I've included in attachment the java files of the new implementation:
 * Suppliers
 * Supplier
 * PDFMergerUtilityUsingSupplier

PDFMergerUtilityUsingSupplier can replace the previous version. There are no signature 
changes, only internal code changes. (Just rename the class to PDFMergerUtility 
if you decide to implement the changes.)

In the attachments you can also find some screenshots from VisualVM showing the 
memory usage of the original version and the refactored version, as well as some 
info produced by MAT after analysing the heap.

If you know of any other means of merging large sets of PDF files into a single 
large PDF without running into memory issues, I'd love to hear about it!

I'd also suggest making further improvements to memory usage overall, as PDFBox 
seems to consume a lot of memory in general.


>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9, 3.0.0 PDFBox
>            Reporter: Gary Potagal
>            Priority: Major
>


