Hi,

Please also have a look at the comments in https://issues.apache.org/jira/browse/PDFBOX-4182 .

Please submit your patch proposal there or in a new issue; it should be against the trunk. Note that this doesn't mean your patch will be accepted, it just means I'd like to see it, because I haven't fully understood your post and many attachment types don't get through here.

A breaking test would be interesting: is it possible to use (or better, create) 400 identical small PDFs, merge them, and see whether it breaks?
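A minimal sketch of such a test might look like this (assuming PDFBox 2.0.x on the classpath; the directory, file names, and memory limits are made up for illustration, and whether the small scratch limit actually trips the check is exactly what the test would find out):

```java
import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class MergeStressTest
{
    public static void main(String[] args) throws Exception
    {
        File dir = new File("target/merge-test");
        dir.mkdirs();

        // create 400 identical one-page PDFs
        for (int i = 0; i < 400; i++)
        {
            try (PDDocument doc = new PDDocument())
            {
                doc.addPage(new PDPage());
                doc.save(new File(dir, "src-" + i + ".pdf"));
            }
        }

        PDFMergerUtility merger = new PDFMergerUtility();
        for (int i = 0; i < 400; i++)
        {
            merger.addSource(new File(dir, "src-" + i + ".pdf"));
        }
        merger.setDestinationFileName(new File(dir, "merged.pdf").getPath());

        try
        {
            // deliberately small limits: 16 MB main memory, 64 MB scratch file
            merger.mergeDocuments(MemoryUsageSetting.setupMixed(16 * 1024 * 1024L,
                    64 * 1024 * 1024L));
            try (PDDocument merged = PDDocument.load(new File(dir, "merged.pdf")))
            {
                System.out.println("merge succeeded, pages: " + merged.getNumberOfPages());
            }
        }
        catch (Exception e)
        {
            // if the partitioned cache is too small per partition, we expect
            // something like "Maximum allowed scratch file memory exceeded."
            System.out.println("merge broke: " + e.getMessage());
        }
    }
}
```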

Tilman

On 06.04.2018 at 23:10, Gary Potagal wrote:
Hello,

Thank you again for a great library.

We wanted to address one more merge issue in
org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
We need to merge a large number of small files, using mixed mode (memory plus disk) for the cache.
Initially we would often get "Maximum allowed scratch file memory exceeded." unless we
turned off the check by passing "-1" to
org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this is what the users who
opened https://issues.apache.org/jira/browse/PDFBOX-3721 were running into.
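For reference, the two configurations in question might look like this (a sketch assuming PDFBox 2.0.x; the 16 MB / 1 GB figures are illustrative, not recommendations):

```java
import org.apache.pdfbox.io.MemoryUsageSetting;

public class MemorySettings
{
    public static void main(String[] args)
    {
        // mixed mode with a cap: up to 16 MB in main memory,
        // at most 1 GB of scratch-file storage overall
        MemoryUsageSetting capped =
                MemoryUsageSetting.setupMixed(16 * 1024 * 1024L, 1024L * 1024 * 1024);

        // the workaround described above: -1 disables the storage check,
        // so "Maximum allowed scratch file memory exceeded." can no longer occur
        MemoryUsageSetting unlimited =
                MemoryUsageSetting.setupMixed(16 * 1024 * 1024L, -1);

        System.out.println("capped storage limited: " + capped.isStorageRestricted());
        System.out.println("unlimited storage limited: " + unlimited.isStorageRestricted());
    }
}
```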

Our research indicates that the core issue with the memory model is that
instead of sharing a single cache, it splits the cache into equal-sized fixed
partitions based on the number of input plus output files being merged.  This
means that each partition must be big enough to hold the final output file.
When 400 one-page files are merged, this creates 401 partitions, each of
which must be big enough to hold the final 400 pages.  Even worse, the
merge algorithm needs to keep all files open until the end.

Given this, near the end of the merge we are actually caching 400 one-page
input files plus 1 400-page output file, or 801 pages in total.
With the partitioned cache, however, we need to declare room for 401 x 400 pages, or
160,400 pages in total, when specifying "maxStorageBytes".  This is a very
high number, usually in the gigabytes.
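The arithmetic can be checked in a few lines of plain Java (no PDFBox needed; the page counts are taken from the 400-file scenario above):

```java
public class PartitionMath
{
    public static void main(String[] args)
    {
        int inputFiles = 400;          // one page each
        int outputPages = inputFiles;  // pages in the merged result

        // what is actually alive near the end of the merge:
        // all one-page inputs plus the full output
        int pagesActuallyNeeded = inputFiles + outputPages;

        // what the partitioned cache forces us to declare:
        // (inputs + output) partitions, each sized for the full output
        int partitions = inputFiles + 1;
        int pagesToDeclare = partitions * outputPages;

        System.out.println("pages actually needed: " + pagesActuallyNeeded);
        System.out.println("pages to declare:      " + pagesToDeclare);
        System.out.println("overhead factor:       " + pagesToDeclare / pagesActuallyNeeded);
    }
}
```

This prints 801 pages actually needed against 160400 pages declared, i.e. roughly a 200x over-declaration.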

Given the current limitation that we need to keep all the input files open
until the output file is written (HUGE), we came up with two options.


   1.  Good: Split the cache in half: give half to the output file and segment the
other half across the input files (still keeping them open until the end).
   2.  Better: Dynamically allocate 16-page (64 KB) chunks from memory or disk
on demand, and release the cache as documents are closed after the merge.  This is our
current implementation until PDFBOX-3999 is addressed.
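Option 2 can be sketched with a toy allocator (hypothetical class and method names; a real implementation would sit behind PDFBox's scratch-file machinery rather than hold raw byte arrays):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch of option 2: hand out fixed 16-page (64 KB) chunks on demand
// from one shared pool, and recycle them when a document is closed,
// instead of reserving a fixed partition per document up front.
public class ChunkPool
{
    static final int CHUNK_SIZE = 16 * 4096; // 16 pages of 4 KB each

    private final Deque<byte[]> free = new ArrayDeque<>();
    private int allocated = 0;

    // allocate a chunk: reuse a freed one if possible, else create a new one
    public synchronized byte[] acquire()
    {
        if (!free.isEmpty())
        {
            return free.pop();
        }
        allocated++;
        return new byte[CHUNK_SIZE];
    }

    // called when a merged source document is closed: its chunks return to
    // the shared pool instead of staying reserved for that document
    public synchronized void release(byte[] chunk)
    {
        free.push(chunk);
    }

    public synchronized int chunksEverAllocated()
    {
        return allocated;
    }

    public static void main(String[] args)
    {
        ChunkPool pool = new ChunkPool();
        byte[] a = pool.acquire();
        byte[] b = pool.acquire();
        pool.release(a);           // document closed, chunk recycled
        byte[] c = pool.acquire(); // reuses the freed chunk, no new allocation
        System.out.println("chunks ever allocated: " + pool.chunksEverAllocated());
    }
}
```

The point of the sketch is that total allocation tracks peak concurrent demand, not the number of open documents times the worst-case output size.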

We would like to submit our current implementation as a patch against 2.0.10, unless
this is already addressed.

Thank you



