[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435312#comment-16435312 ]
Maruan Sahyoun edited comment on PDFBOX-4188 at 4/12/18 10:31 AM:
------------------------------------------------------------------

with [^PDFMergerUtility.java-20180412.patch] these are the results:

{noformat}
INFORMATION: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 4,112; Pages/Second: 24,319; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1; Total Sources Size(K): 775; Merged File Size(K): 518; Ratio MaxStorageBytes/Merged File Size: 1
INFORMATION: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 3,481; Pages/Second: 57,455; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1; Total Sources Size(K): 1.551; Merged File Size(K): 1.038; Ratio MaxStorageBytes/Merged File Size: 0
INFORMATION: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 3,746; Pages/Second: 80,085; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 2; Total Sources Size(K): 2.327; Merged File Size(K): 1.558; Ratio MaxStorageBytes/Merged File Size: 1
INFORMATION: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 4,959; Pages/Second: 80,661; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 4; Total Sources Size(K): 3.103; Merged File Size(K): 2.078; Ratio MaxStorageBytes/Merged File Size: 1
INFORMATION: Summary: Pages: 1000, Time(s): 16,298, Pages/Second: 61,357
{noformat}

which I was able to run with the following settings

{noformat}
runMergeTest("pdf_sample_1-100pages", defaultMemory, 1 * MEG);
runMergeTest("pdf_sample_1-200pages", defaultMemory, 1 * MEG);
runMergeTest("pdf_sample_1-300pages", defaultMemory, 2 * MEG);
runMergeTest("pdf_sample_1-400pages", defaultMemory, 4 * MEG);
{noformat}

Of course this is a quick and dirty implementation/test to verify that closing early will bring the requirements down.

> "Maximum allowed scratch file memory exceeded."
Exception when merging large
> number of small PDFs
> --------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9, 3.0.0 PDFBox
>            Reporter: Gary Potagal
>            Priority: Major
>         Attachments: PDFBOX-4188-breakingTest.zip,
> PDFMergerUtility.java-20180412.patch
>
>
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>
> We wanted to address one more merge issue in
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files. We use mixed mode, memory
> and disk for cache. Initially, we would often get "Maximum allowed scratch
> file memory exceeded.", unless we turned off the check by passing "-1" to
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this
> is what the users who opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that
> instead of sharing a single cache, it breaks it up into equal-sized fixed
> partitions based on the number of input + output files being merged. This
> means that each partition must be big enough to hold the final output file.
> When 400 1-page files are merged, this creates 401 partitions, each of
> which needs to be big enough to hold the final 400 pages. Even worse, the
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page
> input files, and 1 x 400-page output file, or 800 pages.
> However, with the partitioned cache, we need to declare room for 401 x
> 400 pages, or 160,400 pages in total when specifying "maxStorageBytes". This
> would be a very high number, usually in gigabytes.
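
As a rough check of the figures in the description above, the page counts can be reproduced with a few lines of Java. This is only a sketch: the class and method names are made up for illustration and are not part of PDFBox, and the 4K page size is an assumption inferred from the "16 page (64K)" chunk figure in the same description. Note that 400 one-page inputs plus one 400-page output add up to 800 cached pages.

```java
/** Hypothetical helper reproducing the page arithmetic; not a PDFBox class. */
class PartitionedCacheEstimate {
    // Assumed page size: 64K / 16 pages = 4K per page.
    static final long PAGE_BYTES = 4096L;

    /** Pages that must be declared when every partition is sized for the full output. */
    static long partitionedPages(int inputFiles, int pagesPerFile) {
        long partitions = inputFiles + 1L;                    // all inputs plus one output
        long outputPages = (long) inputFiles * pagesPerFile;  // size of the merged document
        return partitions * outputPages;
    }

    /** Pages actually held near the end of the merge with all inputs still open. */
    static long actuallyCachedPages(int inputFiles, int pagesPerFile) {
        long inputPages = (long) inputFiles * pagesPerFile;
        return inputPages + inputPages;                       // open inputs + merged output
    }

    /** Converts a page count to whole megabytes (rounded down). */
    static long toMegabytes(long pages) {
        return pages * PAGE_BYTES / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long declared = partitionedPages(400, 1);
        long cached = actuallyCachedPages(400, 1);
        System.out.println(declared + " pages, ~" + toMegabytes(declared) + " MB"); // 160400 pages, ~626 MB
        System.out.println(cached + " pages, ~" + toMegabytes(cached) + " MB");     // 800 pages, ~3 MB
    }
}
```

Under these assumptions the partitioned model asks for roughly 200 times more storage than is ever in use, and the gap grows with the square of the number of input files.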
>
> Given the current limitation that we need to keep all the input files open
> until the output file is written (HUGE), we came up with 2 options. (See
> PDFBOX-4182)
>
> 1. Good: Split the cache in ½, give ½ to the output file, and segment the
> other ½ across the input files. (Still keeping them open until the end.)
> 2. Better: Dynamically allocate in 16-page (64K) chunks from memory or disk
> on demand, and release cache as documents are closed after merge. This is our
> current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are
> addressed.
>
> We would like to submit our current implementation as a patch to 2.0.10 and
> 3.0.0, unless this is already addressed.
>
> Thank you



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
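
To illustrate why option 2 (on-demand chunks from a shared pool, released when a document is closed) needs so much less storage, here is a minimal simulation. It is a sketch under stated assumptions, not the submitted patch and not PDFBox code: `SharedChunkPool`, `reserve`, `release` and `simulateMerge` are hypothetical names, and each one-page input is assumed to occupy one whole 16-page chunk while it is open.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical shared pool handing out 16-page chunks on demand; not a PDFBox class. */
class SharedChunkPool {
    static final int PAGES_PER_CHUNK = 16;   // 16 pages per 64K chunk, as in the proposal

    private final int maxChunks;
    private final Map<String, Integer> chunksByOwner = new HashMap<>();
    private int usedChunks;
    private int peakChunks;

    SharedChunkPool(int maxChunks) {
        this.maxChunks = maxChunks;
    }

    /** Grows the owner's allocation lazily so that it can hold the given page count. */
    void reserve(String owner, int pages) {
        int needed = (pages + PAGES_PER_CHUNK - 1) / PAGES_PER_CHUNK; // round up
        int held = chunksByOwner.getOrDefault(owner, 0);
        if (needed <= held) {
            return;
        }
        int extra = needed - held;
        if (usedChunks + extra > maxChunks) {
            throw new IllegalStateException("Shared cache exhausted");
        }
        usedChunks += extra;
        peakChunks = Math.max(peakChunks, usedChunks);
        chunksByOwner.put(owner, needed);
    }

    /** Returns all chunks of a closed document to the pool. */
    void release(String owner) {
        Integer held = chunksByOwner.remove(owner);
        if (held != null) {
            usedChunks -= held;
        }
    }

    int peakChunks() {
        return peakChunks;
    }

    /** Simulates merging one-page inputs and reports the peak number of chunks in use. */
    static int simulateMerge(int inputs, boolean closeEarly, int maxChunks) {
        SharedChunkPool pool = new SharedChunkPool(maxChunks);
        for (int i = 0; i < inputs; i++) {
            String source = "in-" + i;
            pool.reserve(source, 1);       // open the one-page input
            pool.reserve("out", i + 1);    // the output grows by one page
            if (closeEarly) {
                pool.release(source);      // close the input right after it is appended
            }
        }
        return pool.peakChunks();
    }

    public static void main(String[] args) {
        System.out.println(simulateMerge(400, false, 1000)); // 425: all inputs stay open
        System.out.println(simulateMerge(400, true, 1000));  // 26: inputs are closed early
    }
}
```

In this toy model, keeping all 400 inputs open peaks at 425 chunks, while closing each input as soon as it has been appended peaks at 26. That is the effect the quick test in the comment above was meant to verify, although the real savings depend on when PDFBox can actually close a source document.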