[ 
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435230#comment-16435230
 ] 

Maruan Sahyoun edited comment on PDFBOX-4188 at 4/12/18 9:43 AM:
-----------------------------------------------------------------

These are the results on my machine:
{quote}
INFORMATION: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 3,125; Pages/Second: 32,000; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 74; Total Sources Size(K): 775; Merged File Size(K): 522; Ratio MaxStorageBytes/Merged File Size: 145
INFORMATION: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 3,028; Pages/Second: 66,050; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 315; Total Sources Size(K): 1.551; Merged File Size(K): 1.042; Ratio MaxStorageBytes/Merged File Size: 309
INFORMATION: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 4,081; Pages/Second: 73,511; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 710; Total Sources Size(K): 2.327; Merged File Size(K): 1.562; Ratio MaxStorageBytes/Merged File Size: 465
INFORMATION: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 5,516; Pages/Second: 72,516; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1.240; Total Sources Size(K): 3.103; Merged File Size(K): 2.082; Ratio MaxStorageBytes/Merged File Size: 609
INFORMATION: Summary: Pages: 1000, Time(s): 15,750, Pages/Second: 63,492
{quote}

On my machine the tests fail with the following settings, each of which is slightly below the MaxStorageBytes value reported above:

{quote}
runMergeTest("pdf_sample_1-100pages", defaultMemory, 70 * MEG);
runMergeTest("pdf_sample_1-200pages", defaultMemory, 310 * MEG);
runMergeTest("pdf_sample_1-300pages", defaultMemory, 700 * MEG);
runMergeTest("pdf_sample_1-400pages", defaultMemory, 1200 * MEG);
{quote}



>  "Maximum allowed scratch file memory exceeded." Exception when merging large 
> number of small PDFs
> --------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9, 3.0.0 PDFBox
>            Reporter: Gary Potagal
>            Priority: Major
>         Attachments: PDFBOX-4188-breakingTest.zip
>
>
>  
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>  
> We wanted to address one more merge issue in 
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files.  We use mixed mode, with both 
> memory and disk as cache.  Initially, we would often get "Maximum allowed 
> scratch file memory exceeded." unless we turned off the check by passing "-1" 
> to org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting.  I believe this 
> is what the users who opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that, 
> instead of sharing a single cache, it breaks the cache up into equal-sized 
> fixed partitions based on the number of input + output files being merged.  
> This means that each partition must be big enough to hold the final output 
> file.  When 400 1-page files are merged, this creates 401 partitions, each of 
> which needs to be big enough to hold the final 400 pages.  Even worse, the 
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page 
> input files and 1 x 400-page output file, or 800 pages.
> However, with the partitioned cache, we need to declare room for 401 x 
> 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes".  This 
> would be a very high number, usually in the gigabytes.
>  
> Given the current limitation that we need to keep all the input files open 
> until the output file is written (HUGE), we came up with 2 options.  (See 
> PDFBOX-4182)  
>  
> 1.  Good: Split the cache in ½, give ½ to the output file, and segment the 
> other ½ across the input files. (Still keeping them open until the end.)
> 2.  Better: Dynamically allocate in 16-page (64K) chunks from memory or disk 
> on demand, and release cache as documents are closed after the merge.  This 
> is our current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 
> are addressed.
>  
> We would like to submit our current implementation as a Patch to 2.0.10 and 
> 3.0.0, unless this is already addressed.
>  
>  Thank you
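
The sizing argument in the issue description can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the description (400 one-page inputs, one merged output, one partition per file); the class and method names are hypothetical and are not part of PDFBox.

```java
/** Worked arithmetic for the partitioned-cache sizing described in the issue (hypothetical helper). */
final class PartitionedCacheMath {

    /** Pages actually held near the end of the merge: all inputs plus the merged output. */
    static long pagesActuallyCached(int inputFiles, int pagesPerInput) {
        long inputPages = (long) inputFiles * pagesPerInput;
        long outputPages = inputPages; // the merged output contains every input page
        return inputPages + outputPages;
    }

    /** Pages that must be declared if every partition has to fit the whole output. */
    static long pagesDeclared(int inputFiles, int pagesPerInput) {
        long partitions = inputFiles + 1L; // one partition per input plus one for the output
        long outputPages = (long) inputFiles * pagesPerInput;
        return partitions * outputPages;
    }

    public static void main(String[] args) {
        System.out.println("actually cached: " + pagesActuallyCached(400, 1)); // 800
        System.out.println("declared:        " + pagesDeclared(400, 1));       // 160400
    }
}
```

With these inputs the declared capacity is roughly 200 times the number of pages that are actually cached at the peak, which is the gap the two proposed options try to close.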

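Option 2 from the description (allocate fixed-size chunks on demand from one shared budget and give them back when a document is closed) could look roughly like the sketch below. This is an illustration only, not the submitted patch and not PDFBox's scratch-file API: `SharedChunkPool`, `DocumentBuffer` and their methods are invented names, the sketch merely counts chunks instead of storing bytes, and `IllegalStateException` stands in for whatever exception a real implementation would throw.

```java
/** Hypothetical shared pool of 64K chunks; illustrates on-demand allocation and release. */
final class SharedChunkPool {
    static final int CHUNK_BYTES = 64 * 1024; // chunk size taken from the proposal

    private final long maxChunks; // shared budget for all documents together
    private long chunksInUse;

    SharedChunkPool(long maxStorageBytes) {
        this.maxChunks = maxStorageBytes / CHUNK_BYTES;
    }

    /** Per-document buffer that borrows chunks from the shared pool as it grows. */
    final class DocumentBuffer implements AutoCloseable {
        private long chunksHeld;
        private long bytesWritten;

        /** Records a write and borrows additional chunks only if the data no longer fits. */
        void write(long bytes) {
            long needed = (bytesWritten + bytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
            long extra = needed - chunksHeld;
            if (chunksInUse + extra > maxChunks) {
                throw new IllegalStateException("Maximum allowed scratch file memory exceeded.");
            }
            chunksInUse += extra;
            chunksHeld = needed;
            bytesWritten += bytes;
        }

        /** Returns every borrowed chunk to the pool so other documents can reuse it. */
        @Override
        public void close() {
            chunksInUse -= chunksHeld;
            chunksHeld = 0;
            bytesWritten = 0;
        }
    }

    DocumentBuffer open() {
        return new DocumentBuffer();
    }

    long chunksInUse() {
        return chunksInUse;
    }
}
```

Because the budget is shared, a small input document only ever occupies the chunks it has written to, and closing it immediately frees that space for the growing output document.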


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
