[
https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438699#comment-16438699
]
Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------
[~gary.potagal] I've taken a quick look at the patch and would like to discuss
some topics:
- PDFMergerUtility was using {{MemoryUsageSetting getPartitionedCopy}}, whereas
now the setting is passed on to each PDDocument and is no longer partitioned.
So although the value used for {{MemoryUsageSetting}} is much lower now, isn't
the end result the same?
- I haven't understood the main benefit of the changes done to
{{MemoryUsageSetting}} and {{ScratchFile}}. What is the reason for these?
- I think the patch should be divided into two parts - the changes to
{{MemoryUsageSetting}} / {{ScratchFile}} and the changes to PDFMerger with test
cases to show the improvements for each.
- Do you see a benefit in using {{MappedByteBuffer}}?
- the handling of openAction doesn't belong in this patch. It should be part
of a new issue.
- the code doesn't follow the coding conventions
https://pdfbox.apache.org/codingconventions.html so there is some effort to
bring it in line with these. (I think that this section might be difficult to
find on our website - any suggestions to make it easier to find the information
are highly appreciated)
Many of the questions are because this part of PDFBox is something I rarely
touch - so I hope you're a little patient with me.
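To make the first question above concrete, here is a minimal sketch in plain
Java (no PDFBox classes). It assumes that a partitioned copy divides the
configured limit evenly by the number of documents, and that the alternative
passes the same lower limit to every document without a shared pool. The class
and method names are hypothetical and only model the arithmetic.

```java
// Hypothetical model, not PDFBox code: compares the worst-case total scratch
// usage of a partitioned limit with a lower limit passed to every document.
final class ScratchLimitModel {

    // Assumption: partitioning divides the overall limit evenly per document.
    static long partitionedLimitPerDocument(long overallLimitBytes, int documents) {
        return overallLimitBytes / documents;
    }

    // Worst case when every document may use its own limit in full.
    static long worstCaseTotal(long perDocumentLimitBytes, int documents) {
        return perDocumentLimitBytes * documents;
    }

    public static void main(String[] args) {
        int documents = 401;                    // 400 sources + 1 destination
        long overall = 401L * 1024 * 1024;      // illustrative overall limit

        // Partitioned: one overall limit, split per document.
        long partitioned = partitionedLimitPerDocument(overall, documents);
        // Unpartitioned: a lower limit chosen up front, given to each document.
        long lowerLimit = 1024L * 1024;

        System.out.println("partitioned per-document limit: " + partitioned);
        System.out.println("partitioned total bound:        " + worstCaseTotal(partitioned, documents));
        System.out.println("unpartitioned total bound:      " + worstCaseTotal(lowerLimit, documents));
    }
}
```

Under these assumptions both variants bound the total at the same value, so
any gain would have to come from sharing unused capacity between documents.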
> "Maximum allowed scratch file memory exceeded." Exception when merging large
> number of small PDFs
> --------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Gary Potagal
> Priority: Major
> Attachments: PDFBOX-4188-MemoryManagerPatch.zip,
> PDFBOX-4188-breakingTest.zip, PDFMergerUtility.java-20180412.patch
>
>
>
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>
> We wanted to address one more merge issue in
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files. We use mixed mode, memory
> and disk for cache. Initially, we would often get "Maximum allowed scratch
> file memory exceeded.", unless we turned off the check by passing "-1" to
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this
> is what the users who opened PDFBOX-3721 were running into.
> Our research indicates that the core issue with the memory model is that
> instead of sharing a single cache, it breaks it up into equal-sized fixed
> partitions based on the number of input + output files being merged. This
> means that each partition must be big enough to hold the final output file.
> When 400 1-page files are merged, this creates 401 partitions, each of
> which needs to be big enough to hold the final 400 pages. Even worse, the
> merge algorithm needs to keep all files open until the end.
> Given this, near the end of the merge, we're actually caching 400 x 1-page
> input files, and 1 x 400-page output file, or 800 pages.
> However, with the partitioned cache, we need to declare room for 401 x
> 400-pages, or 160,400 pages in total when specifying "maxStorageBytes". This
> would be a very high number, usually in the gigabytes.
>
> Given the current limitation that we need to keep all the input files open
> until the output file is written (HUGE), we came up with 2 options. (See
> PDFBOX-4182)
>
> 1. Good: Split the cache in ½, give ½ to the output file, and segment the
> other ½ across the input files. (Still keeping them open until the end).
> 2. Better: Dynamically allocate in 16-page (64K) chunks from memory or disk
> on demand, and release cache as documents are closed after the merge. This is
> our current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are
> addressed.
>
> We would like to submit our current implementation as a Patch to 2.0.10 and
> 3.0.0, unless this is already addressed.
>
> Thank you
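The page counts in the description above can be checked with a short sketch.
It is plain Java and does not use PDFBox; the class and method names are
hypothetical, and it assumes all input files have the same number of pages.
It compares the pages actually cached near the end of the merge with the
budget that equal fixed partitions require, and with the budget for option 1
(half for the output, half segmented across the inputs).

```java
// Hypothetical helper, not part of PDFBox: page budgets for merging
// N equally sized input files into one output file.
final class MergeCacheBudget {

    // Pages cached near the end of the merge: all inputs plus the output.
    static long pagesActuallyCached(int inputs, long pagesPerInput) {
        long outputPages = inputs * pagesPerInput;
        return inputs * pagesPerInput + outputPages;
    }

    // Equal fixed partitions: (inputs + 1) partitions, each sized for the output.
    static long equalPartitionBudget(int inputs, long pagesPerInput) {
        long outputPages = inputs * pagesPerInput;
        return (inputs + 1L) * outputPages;
    }

    // Option 1: half for the output, the other half segmented across the inputs.
    // The halves are equal, so each must cover the larger of the two demands.
    static long halfSplitBudget(int inputs, long pagesPerInput) {
        long outputPages = inputs * pagesPerInput;
        long inputPages = inputs * pagesPerInput;
        return 2 * Math.max(outputPages, inputPages);
    }

    public static void main(String[] args) {
        int inputs = 400;
        long pagesPerInput = 1;
        System.out.println("actually cached:  " + pagesActuallyCached(inputs, pagesPerInput));
        System.out.println("equal partitions: " + equalPartitionBudget(inputs, pagesPerInput));
        System.out.println("half split:       " + halfSplitBudget(inputs, pagesPerInput));
    }
}
```

For 400 one-page inputs the equal-partition budget comes to 160,400 pages,
while the half split needs no more than the pages that are actually cached.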