[ 
https://issues.apache.org/jira/browse/PDFBOX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632850#comment-14632850
 ] 

Timo Boehme commented on PDFBOX-2883:
-------------------------------------

As proposed, I've added complete {{ScratchFile}} support with r1691833. The 
{{ScratchFile}} was enhanced to support unlimited main memory as well as an 
overall size restriction (a maximum for in-memory size plus scratch file 
size). Additionally, a {{ScratchFileBuffer}} can be pre-populated with data 
from an InputStream ({{ScratchFile.createBuffer(InputStream)}}).
The existing {{PDDocument.load}} methods work as before. I've duplicated all of 
those that have a {{useScratchFile}} parameter, replacing this parameter with 
{{MemoryUsageSetting}}. The new methods create a ScratchFile with the 
defined settings and use only that; even the InputStream load methods 
load the input stream into a {{ScratchFileBuffer}}.
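
To illustrate the behaviour described above, here is a minimal, self-contained 
sketch. It is not the code committed in r1691833: the class {{SpillBuffer}} and 
all of its methods are hypothetical, and it works on plain bytes instead of the 
pages the real ScratchFile uses. It assumes that -1 means "unlimited" for both 
limits. Data read from an InputStream stays in main memory up to a limit, the 
rest goes to a temporary file, and an overall size restriction is enforced.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

class SpillBufferDemo {
    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10000]);
        // At most 4096 bytes in main memory, no overall restriction.
        try (SpillBuffer buf = SpillBuffer.fromStream(in, 4096, -1)) {
            System.out.println("in memory: " + buf.bytesInMemory());
            System.out.println("in file:   " + buf.bytesInFile());
        }
    }
}

/** Hypothetical buffer: fills main memory first, then spills to a temp file. */
class SpillBuffer implements Closeable {
    private final long maxMainMemoryBytes; // -1 = unlimited main memory (assumption)
    private final long maxTotalBytes;      // -1 = no overall restriction (assumption)
    private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private File file;                     // created lazily on the first spill
    private RandomAccessFile raf;
    private long fileBytes;

    SpillBuffer(long maxMainMemoryBytes, long maxTotalBytes) {
        this.maxMainMemoryBytes = maxMainMemoryBytes;
        this.maxTotalBytes = maxTotalBytes;
    }

    /** Creates a buffer pre-populated with everything readable from the stream. */
    static SpillBuffer fromStream(InputStream in, long maxMain, long maxTotal)
            throws IOException {
        SpillBuffer buf = new SpillBuffer(maxMain, maxTotal);
        try {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf;
        } catch (IOException e) {
            buf.close(); // do not leak the temp file on failure
            throw e;
        }
    }

    void write(byte[] data, int off, int len) throws IOException {
        if (maxTotalBytes >= 0 && size() + len > maxTotalBytes) {
            throw new IOException("overall size restriction exceeded");
        }
        // Number of bytes of this chunk that still fit into main memory.
        long room = maxMainMemoryBytes < 0
                ? len
                : Math.max(0, maxMainMemoryBytes - memory.size());
        int inMem = (int) Math.min(len, room);
        memory.write(data, off, inMem);
        if (inMem < len) {
            if (raf == null) {
                file = File.createTempFile("scratch", ".tmp");
                file.deleteOnExit();
                raf = new RandomAccessFile(file, "rw");
            }
            raf.seek(fileBytes);
            raf.write(data, off + inMem, len - inMem);
            fileBytes += len - inMem;
        }
    }

    long size() { return memory.size() + fileBytes; }
    long bytesInMemory() { return memory.size(); }
    long bytesInFile() { return fileBytes; }

    /** Safe to call more than once. */
    @Override
    public void close() throws IOException {
        if (raf != null) { raf.close(); raf = null; }
        if (file != null) { file.delete(); file = null; }
    }
}
```

In this sketch a second {{close()}} call is a no-op, which is one simple way to 
make an accidental double close harmless.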

With this version it is possible to compare the old implementation (with or 
without scratch file) with the new one (especially the settings mainMemoryOnly, 
fileOnly and mixed mode).
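
The three modes named above can be pictured as one settings object instead of a 
boolean flag. The following is only a sketch under assumptions: 
{{BufferSettings}} and its methods are hypothetical names chosen for 
illustration and are not the {{MemoryUsageSetting}} class itself.

```java
class BufferSettingsDemo {
    public static void main(String[] args) {
        BufferSettings[] all = {
            BufferSettings.mainMemoryOnly(),
            BufferSettings.fileOnly(),
            BufferSettings.mixed(1024)
        };
        for (BufferSettings s : all) {
            // Would a 512 byte chunk go to main memory if 768 bytes are already used?
            System.out.println(s + " -> " + s.fitsInMainMemory(768, 512));
        }
    }
}

/** Hypothetical settings object describing where buffered data may be kept. */
final class BufferSettings {
    final boolean useMainMemory;
    final boolean useTempFile;
    final long maxMainMemoryBytes; // -1 = unlimited (assumption of this sketch)

    private BufferSettings(boolean useMainMemory, boolean useTempFile,
                           long maxMainMemoryBytes) {
        this.useMainMemory = useMainMemory;
        this.useTempFile = useTempFile;
        this.maxMainMemoryBytes = maxMainMemoryBytes;
    }

    /** Everything stays in main memory, without a limit. */
    static BufferSettings mainMemoryOnly() {
        return new BufferSettings(true, false, -1);
    }

    /** Everything goes to the temporary file. */
    static BufferSettings fileOnly() {
        return new BufferSettings(false, true, 0);
    }

    /** Main memory up to the given limit, temporary file beyond it. */
    static BufferSettings mixed(long maxMainMemoryBytes) {
        if (maxMainMemoryBytes < 0) {
            throw new IllegalArgumentException("limit must not be negative");
        }
        return new BufferSettings(true, true, maxMainMemoryBytes);
    }

    /** True if a chunk of the given size may still be placed in main memory. */
    boolean fitsInMainMemory(long usedMainMemoryBytes, long chunkBytes) {
        if (!useMainMemory) {
            return false;
        }
        return maxMainMemoryBytes < 0
                || usedMainMemoryBytes + chunkBytes <= maxMainMemoryBytes;
    }

    @Override
    public String toString() {
        return "BufferSettings[memory=" + useMainMemory + ", file=" + useTempFile
                + ", maxMainMemoryBytes=" + maxMainMemoryBytes + "]";
    }
}
```

Compared with a boolean, such an object leaves room for further limits without 
changing the load method signatures again.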

Since buffer handling has been document-specific up to now, the places where 
the ScratchFile object is closed are not optimal (e.g. we have a double close 
when the data is provided as an InputStream). Additionally, it could be 
valuable to have a single ScratchFile object that is used for parsing multiple 
documents. In that case we would have to ensure that the object is not closed 
by an individual document.
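
One possible way to handle a shared scratch object is an explicit ownership 
flag: a document closes the scratch object only if it created it. The sketch 
below shows that idea under assumptions; {{SharedScratch}} and 
{{ParsedDocument}} are hypothetical stand-ins, not PDFBox classes, and this is 
not a decided design.

```java
import java.io.Closeable;

class SharedScratchDemo {
    public static void main(String[] args) {
        SharedScratch shared = new SharedScratch();
        ParsedDocument first = ParsedDocument.withSharedScratch(shared);
        ParsedDocument second = ParsedDocument.withSharedScratch(shared);
        first.close();
        second.close();
        System.out.println("closed by documents: " + shared.isClosed());
        shared.close(); // the creator of the shared object closes it
        System.out.println("closed by creator:   " + shared.isClosed());
    }
}

/** Hypothetical stand-in for a scratch file holding buffered data. */
class SharedScratch implements Closeable {
    private boolean closed;

    boolean isClosed() {
        return closed;
    }

    /** Idempotent, so a double close does no harm. */
    @Override
    public void close() {
        closed = true;
    }
}

/** Hypothetical stand-in for a document that buffers its data in a scratch object. */
class ParsedDocument implements Closeable {
    private final SharedScratch scratch;
    private final boolean ownsScratch;

    private ParsedDocument(SharedScratch scratch, boolean ownsScratch) {
        this.scratch = scratch;
        this.ownsScratch = ownsScratch;
    }

    /** The document creates its own scratch object and closes it later. */
    static ParsedDocument withOwnScratch() {
        return new ParsedDocument(new SharedScratch(), true);
    }

    /** The caller keeps ownership; the document never closes the scratch object. */
    static ParsedDocument withSharedScratch(SharedScratch shared) {
        return new ParsedDocument(shared, false);
    }

    SharedScratch scratch() {
        return scratch;
    }

    @Override
    public void close() {
        if (ownsScratch) {
            scratch.close();
        }
    }
}
```

Reference counting would be an alternative, but the ownership flag keeps the 
responsibility for closing in exactly one place.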

> Unify memory handling
> ---------------------
>
>                 Key: PDFBOX-2883
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2883
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 2.0.0
>            Reporter: Timo Boehme
>            Assignee: Timo Boehme
>         Attachments: MemoryUsage.java
>
>
> PDFBox now has at least 2 different mechanisms to use main memory vs. keeping 
> large data in a temporary file: if an input stream is provided, the stream is 
> copied to a temporary file, and all PDF streams read are handled by 
> RandomAccessBuffer/ScratchFile.
> In PDFBOX-2882 I've done a re-implementation of ScratchFile which is quite 
> fast and allows setting a maximum amount of memory to be used for its pages 
> before it starts using the scratch file. This implementation could be used as 
> the general 'backend' for all buffered streams and even the file input stream 
> copy. As long as the PDF fits into the allowed maximum memory it should be 
> as fast as RandomAccessBuffer, while it allows for good control of memory 
> usage by going to the scratch file if needed. This prevents OOM in case of 
> large files.
> In order to use this, the PDDocument methods should be changed to not have a 
> 'useScratchFile' parameter but to take a MemoryHandling object which details 
> the buffering strategy (using ScratchFile; what amount of main memory can be 
> used, ...).
> I've opened this issue for discussion. Since we need API changes in 
> PDDocument, it should be done before the 2.0 release.



