[jira] [Commented] (PDFBOX-2883) Unify memory handling

Timo Boehme (JIRA) Thu, 16 Jul 2015 01:04:35 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629400#comment-14629400
 ]


Timo Boehme commented on PDFBOX-2883:
-------------------------------------

The unified implementation I envisioned would let you configure it as you 
want. It will not differ much from the ScratchFile from PDFBOX-2882. If the 
amount of main memory you allow it to use is large enough (or you set a flag 
to use main memory only), it will not create or use a file at all. Its 
implementation of the RandomAccess interface is at least as fast as that of 
RandomAccessBuffer, so there is no need to use the latter anymore. 
Additionally, it seems that RandomAccessBufferedFileInputStream is no longer 
needed either, since all usages only require a RandomAccessRead (as far as I 
can see).
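
To make the configuration idea concrete, here is a minimal sketch of what such 
a settings object could look like. It is not existing PDFBox code: the class 
name MemoryHandlingSketch and the methods mainMemoryOnly(), mixed() and 
needsScratchFile() are hypothetical names chosen for illustration only.

```java
// Hypothetical sketch, not the PDFBox API: all names are illustrative.
final class MemoryHandlingSketch {
    private final boolean mainMemoryOnly;
    private final long maxMainMemoryBytes;

    private MemoryHandlingSketch(boolean mainMemoryOnly, long maxMainMemoryBytes) {
        this.mainMemoryOnly = mainMemoryOnly;
        this.maxMainMemoryBytes = maxMainMemoryBytes;
    }

    /** Keep everything in main memory; never create a scratch file. */
    static MemoryHandlingSketch mainMemoryOnly() {
        return new MemoryHandlingSketch(true, Long.MAX_VALUE);
    }

    /** Use up to maxMainMemoryBytes of main memory, then go to a scratch file. */
    static MemoryHandlingSketch mixed(long maxMainMemoryBytes) {
        if (maxMainMemoryBytes < 0) {
            throw new IllegalArgumentException("limit must not be negative");
        }
        return new MemoryHandlingSketch(false, maxMainMemoryBytes);
    }

    /** Returns true if buffering totalBytes would require a scratch file. */
    boolean needsScratchFile(long totalBytes) {
        return !mainMemoryOnly && totalBytes > maxMainMemoryBytes;
    }
}
```

With a generous limit, or with mainMemoryOnly(), needsScratchFile() never 
returns true, which matches the case where no file is created at all.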

If you are on a production machine that runs several applications, you 
cannot set the usable heap to an arbitrarily large value. You have to ensure 
that the application can run within a restricted amount of memory. 
Furthermore, if the system has to use its swap file, the whole system starts 
to slow down, which is not acceptable, whereas it is OK for the application 
to take more time with a particular file. Additionally, the paging we do 
might even produce files larger than the system's virtual memory, so we can 
handle even those files.
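
As a rough illustration of the paging idea (not the actual ScratchFile code 
from PDFBOX-2882), the following sketch keeps pages in main memory up to a 
configured limit and writes any further pages to a temporary file. The class 
name PagedScratchSketch, the 4096-byte page size and all method names are 
assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of paging with a main-memory limit; names are illustrative.
final class PagedScratchSketch implements AutoCloseable {
    static final int PAGE_SIZE = 4096; // assumed page size

    private final int maxPagesInMemory;
    private final List<byte[]> memoryPages = new ArrayList<>();
    private File file;            // created lazily, only once memory is full
    private RandomAccessFile raf;
    private int filePages;

    PagedScratchSketch(int maxPagesInMemory) {
        this.maxPagesInMemory = maxPagesInMemory;
    }

    /** Stores one page and returns its index. */
    int writePage(byte[] page) {
        if (page.length != PAGE_SIZE) {
            throw new IllegalArgumentException("page must have PAGE_SIZE bytes");
        }
        if (memoryPages.size() < maxPagesInMemory) {
            memoryPages.add(page.clone());
            return memoryPages.size() - 1;
        }
        try {
            if (raf == null) {
                file = File.createTempFile("scratch", ".tmp");
                file.deleteOnExit();
                raf = new RandomAccessFile(file, "rw");
            }
            raf.seek((long) filePages * PAGE_SIZE);
            raf.write(page);
            return memoryPages.size() + filePages++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a page back, from memory or from the scratch file. */
    byte[] readPage(int index) {
        if (index < 0 || index >= memoryPages.size() + filePages) {
            throw new IndexOutOfBoundsException("no such page: " + index);
        }
        if (index < memoryPages.size()) {
            return memoryPages.get(index).clone();
        }
        try {
            byte[] page = new byte[PAGE_SIZE];
            raf.seek((long) (index - memoryPages.size()) * PAGE_SIZE);
            raf.readFully(page);
            return page;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True once at least one page has been written to the scratch file. */
    boolean usesFile() {
        return raf != null;
    }

    @Override
    public void close() {
        if (raf == null) {
            return;
        }
        try {
            raf.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            file.delete();
        }
    }
}
```

In this sketch the heap use for pages is bounded by maxPagesInMemory * 
PAGE_SIZE, while the total data size is limited only by available disk space.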

> Unify memory handling
> ---------------------
>
>                 Key: PDFBOX-2883
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2883
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 2.0.0
>            Reporter: Timo Boehme
>            Assignee: Timo Boehme
>
> PDFBox now has at least two different mechanisms for using main memory vs. 
> keeping large data in a temporary file: when an input stream is provided, the 
> stream is copied to a temporary file, and all PDF streams that are read are 
> handled by RandomAccessBuffer/ScratchFile.
> In PDFBOX-2882 I did a re-implementation of ScratchFile which is quite fast 
> and allows setting a maximum amount of memory to be used for its pages before 
> it starts using the scratch file. This implementation could be used as the 
> general 'backend' for all buffered streams and even the file input stream 
> copy. As long as the PDF fits into the allowed maximum memory, it should be 
> as fast as RandomAccessBuffer, while allowing good control of memory usage 
> by going to the scratch file if needed. This prevents OOM in case of large 
> files.
> In order to use this, the PDDocument methods should be changed to take not a 
> 'useScratchFile' parameter but a MemoryHandling object which details the 
> buffering strategy (whether to use ScratchFile, what amount of main memory 
> can be used, ...).
> I've opened this issue for discussion. Since we need API changes in 
> PDDocument, it should be done before the 2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

