[ 
https://issues.apache.org/jira/browse/PDFBOX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629556#comment-14629556
 ] 

Timo Boehme commented on PDFBOX-2883:
-------------------------------------

Do we use threading reading the file and parsing it so that there will be true 
concurrent access? The only problem I see with a single file would be that the 
pages used for file buffer might not be in direct sequence because other 
buffers may be created while parsing. However this would only affect 
performance if the same portion of the file spanning multiple pages is read 
again in its whole - and only if it really uses temporary file (and not 
main-memory pages) and the OS hasn't buffered the file content in cache (most 
probably the file content will be cached as it was written shortly before).

Nevertheless we can also have 2 files by creating another instance of 
ScratchFile (this class name has to be renamed as it is now merely a page 
handler which has the ability to store pages in a file).

> Unify memory handling
> ---------------------
>
>                 Key: PDFBOX-2883
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2883
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 2.0.0
>            Reporter: Timo Boehme
>            Assignee: Timo Boehme
>
> PDFBOX now has at least 2 different mechanisms to use main memory vs. keeping 
> large data in temporary file: in case of provided input stream the stream is 
> copied to temporary file and all read PDF streams are handled by 
> RandomAccessBuffer/ScratchFile.
> In PDFBOX-2882 I've done a re-implementation for ScratchFile which is quite 
> fast and allows to set a maximum amount of memory to be used for its pages 
> before it starts using the scratch file. This implementation could be used as 
> the general 'backend' for all buffered streams and even the file input stream 
> copy. As long as the PDF fits into the allowed maximum memory it should 
> equally fast as RandomAccessBuffer while it allows for good control of memory 
> usage by going to scratch file if needed. This prevents OOM in case of large 
> files.
> In order to use this the PDDocument methods should be changed to not have a 
> 'useScratchFile' parameter but to take a MemoryHandling object which details 
> the Buffering strategy (using ScratchFile; what amount of main memory can be 
> used, ...).
> I've opened this issue for discussing. Since we need API changes in 
> PDDocument it should be done before 2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to