[ 
https://issues.apache.org/jira/browse/PDFBOX-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17745924#comment-17745924
 ] 

Andreas Lehmkühler edited comment on PDFBOX-3712 at 7/22/23 1:36 PM:
---------------------------------------------------------------------

I've replaced the usage of ByteArrayOutputStream/ByteArrayInputStream with the 
new RandomAccessReadWriteBuffer.

PDFBox now supports decoded streams larger than 2 GB, as the buffer stores the 
data in chunks with a default size of 4 KB. The read/write buffer serves as 
both output and input, so it is no longer necessary to copy the data to an 
intermediate byte array. As a result, the memory footprint is reduced once more. 

Additionally, the chunk size is adjusted according to the estimated stream size 
so that we don't waste too much memory if a PDF contains lots of small streams, 
such as the example from PDFBOX-5530.
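
For illustration only, here is a minimal sketch of the chunked read/write 
buffer idea described in this comment. It is not the actual 
RandomAccessReadWriteBuffer code: the class name ChunkedBuffer, its methods, 
and the chunk-size heuristic in chunkSizeFor are hypothetical and exist only 
to show why a list of small chunks avoids both the 2 GB array limit and the 
copy into an intermediate byte array.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not PDFBox's RandomAccessReadWriteBuffer. */
class ChunkedBuffer {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private long size;    // total bytes written; a long, so it may exceed Integer.MAX_VALUE
    private long readPos; // current read position

    ChunkedBuffer(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    /** Assumed heuristic: streams shorter than the default get a chunk of exactly their size. */
    static int chunkSizeFor(long estimatedLength, int defaultChunkSize) {
        if (estimatedLength <= 0 || estimatedLength >= defaultChunkSize) {
            return defaultChunkSize;
        }
        return (int) estimatedLength;
    }

    /** Appends len bytes from src, allocating a new chunk whenever the last one is full. */
    void write(byte[] src, int off, int len) {
        while (len > 0) {
            int posInChunk = (int) (size % chunkSize);
            if (posInChunk == 0) {
                chunks.add(new byte[chunkSize]);
            }
            byte[] chunk = chunks.get(chunks.size() - 1);
            int n = Math.min(len, chunkSize - posInChunk);
            System.arraycopy(src, off, chunk, posInChunk, n);
            off += n;
            len -= n;
            size += n;
        }
    }

    /** Moves the read position; the written data stays in place, nothing is copied. */
    void seek(long pos) {
        if (pos < 0 || pos > size) {
            throw new IllegalArgumentException("position out of range: " + pos);
        }
        readPos = pos;
    }

    /** Reads up to len bytes into dst; returns the count read, or -1 at the end of the data. */
    int read(byte[] dst, int off, int len) {
        if (readPos >= size) {
            return -1;
        }
        int total = 0;
        while (len > 0 && readPos < size) {
            byte[] chunk = chunks.get((int) (readPos / chunkSize));
            int posInChunk = (int) (readPos % chunkSize);
            int n = (int) Math.min(Math.min(len, chunkSize - posInChunk), size - readPos);
            System.arraycopy(chunk, posInChunk, dst, off, n);
            off += n;
            len -= n;
            readPos += n;
            total += n;
        }
        return total;
    }

    long length() {
        return size;
    }
}

class ChunkedBufferDemo {
    public static void main(String[] args) {
        // A small stream gets a small chunk instead of the 4 KB default.
        ChunkedBuffer buffer = new ChunkedBuffer(ChunkedBuffer.chunkSizeFor(6, 4096));
        buffer.write(new byte[] {1, 2, 3, 4, 5, 6}, 0, 6);
        buffer.seek(0); // the same object is now used as the input side
        byte[] out = new byte[6];
        int n = buffer.read(out, 0, out.length);
        System.out.println("read " + n + " of " + buffer.length() + " bytes");
    }
}
```

Because the same object is written to and then read from, no toByteArray-style 
copy is needed, and because the total length is tracked as a long across many 
small arrays, the data is not bound by the maximum size of a single Java array.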





> PDFBox goes into an infinite loop with this PDF
> -----------------------------------------------
>
>                 Key: PDFBOX-3712
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3712
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.4
>            Reporter: Dirk Groeneveld
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: PDFBOX-3712-page6-rendered.png
>
>
> The PDF at 
> https://pdfs.semanticscholar.org/2095/e3df01fc32e0bff982a1e79600d5bcf10b81.pdf
>  puts PDFBox into an infinite loop.
> This is roughly my code:
> {quote}
> final PDDocument pdDoc = PDDocument.load(inputStream);
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.getText(pdDoc);
> {quote}


