[jira] [Commented] (PDFBOX-5799) Page with thousands of content streams takes extremely long to render or extract

ASF subversion and git services (Jira) Sat, 06 Apr 2024 01:13:04 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-5799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834502#comment-17834502 ]


ASF subversion and git services commented on PDFBOX-5799:
---------------------------------------------------------

Commit 1916827 from [email protected] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1916827 ]

PDFBOX-5799: search forward/backwards if the new position is after/before the 
current position to optimize the search for the correct stream
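
The idea in the commit message can be sketched as follows. This is a minimal illustration, not the code from r1916827: the class name SegmentLocator, its fields, and its method are hypothetical, and it only models the bookkeeping (which stream of a concatenated sequence contains a given absolute position), not the actual reading. Instead of restarting from the first stream on every seek, the search starts at the stream found by the previous seek and walks forward or backward from there.

```java
// Hypothetical sketch, NOT the PDFBox implementation of SequenceRandomAccessRead.
// It tracks which segment (content stream) of a concatenated sequence contains
// a given absolute position.
final class SegmentLocator {
    // starts[i] is the absolute offset of the first byte of segment i
    private final long[] starts;
    private final long totalLength;
    // index of the segment located by the most recent seek
    private int current = 0;

    SegmentLocator(long[] lengths) {
        if (lengths.length == 0) {
            throw new IllegalArgumentException("at least one segment is required");
        }
        starts = new long[lengths.length];
        long offset = 0;
        for (int i = 0; i < lengths.length; i++) {
            starts[i] = offset;
            offset += lengths[i];
        }
        totalLength = offset;
    }

    /** Returns the index of the segment that contains the given absolute position. */
    int seek(long position) {
        if (position < 0 || position >= totalLength) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        // New position is after the current segment: search forward.
        while (current < starts.length - 1 && position >= starts[current + 1]) {
            current++;
        }
        // New position is before the current segment: search backward.
        while (position < starts[current]) {
            current--;
        }
        return current;
    }
}
```

With mostly sequential access, as when a parser reads a page's content streams in order, each seek in this sketch moves the index by at most a few steps, so the cost no longer grows with the number of streams before the target. A binary search over the start offsets would be another option for truly random access patterns.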

> Page with thousands of content streams takes extremely long to render or 
> extract
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5799
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5799
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 3.0.2 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: performance
>
> As reported by Erik Branks on the mailing list:
> {quote}when attempting text extraction from the PDF at 
> [https://d-nb.info/1324982411/34] , either using PDFBox 3.0.0 or PDFBox 
> 4.0.0-SNAPSHOT, the extraction uses about 1,8 GB heap memory and does not 
> seem to terminate. I cancelled the extraction attempt after roughly 20 
> minutes. Is this another bad PDF or is there a bug in PDFBox?{quote}
> This happens with pages 230 and 231 (maybe others). Both have thousands of 
> content streams in the content stream array. The profiler suggests that most 
> time is spent in {{SequenceRandomAccessRead.seek()}}.
> Rendering page 230 with PDFBox 2.0: 50 seconds
> Rendering page 230 with PDFBox trunk: 2990 seconds
> Rendering page 231 with PDFBox trunk: 4798 seconds 



