[ https://issues.apache.org/jira/browse/PDFBOX-5799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler reassigned PDFBOX-5799:
------------------------------------------

    Assignee: Andreas Lehmkühler

> Page with thousands of content streams takes extremely long to render or extract
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5799
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5799
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 3.0.2 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: performance
>
> As reported by Erik Branks on the mailing list:
> {quote}when attempting text extraction from the PDF at 
> [https://d-nb.info/1324982411/34] , either using PDFBox 3.0.0 or PDFBox 
> 4.0.0-SNAPSHOT, the extraction uses about 1,8 GB heap memory and does not 
> seem to terminate. I cancelled the extraction attempt after roughly 20 
> minutes. Is this another bad PDF or is there a bug in PDFBox?{quote}
> This happens with pages 230 and 231 (maybe others). Both have thousands of 
> content streams in the content stream array. The profiler suggests that most 
> time is spent in {{SequenceRandomAccessRead.seek()}}.
> Rendering page 230 with PDFBox 2.0: 50 seconds
> Rendering page 230 with PDFBox trunk: 2990 seconds
> Rendering page 231 with PDFBox trunk: 4798 seconds 
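
The profiler result above points at seek() on a reader that chains many content streams. As an illustration only, here is a minimal sketch of why that can get expensive with thousands of streams, and one possible way around it. This is not the PDFBox implementation: the class SeekSketch and its methods are hypothetical, and the sketch assumes every stream is non-empty. Each seek has to find which stream contains the target position; a linear scan costs O(n) per seek, while a binary search over precomputed start offsets costs O(log n).

```java
import java.util.Arrays;

/** Hypothetical sketch, not PDFBox code: locate the stream that holds a position. */
final class SeekSketch {

    /** Start offset of each stream within the concatenated sequence. */
    static long[] startOffsets(int[] lengths) {
        long[] starts = new long[lengths.length];
        long pos = 0;
        for (int i = 0; i < lengths.length; i++) {
            starts[i] = pos;
            pos += lengths[i];
        }
        return starts;
    }

    /** Linear scan: O(n) per seek, which adds up over thousands of streams. */
    static int linearFind(long[] starts, long position) {
        int i = 0;
        while (i + 1 < starts.length && starts[i + 1] <= position) {
            i++;
        }
        return i;
    }

    /** Binary search: O(log n) per seek. Assumes position >= 0 and non-empty streams. */
    static int binaryFind(long[] starts, long position) {
        int idx = Arrays.binarySearch(starts, position);
        // A miss returns -(insertionPoint) - 1; the containing stream is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }
}
```

Both lookups return the same index; only the cost per seek differs. Whether this matches the actual hot spot in {{SequenceRandomAccessRead.seek()}} would have to be confirmed against the real code.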


