[ https://issues.apache.org/jira/browse/PDFBOX-5799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler reassigned PDFBOX-5799:
------------------------------------------

    Assignee: Andreas Lehmkühler

> Page with thousands of content streams takes extremely long to render or extract
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5799
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5799
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 3.0.2 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: performance
>
> As reported by Erik Branks on the mailing list:
> {quote}when attempting text extraction from the PDF at
> [https://d-nb.info/1324982411/34] , either using PDFBox 3.0.0 or PDFBox
> 4.0.0-SNAPSHOT, the extraction uses about 1,8 GB heap memory and does not
> seem to terminate. I cancelled the extraction attempt after roughly 20
> minutes. Is this another bad PDF or is there a bug in PDFBox?{quote}
> This happens with pages 230 and 231 (maybe others). Both have thousands of
> content streams in the content stream array. The profiler suggests that most
> time is spent in {{SequenceRandomAccessRead.seek()}}.
> Rendering page 230 with PDFBox 2.0: 50 seconds
> Rendering page 230 with PDFBox trunk: 2990 seconds
> Rendering page 231 with PDFBox trunk: 4798 seconds
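
For illustration only: the report does not say why seek() is slow, but one plausible way a seek over thousands of concatenated streams becomes expensive is a linear scan to find the stream that contains the target offset. The sketch below is not PDFBox code. The class SegmentLookup and its methods are hypothetical names, and it assumes every segment has a positive length. It contrasts a linear scan with a binary search over precomputed start offsets.

```java
import java.util.Arrays;

// Hypothetical illustration; this is NOT PDFBox's SequenceRandomAccessRead.
public class SegmentLookup {
    // startOffsets[i] is the absolute position at which segment i begins.
    private final long[] startOffsets;
    private final long totalLength;

    // Assumption: every segment length is positive, so offsets strictly increase.
    public SegmentLookup(long[] segmentLengths) {
        startOffsets = new long[segmentLengths.length];
        long position = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            if (segmentLengths[i] <= 0) {
                throw new IllegalArgumentException("segment length must be positive");
            }
            startOffsets[i] = position;
            position += segmentLengths[i];
        }
        totalLength = position;
    }

    // Linear scan: O(n) per lookup, costly when n is in the thousands
    // and the lookup runs on every seek.
    public int findLinear(long position) {
        checkRange(position);
        int index = 0;
        while (index + 1 < startOffsets.length && startOffsets[index + 1] <= position) {
            index++;
        }
        return index;
    }

    // Binary search over the same offsets: O(log n) per lookup.
    public int findBinary(long position) {
        checkRange(position);
        int found = Arrays.binarySearch(startOffsets, position);
        // A negative result encodes -(insertionPoint) - 1; the containing
        // segment is the one just before the insertion point.
        return found >= 0 ? found : -found - 2;
    }

    private void checkRange(long position) {
        if (position < 0 || position >= totalLength) {
            throw new IndexOutOfBoundsException("position out of range: " + position);
        }
    }
}
```

Both methods return the same segment index for any valid position; only the cost per lookup differs. Whether this pattern matches what the profiler observed would need to be confirmed against the actual implementation.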