[ https://issues.apache.org/jira/browse/PDFBOX-5799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834502#comment-17834502 ]
ASF subversion and git services commented on PDFBOX-5799:
---------------------------------------------------------
Commit 1916827 from [email protected] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1916827 ]
PDFBOX-5799: search forward/backwards if the new position is after/before the
current position to optimize the search for the correct stream
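
A minimal sketch of the idea in the commit message, not the actual PDFBox
change: when seeking inside a sequence of concatenated streams, start from the
stream that is currently selected and walk forward or backward, rather than
scanning from the first stream every time. The class name SegmentLocator, its
fields, and the locate() method are hypothetical and chosen only for
illustration.

```java
// Hypothetical illustration only; names and structure are assumptions,
// not taken from the PDFBox source.
final class SegmentLocator {
    // Start offset of each segment within the concatenated data.
    private final long[] startOffsets;
    private final long totalLength;
    // Index of the segment selected by the previous call.
    private int currentIndex = 0;

    SegmentLocator(long[] segmentLengths) {
        startOffsets = new long[segmentLengths.length];
        long offset = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            startOffsets[i] = offset;
            offset += segmentLengths[i];
        }
        totalLength = offset;
    }

    // Returns the index of the segment containing the given position,
    // searching relative to the previously selected segment.
    int locate(long position) {
        if (position < 0 || position >= totalLength) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        if (position >= startOffsets[currentIndex]) {
            // New position is at or after the current segment: search forward.
            while (currentIndex + 1 < startOffsets.length
                    && position >= startOffsets[currentIndex + 1]) {
                currentIndex++;
            }
        } else {
            // New position is before the current segment: search backward.
            while (position < startOffsets[currentIndex]) {
                currentIndex--;
            }
        }
        return currentIndex;
    }
}
```

With mostly sequential reads, each call moves only a few segments from the
previous one, so the cost per seek no longer grows with the total number of
segments. A binary search over the start offsets would be another option for
access patterns that jump around.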
> Page with thousands of content streams takes extremely long to render or
> extract
> --------------------------------------------------------------------------------
>
> Key: PDFBOX-5799
> URL: https://issues.apache.org/jira/browse/PDFBOX-5799
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering, Text extraction
> Affects Versions: 3.0.2 PDFBox
> Reporter: Tilman Hausherr
> Assignee: Andreas Lehmkühler
> Priority: Major
> Labels: performance
>
> As reported by Erik Branks on the mailing list:
> {quote}when attempting text extraction from the PDF at
> [https://d-nb.info/1324982411/34] , either using PDFBox 3.0.0 or PDFBox
> 4.0.0-SNAPSHOT, the extraction uses about 1,8 GB heap memory and does not
> seem to terminate. I cancelled the extraction attempt after roughly 20
> minutes. Is this another bad PDF or is there a bug in PDFBox?{quote}
> This happens with pages 230 and 231 (maybe others). Both have thousands of
> content streams in the content stream array. The profiler suggests that most
> time is spent in {{SequenceRandomAccessRead.seek()}}.
> Rendering page 230 with PDFBox 2.0: 50 seconds
> Rendering page 230 with PDFBox trunk: 2990 seconds
> Rendering page 231 with PDFBox trunk: 4798 seconds