[ https://issues.apache.org/jira/browse/PDFBOX-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051481#comment-18051481 ]
ASF subversion and git services commented on PDFBOX-6145:
---------------------------------------------------------
Commit 1931288 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1931288 ]
PDFBOX-6145: move content check after page number check so that not all pages
get checked
> Extremely slow text extraction of single page of large PDF
> ----------------------------------------------------------
>
> Key: PDFBOX-6145
> URL: https://issues.apache.org/jira/browse/PDFBOX-6145
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.35, 3.0.6 PDFBox
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Minor
> Labels: optimization
> Fix For: 2.0.36, 3.0.7 PDFBox, 4.0.0
>
>
> This happens with
> https://www.mouser.ca/catalog/catalogcad/646/dload/pdf/MOUSER.pdf
> It was discovered when showing the first page in PDFDebugger: rendering is
> done in a few seconds, but the page is displayed only minutes later because
> of the invisible text extraction that also happens.
> The cause is that the stripper goes through all pages, checks whether each
> page has content, and only then checks whether the page is to be extracted.
> Alternatively, it can be reproduced with this code:
> {code:java}
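> // doc is assumed to be the PDDocument already loaded from the file above
> // (Loader.loadPDF(file) in 3.x, PDDocument.load(file) in 2.x)
> PDFTextStripper s = new PDFTextStripper();
> s.setStartPage(1); // page numbers are 1-based
> s.setEndPage(1);   // restrict extraction to the first page
> String text = s.getText(doc);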
> {code}
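The ordering problem described in the issue, and the reordering named in the
commit message, can be illustrated with a small self-contained sketch. It does
not use PDFBox and is not the actual patch: the class PageCheckOrder and its
methods are hypothetical names, and hasContent is only a stand-in for whatever
per-page content check the stripper performs. It simply counts how often that
check runs under each ordering.

```java
public class PageCheckOrder {
    // Number of times the (potentially expensive) content check has run.
    private int contentChecks;

    // Hypothetical stand-in for the per-page content check; not PDFBox code.
    private boolean hasContent(int pageNumber) {
        contentChecks++;
        return true;
    }

    // Ordering described in the issue: content check first, page range second.
    int checksWithContentCheckFirst(int pageCount, int startPage, int endPage) {
        contentChecks = 0;
        for (int page = 1; page <= pageCount; page++) {
            if (!hasContent(page)) {
                continue;
            }
            if (page < startPage || page > endPage) {
                continue;
            }
            // The page would be extracted here.
        }
        return contentChecks;
    }

    // Ordering after the fix: page range first, content check second.
    int checksWithPageRangeCheckFirst(int pageCount, int startPage, int endPage) {
        contentChecks = 0;
        for (int page = 1; page <= pageCount; page++) {
            if (page < startPage || page > endPage) {
                continue;
            }
            if (!hasContent(page)) {
                continue;
            }
            // The page would be extracted here.
        }
        return contentChecks;
    }

    public static void main(String[] args) {
        PageCheckOrder demo = new PageCheckOrder();
        // Extract only page 1 of a hypothetical 1000-page document.
        System.out.println(demo.checksWithContentCheckFirst(1000, 1, 1));   // prints 1000
        System.out.println(demo.checksWithPageRangeCheckFirst(1000, 1, 1)); // prints 1
    }
}
```

With the range check first, the number of content checks depends on the size
of the requested page range rather than on the size of the document.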