[jira] [Commented] (PDFBOX-6145) Extremely slow text extraction of single page of large PDF

ASF subversion and git services (Jira) Tue, 13 Jan 2026 03:47:31 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051481#comment-18051481
 ]


ASF subversion and git services commented on PDFBOX-6145:
---------------------------------------------------------

Commit 1931288 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1931288 ]

PDFBOX-6145: move content check after page number check so that not all pages 
get checked
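
The reordering named in the commit message can be illustrated with a small 
standalone sketch. This is not the PDFBox code from r1931288: the class 
PageFilterSketch, the FakePage type and both method names are hypothetical, 
and hasContent() merely stands in for whatever per-page content check the 
stripper performs. It only shows why evaluating the cheap page number test 
first avoids touching pages outside the requested range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class PageFilterSketch {

    /** Hypothetical stand-in for a page; records how often its content is inspected. */
    static final class FakePage {
        private final AtomicInteger contentChecks;

        FakePage(AtomicInteger contentChecks) {
            this.contentChecks = contentChecks;
        }

        boolean hasContent() {
            contentChecks.incrementAndGet(); // models the costly per-page step
            return true;
        }
    }

    /** Builds n fake pages that all report into the same counter. */
    static List<FakePage> pages(int n, AtomicInteger counter) {
        List<FakePage> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            result.add(new FakePage(counter));
        }
        return result;
    }

    /** Order described in the issue: content check first, page range second. */
    static int extractContentCheckFirst(List<FakePage> pages, int startPage, int endPage) {
        int extracted = 0;
        int pageNo = 0;
        for (FakePage page : pages) {
            pageNo++; // 1-based page number
            if (page.hasContent() && pageNo >= startPage && pageNo <= endPage) {
                extracted++;
            }
        }
        return extracted;
    }

    /** Order after the change: page range first, so out-of-range pages are skipped. */
    static int extractRangeCheckFirst(List<FakePage> pages, int startPage, int endPage) {
        int extracted = 0;
        int pageNo = 0;
        for (FakePage page : pages) {
            pageNo++; // 1-based page number
            if (pageNo >= startPage && pageNo <= endPage && page.hasContent()) {
                extracted++;
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        AtomicInteger oldOrder = new AtomicInteger();
        extractContentCheckFirst(pages(5000, oldOrder), 1, 1);

        AtomicInteger newOrder = new AtomicInteger();
        extractRangeCheckFirst(pages(5000, newOrder), 1, 1);

        System.out.println("content checks, content-first order: " + oldOrder.get()); // 5000
        System.out.println("content checks, range-first order:   " + newOrder.get()); // 1
    }
}
```

Both variants select the same pages; only the number of content checks 
differs, which is the cost that dominates on a document with many pages.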

> Extremely slow text extraction of single page of large PDF
> ----------------------------------------------------------
>
>                 Key: PDFBOX-6145
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6145
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.35, 3.0.6 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Minor
>              Labels: optimization
>             Fix For: 2.0.36, 3.0.7 PDFBox, 4.0.0
>
>
> This happens with
> https://www.mouser.ca/catalog/catalogcad/646/dload/pdf/MOUSER.pdf
> It was discovered by showing the first page with PDFDebugger: rendering is 
> done in a few seconds, but the page is displayed only minutes later because 
> of the invisible text extraction that happens.
> The cause is that the stripper goes through all pages, checks whether each 
> page has content, and only then checks whether the page is to be extracted.
> Alternatively, it can be reproduced with this code:
> {code:java}
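>         // doc is an already loaded PDDocument of the file above
>         PDFTextStripper s = new PDFTextStripper();
>         s.setStartPage(1); // page numbers are 1-based
>         s.setEndPage(1);   // extract the first page only
>         String text = s.getText(doc);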
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

