[ 
https://issues.apache.org/jira/browse/PDFBOX-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-6145:
------------------------------------
    Issue Type: Improvement  (was: Bug)

> Extremely slow text extraction of single page of large PDF
> ----------------------------------------------------------
>
>                 Key: PDFBOX-6145
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6145
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.35, 3.0.6 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Minor
>              Labels: optimization
>             Fix For: 2.0.36, 3.0.7 PDFBox, 4.0.0
>
>         Attachments: screenshot-1.png
>
>
> This happens with
> https://www.mouser.ca/catalog/catalogcad/646/dload/pdf/MOUSER.pdf
> It was discovered by showing the first page in PDFDebugger: rendering is done 
> in a few seconds, but the page is displayed only minutes later, because of the 
> text extraction that PDFDebugger performs invisibly.
> The cause is that the stripper goes through all pages, checks whether there 
> is content, and only then checks whether the page is to be extracted.
>  !screenshot-1.png! 
> Alternatively, it can be reproduced with this code:
> {code:java}
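>         // doc is assumed to be an already loaded PDDocument
>         PDFTextStripper s = new PDFTextStripper();
>         s.setStartPage(1); // page numbers are 1-based
>         s.setEndPage(1);   // extract the first page only
>         String text = s.getText(doc);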
> {code}
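
The ordering problem described above can be illustrated with a minimal, self-contained sketch. It is not PDFBox code and not necessarily how the fix is implemented: the class `PageRangeSketch`, the methods `slowOrder` and `fastOrder`, and the simulated `hasContent` check are all hypothetical names introduced here. The only point is that testing the page range before the expensive content check avoids touching pages outside the range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the stripper's page loop; not PDFBox code.
class PageRangeSketch {
    // Counts how often the (assumed expensive) content check runs.
    static int contentChecks = 0;

    // Simulates an expensive "does this page have content?" lookup.
    static boolean hasContent(int pageNumber) {
        contentChecks++;
        return true;
    }

    // Order described in the issue: content check first, range check second.
    static List<Integer> slowOrder(int pageCount, int start, int end) {
        List<Integer> extracted = new ArrayList<>();
        for (int p = 1; p <= pageCount; p++) {
            if (hasContent(p) && p >= start && p <= end) {
                extracted.add(p);
            }
        }
        return extracted;
    }

    // Assumed alternative: range check first, so pages outside the range are skipped cheaply.
    static List<Integer> fastOrder(int pageCount, int start, int end) {
        List<Integer> extracted = new ArrayList<>();
        for (int p = 1; p <= pageCount; p++) {
            if (p >= start && p <= end && hasContent(p)) {
                extracted.add(p);
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        contentChecks = 0;
        slowOrder(1000, 1, 1);
        System.out.println("slow order content checks: " + contentChecks); // 1000

        contentChecks = 0;
        fastOrder(1000, 1, 1);
        System.out.println("fast order content checks: " + contentChecks); // 1
    }
}
```

Both orderings select the same page; only the number of content checks differs, which is what matters when each check is costly on a document with many pages.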



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
