[ 
https://issues.apache.org/jira/browse/PDFBOX-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-6145:
------------------------------------
    Description: 
Happens with
https://www.mouser.ca/catalog/catalogcad/646/dload/pdf/MOUSER.pdf
Discovered by showing the first page with PDFDebugger: rendering is done in a 
few seconds, but the page is displayed only minutes later, because of the 
invisible text extraction that also runs.

The cause is that the stripper goes through all pages, checks whether each page 
has content, and only then checks whether the page is to be extracted.
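
The ordering problem described above can be illustrated with a small, self-contained model. This is only a sketch under assumptions, not the actual PDFTextStripper code: the class PageRangeSketch, the Page type, hasContent(), and the contentChecks counter are all hypothetical names, and hasContent() merely stands in for whatever per-page work is expensive in the real stripper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a page loop; none of these names are PDFBox API.
final class PageRangeSketch {
    // Counts how often the (assumed expensive) content check runs.
    static int contentChecks = 0;

    // Stand-in for a page; hasContent() models a costly per-page check.
    static final class Page {
        final String text;
        Page(String text) { this.text = text; }
        boolean hasContent() {
            contentChecks++;
            return !text.isEmpty();
        }
    }

    // Order described in the report: content check first, range check second.
    static String extractSlow(List<Page> pages, int startPage, int endPage) {
        StringBuilder out = new StringBuilder();
        int pageNo = 0;
        for (Page p : pages) {
            pageNo++;
            if (p.hasContent() && pageNo >= startPage && pageNo <= endPage) {
                out.append(p.text);
            }
        }
        return out.toString();
    }

    // Range check first, so pages outside the range are never inspected.
    static String extractFast(List<Page> pages, int startPage, int endPage) {
        StringBuilder out = new StringBuilder();
        int pageNo = 0;
        for (Page p : pages) {
            pageNo++;
            if (pageNo > endPage) {
                break; // nothing after the end page can be extracted
            }
            if (pageNo >= startPage && p.hasContent()) {
                out.append(p.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            pages.add(new Page("page " + i + "\n"));
        }
        contentChecks = 0;
        extractSlow(pages, 1, 1);
        int slowChecks = contentChecks;
        contentChecks = 0;
        extractFast(pages, 1, 1);
        System.out.println("content checks: " + slowChecks + " vs " + contentChecks);
    }
}
```

Both methods return the same text for a given page range; the only difference is how many pages get the content check, which in this model is where the time for a single-page extraction of a large document would go.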

 !screenshot-1.png! 

Alternatively it can be reproduced with this code

{code:java}
        // doc is an already loaded PDDocument
        PDFTextStripper s = new PDFTextStripper();
        s.setStartPage(1); // page numbers are 1-based
        s.setEndPage(1);   // extract only the first page
        String text = s.getText(doc);
{code}

  was:
happens with
https://www.mouser.ca/catalog/catalogcad/646/dload/pdf/MOUSER.pdf
discovered by showing the first page with PDFDebugger, rendering done in a few 
seconds, but display minutes later, this is because of the invisible text 
extraction that happens.

The cause is that the stripper goes through all pages, checks whether there is 
content, and only then checks whether the page is to be extracted.

Alternatively it can be reproduced with this code

{code:java}
        PDFTextStripper s = new PDFTextStripper();
        s.setStartPage(1);
        s.setEndPage(1);
        String text = s.getText(doc);
{code}


> Extremely slow text extraction of single page of large PDF
> ----------------------------------------------------------
>
>                 Key: PDFBOX-6145
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6145
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.35, 3.0.6 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Minor
>              Labels: optimization
>             Fix For: 2.0.36, 3.0.7 PDFBox, 4.0.0
>
>         Attachments: screenshot-1.png
>


