[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-588.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.5.0

I just realized that the sample code extracts one page after the other using 
the startPage/endPage feature. This causes a lot of overhead, because the 
PDFTextStripper iterates over all pages for every single-page extraction. Let 
me explain it using some simple figures:

- extracting the whole text took 8 sec with no options other than "-sort"
- extracting a single page took 0.4 sec using "-sort -startPage 200 
-endPage 200"
- assuming, for the sake of convenience, that extracting each page takes the 
same time, this leads to: 1310 (pages) * 0.4 sec (time per page) = 524 sec

IMHO that's the explanation for Hesham's issue.
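
The estimate above can be reproduced with a few lines of Java. This is only a 
sketch of the arithmetic, not PDFBox code: the class and method names are 
made up for illustration, the three input figures are the ones quoted above, 
and the assumption that every page costs the same 0.4 sec is the same 
simplification used in the list.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: it only redoes the arithmetic.
public class ExtractionEstimate {

    /** Estimated total time when every page is extracted by its own call. */
    public static double perPageTotalSeconds(int pageCount, double secondsPerCall) {
        return pageCount * secondsPerCall;
    }

    /** How many times slower the per-page approach is than one full pass. */
    public static double slowdownFactor(double perPageTotalSeconds, double fullPassSeconds) {
        return perPageTotalSeconds / fullPassSeconds;
    }

    public static void main(String[] args) {
        int pageCount = 1310;          // pages in the document
        double secondsPerCall = 0.4;   // measured for page 200, assumed equal for all pages
        double fullPassSeconds = 8.0;  // whole-document extraction with "-sort"

        double total = perPageTotalSeconds(pageCount, secondsPerCall);
        double factor = slowdownFactor(total, fullPassSeconds);

        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        System.out.println(String.format(Locale.ROOT,
                "per-page total: %.0f sec, full pass: %.0f sec, slowdown: %.1fx",
                total, fullPassSeconds, factor));
    }
}
```

Under these assumptions, extracting page by page comes out roughly 65 times 
slower than a single pass over the whole document.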

Setting this to closed, as we haven't received any further input for more 
than 2 years.

Thanks to all for your help.
                
> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
>         Environment: Win XP
>            Reporter: Hesham
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.5.0
>
>         Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample1.png, 
> PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample.txt, 
> PDFTextStripper.patch
>
>
> Hello,
>  
> I have a PDF file with only 1 page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some Enter (newline) characters between lines, so the last word 
> of a line and the first word of the next line appear as 1 word with no 
> space between them!
> If I instead copy the text manually from the PDF and paste it into a text 
> editor, Enter characters appear after the same lines that caused the problem 
> in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this?
>  
> Best regards,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
