[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

JIRA Thu, 13 Jan 2011 12:09:09 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981452#action_12981452
 ]


Andreas Lehmkühler commented on PDFBOX-588:
-------------------------------------------

Without having that specific PDF at hand, it is difficult to determine 
whether there is an issue. If you are able to provide the PDF, please 
create a new issue and attach the sample to it.

BTW: I ran some tests on the PDF reference (> 1300 pages). Extracting the 
whole text took 24 seconds. The three versions 1.2.1, 1.3.1 and 1.4.0 all 
needed the same amount of time.

> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
>         Environment: Win XP
>            Reporter: Hesham
>            Assignee: Andreas Lehmkühler
>         Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt, 
> PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png, 
> PDFTextStripper.patch
>
>
> Hello,
>  
> I have a PDF file with only 1 page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some newline characters between lines, so the last word of a line 
> and the first word of the next line appear as one word with no space between 
> them.
> However, if I copy the text manually from the PDF and paste it into a text 
> editor, newline characters appear after the same lines that caused the 
> problem in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this?
>  
> Best regards ,
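
To illustrate the symptom described in the report, here is a minimal sketch 
of how a text extractor might decide where a line ends. It is not PDFBox 
code: the class LineBreakSketch, the Fragment type, the join method and the 
tolerance parameter are all hypothetical names made up for this example, and 
it only assumes that each piece of text comes with a baseline position. When 
the tolerance is larger than the distance between two lines, no separator is 
written and the neighbouring words run together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not how PDFBox is implemented.
class LineBreakSketch {

    // A piece of text together with the baseline it was drawn on.
    static final class Fragment {
        final double y;     // vertical position of the baseline on the page
        final String text;

        Fragment(double y, String text) {
            this.y = y;
            this.text = text;
        }
    }

    // Joins fragments in reading order and writes a newline whenever the
    // baseline moves by more than `tolerance` from the previous fragment.
    static String join(List<Fragment> fragments, double tolerance) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            if (previous != null && Math.abs(f.y - previous.y) > tolerance) {
                out.append('\n');
            }
            out.append(f.text);
            previous = f;
        }
        return out.toString();
    }

    // Two lines of text whose baselines are 12 units apart.
    static List<Fragment> sample() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(700.0, "first "));
        fragments.add(new Fragment(700.0, "line"));
        fragments.add(new Fragment(688.0, "second "));
        fragments.add(new Fragment(688.0, "line"));
        return fragments;
    }
}
```

With a tolerance of 2.0 the sample comes out as two lines. With a tolerance 
of 20.0 the baseline change goes unnoticed and the result contains 
"linesecond" as a single word, which is the kind of output the reporter 
describes.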

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
