[ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980829#action_12980829
 ] 

Hesham commented on PDFBOX-588:
-------------------------------

You are right, Mel. It is not caused by the paragraph demarcation code.

I took a copy of PDFTextStripper from 0.7.3 and used it in place of the one 
in version 1.4, then built PDFBox and tested it. The result was the same: 
parsing the PDF still took a long time.

So the slowdown does not seem to be related to PDFTextStripper.
Maybe it is related to the cmap files; I am not an expert in this. I hope 
you can look at it when you are free. I have a PDF of 1500 pages, and it 
took 7 minutes to extract its data :)
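
To check whether the time is spread evenly over the document or concentrated 
in a few pages, the extraction can be timed per page range. A minimal sketch, 
assuming a hypothetical helper class ExtractionTimer (not part of PDFBox). The 
PDFBox call is only indicated in a comment, and the stand-in task has to be 
replaced with the real extraction:

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of PDFBox: times a single call.
public class ExtractionTimer {

    /** Result of one timed call: the returned value and the elapsed wall-clock time. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the task once and measures how long it took. */
    public static <T> Timed<T> time(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        T value = task.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return new Timed<T>(value, elapsedMillis);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real work. With PDFBox on the classpath this would
        // be the stripper.getText(...) call, optionally limited to a page range
        // with stripper.setStartPage(n) and stripper.setEndPage(m).
        Timed<String> result = time(new Callable<String>() {
            @Override
            public String call() {
                return "extracted text placeholder";
            }
        });
        System.out.println(result.value.length() + " chars in "
                + result.elapsedMillis + " ms");
    }
}
```

Timing a few page ranges separately (for example pages 1-100, then 101-200) 
would show whether a particular part of the document accounts for most of the 
7 minutes.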

Thanks a lot.

> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
>         Environment: Win XP
>            Reporter: Hesham
>            Assignee: Andreas Lehmkühler
>         Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt, 
> PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png, 
> PDFTextStripper.patch
>
>
> Hello,
>  
> I have a PDF file with only 1 page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some newline characters between lines, so the last word of a line 
> and the first word of the next line appear as one word with no space between 
> them.
> However, if I copy the text manually from the PDF and paste it into a text 
> editor, newlines appear after the same lines that cause the problem 
> in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this?
>  
> Best regards,

