Problem extracting text in newline characters
---------------------------------------------
Key: PDFBOX-588
URL: https://issues.apache.org/jira/browse/PDFBOX-588
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator
Environment: Win XP
Reporter: Hesham
Hello,
I have a single-page PDF file. When I try to extract its text using:
String pageData = stripper.getText( pdfFile );
It drops some of the line breaks, so the last word of one line and the first
word of the next line are run together as a single word, with no space between
them.
However, if I copy the text manually from the PDF and paste it into a text
editor, line breaks appear after the same lines that cause the problem in
PDFBox.
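To make the expected behaviour concrete, here is a minimal, self-contained
sketch. It is not PDFBox code and does not use the PDFBox API: the Fragment
class, the joinFragments method, and the baseline tolerance are all
hypothetical names and values chosen for illustration. It assumes each piece of
text carries the vertical position of its baseline, and that a change in that
position beyond the tolerance marks a new line.

```java
import java.util.ArrayList;
import java.util.List;

class LineBreakSketch {

    /** Hypothetical stand-in for a piece of text and its baseline position. */
    static final class Fragment {
        final String text;
        final float baselineY;

        Fragment(String text, float baselineY) {
            this.text = text;
            this.baselineY = baselineY;
        }
    }

    /**
     * Joins fragments in reading order. Fragments whose baselines differ by
     * more than the tolerance are separated by lineSeparator; fragments on
     * the same line are separated by a single space.
     */
    static String joinFragments(List<Fragment> fragments, float tolerance,
                                String lineSeparator) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : fragments) {
            if (previous != null) {
                boolean sameLine =
                    Math.abs(current.baselineY - previous.baselineY) <= tolerance;
                out.append(sameLine ? " " : lineSeparator);
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("first", 700f));
        page.add(new Fragment("line", 700f));
        page.add(new Fragment("second", 686f)); // baseline moved: new line
        page.add(new Fragment("line", 686f));
        System.out.println(joinFragments(page, 2f, "\n"));
    }
}
```

With the sample file, the extracted text looks as if the separator between
"line" and "second" were missing entirely, so the two words are fused.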
You can download the PDF file here to reproduce the problem:
http://www.4shared.com/file/185259485/5d937eb/Enters-sample.html
Is there a way to fix this?
Best regards,