[ https://issues.apache.org/jira/browse/PDFBOX-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696242#action_12696242 ]
Justin LeFebvre commented on PDFBOX-172:
----------------------------------------
All of these issues concern text extraction specifically and should be filed
in that section. Regardless, without a test file to compare against, I'm not
sure how to verify whether they have been fixed. Much of the recent work on
PDFBox targeted exactly these issues, so I believe they are fixed; however, I
can't be positive.
> Letters and newlines disappear
> ------------------------------
>
> Key: PDFBOX-172
> URL: https://issues.apache.org/jira/browse/PDFBOX-172
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1502153
> Originally submitted by nobody on 2006-06-07 03:26.
> The attached file is extracted as text using
> PDFTextConverter.writeText(PDDocument, StringWriter).
> The output text is problematic:
> 1. Words are fused together: ASTROBIOLOGYVolume,
> PaperThe, ABSTRACTThe, etc.
> 2. Words are misspelled: "comunity"
> 3. The bottom part of the ABSTRACT (at the beginning
> of the 2nd page of the PDF) is found AFTER the rest
> of the content of the 2nd page.
> 4. The two columns of text from the 3rd page of the
> PDF are found in reversed order in the XML.