[jira] Commented: (PDFBOX-172) Letters and newlines disappear

Justin LeFebvre (JIRA) Mon, 06 Apr 2009 13:39:35 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696242#action_12696242 ]


Justin LeFebvre commented on PDFBOX-172:
----------------------------------------

All of these issues concern text extraction specifically and should be filed 
under that component. Regardless, without a test file to compare against, I'm 
not sure how to verify whether they have been fixed. Much of the recent work 
on PDFBox targeted exactly these issues, so I believe they are fixed, but I 
can't be positive. 
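
If a test file does turn up, one way to check the output would be to scan the 
extracted text for the bad fragments quoted in the report below. This is only a 
minimal sketch using plain Java strings. The class name, method name, and 
sample text are hypothetical, and the text is assumed to have been extracted 
already, so no PDFBox calls are shown.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractionCheck {
    // Fragments quoted in the original report: three fused words and one
    // misspelling. Treating them as regression markers is an assumption.
    static final List<String> KNOWN_BAD_FRAGMENTS =
            List.of("ASTROBIOLOGYVolume", "PaperThe", "ABSTRACTThe", "comunity");

    // Returns every known bad fragment that still occurs in the extracted text.
    static List<String> findBadFragments(String extractedText) {
        List<String> found = new ArrayList<>();
        for (String fragment : KNOWN_BAD_FRAGMENTS) {
            if (extractedText.contains(fragment)) {
                found.add(fragment);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample; real input would be the text extracted
        // from the attached PDF.
        String sample = "ASTROBIOLOGY Volume 6 Research Paper The community";
        System.out.println("Bad fragments found: " + findBadFragments(sample));
    }
}
```

An empty result only shows that these particular fragments are gone. It says 
nothing about the reading-order problems in items 3 and 4 of the report.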

> Letters and newlines disappear
> ------------------------------
>
>                 Key: PDFBOX-172
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-172
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1502153
> Originally submitted by nobody on 2006-06-07 03:26.
> The attached file is extracted as text using 
> PDFTextConverter.writeText(PDDocument, StringWriter) .
> The output text is problematic:
> 1. Words are fused together: ASTROBIOLOGYVolume, 
> PaperThe, ABSTRACTThe, etc.
> 2. Words are mis-spelled: "comunity"
> 3. The bottom part of the ABSTRACT (at the beginning 
> of the 2nd page of the PDF) is found AFTER the rest 
> of the content of the 2nd page. 
> 4. The two columns of text from the 3rd page of the 
> PDF are found in reversed order in the XML.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
