[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

JIRA Tue, 04 Jan 2011 11:38:10 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977426#action_12977426
 ]


Andreas Lehmkühler commented on PDFBOX-588:
-------------------------------------------

I found a solution for the rendering issue. It is a dirty hack, and I'll need 
some time to create a suitable patch. I've attached the rendering result as a 
PNG.

The extraction is also improved, but not perfect. The problem is that the 
individual lines of the caption and of the text are not at the same vertical 
position, so it is difficult to decide whether two pieces of text belong to 
the same line or not. I've also attached the extraction result.
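
To illustrate why this decision is hard, here is a minimal sketch of one common 
heuristic: treat two text fragments as being on the same line when their 
baselines differ by less than a fraction of the glyph height. This is not the 
code from the attached patch and not PDFBox API; the class LineGrouping, the 
Fragment type and the toleranceFactor parameter are all hypothetical names 
chosen for the example, and the fragments are assumed to arrive in reading 
order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: groups text fragments into lines
// by comparing their baselines with a tolerance.
final class LineGrouping {

    // Illustrative fragment type: text plus baseline y position and glyph height.
    static final class Fragment {
        final String text;
        final float baseline;
        final float height;

        Fragment(String text, float baseline, float height) {
            this.text = text;
            this.baseline = baseline;
            this.height = height;
        }
    }

    // Two fragments count as the same line when their baselines differ by at
    // most toleranceFactor times the smaller of the two glyph heights.
    static boolean sameLine(Fragment a, Fragment b, float toleranceFactor) {
        float tolerance = Math.min(a.height, b.height) * toleranceFactor;
        return Math.abs(a.baseline - b.baseline) <= tolerance;
    }

    // Joins fragments (assumed to be in reading order) into lines of text.
    static List<String> group(List<Fragment> fragments, float toleranceFactor) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            // Start a new line when the baseline jumps beyond the tolerance.
            if (previous != null && !sameLine(previous, f, toleranceFactor)) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(f.text);
            previous = f;
        }
        if (previous != null) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

The trade-off is in the tolerance: a loose value merges a caption with body 
text that sits slightly higher or lower, while a strict value splits them into 
separate lines, and no single value is right for every document.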


> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator, 1.3.1, 1.4.0
>         Environment: Win XP
>            Reporter: Hesham
>            Assignee: Andreas Lehmkühler
>         Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample.txt, 
> PDFBOX588-Enters-sample1.png, PDFBOX588-Enters-sample1.png, 
> PDFTextStripper.patch
>
>
> Hello,
>
> I have a PDF file with only one page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some of the line breaks (Enter characters) between lines, so the 
> last word of one line and the first word of the next line appear as a single 
> word with no space between them.
> If I instead copy the text manually from the PDF and paste it into a text 
> editor, line breaks do appear after the same lines that cause the problem in 
> PDFBox.
> Please check the attached file as a sample.
>
> Is there a way to fix this?
>
> Best regards,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
