[jira] Commented: (PDFBOX-588) Problem extracting text in newline characters

Villu Ruusmann (JIRA) Wed, 06 Jan 2010 14:15:24 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797349#action_12797349 ]


Villu Ruusmann commented on PDFBOX-588:
---------------------------------------

As discussed on the pdfbox-users mailing list [1], this issue stems from the 
naivety of PDFTextStripper's line detection algorithm.

Obvious line wraps are straightforward to correct for. I've attached a sample 
patch file which does so by detecting TextPosition instances that are located 
significantly below and to the left of the previous TextPosition instance. The 
current threshold values are arbitrary (e.g. 5 times the width of a space in 
the X-direction) and should be replaced with something more meaningful.
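
The "below and to the left" check described above can be sketched roughly as 
follows. This is a minimal illustration, not the attached patch: the Glyph 
class is a hypothetical stand-in for the geometry that TextPosition exposes, 
Y is assumed to grow downward, and the vertical threshold (half the glyph 
height) is an assumption. Only the 5x space-width factor comes from the 
comment above.

```java
// Minimal sketch of a line-wrap heuristic. Glyph is a hypothetical stand-in
// for PDFBox's TextPosition; it is NOT the PDFBox class and not the patch.
final class LineWrapSketch {

    static final class Glyph {
        final float x;            // left edge of the glyph
        final float y;            // vertical position, assumed to grow downward
        final float height;       // glyph height
        final float widthOfSpace; // width of a space in the glyph's font

        Glyph(float x, float y, float height, float widthOfSpace) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.widthOfSpace = widthOfSpace;
        }
    }

    // Arbitrary factor mentioned in the comment: 5 times the width of a space.
    static final float X_FACTOR = 5.0f;
    // Assumption for illustration: half a glyph height counts as "below".
    static final float Y_FACTOR = 0.5f;

    // True when current sits significantly below AND to the left of previous.
    static boolean isLineWrap(Glyph previous, Glyph current) {
        boolean below = current.y - previous.y > Y_FACTOR * previous.height;
        boolean left = previous.x - current.x > X_FACTOR * previous.widthOfSpace;
        return below && left;
    }

    public static void main(String[] args) {
        Glyph endOfLine = new Glyph(480f, 100f, 10f, 3f);
        Glyph startOfNextLine = new Glyph(72f, 112f, 10f, 3f);
        Glyph nextOnSameLine = new Glyph(486f, 100f, 10f, 3f);

        System.out.println(isLineWrap(endOfLine, startOfNextLine));
        System.out.println(isLineWrap(endOfLine, nextOnSameLine));
    }
}
```

A caller would emit a line separator before writing the current glyph 
whenever isLineWrap returns true. Requiring both conditions keeps small 
vertical shifts (such as subscripts) and ordinary backward kerning from being 
treated as wraps.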

[1] http://markmail.org/message/4b3bqpx7zznyqljh

> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win XP
>            Reporter: Hesham
>         Attachments: Enters-sample.pdf, PDFTextStripper.patch
>
>
> Hello,
>  
> I have a PDF file with only 1 page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some Enter characters between lines, so the last word of a line 
> and the first word of the next line appear as one word with no space between 
> them.
> However, if I copy the text manually from the PDF and paste it into a text 
> editor, Enter characters appear after the same lines that caused the problem 
> in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this?
>  
> Best regards,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
