[jira] Issue Comment Edited: (PDFBOX-588) Problem extracting text in newline characters

Hesham (JIRA) Tue, 04 Jan 2011 06:40:11 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977281#action_12977281
 ]


Hesham edited comment on PDFBOX-588 at 1/4/11 9:39 AM:
-------------------------------------------------------

Thanks a lot, Mel and Andreas, for the investigation.
The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I 
tested it on 5 PDFs, and the best value for me was 0.3f. It extracts most 
words correctly.

As for the PDF attached to this issue, the missing-space problem is now limited 
to the last words of the paragraph at the lower left, for example:
"be able to read about Paul Revere's midnight" -> 
"beabletoreadaboutPaulRevere'smidnight"
"journey only a" -> "journeyonlya"

If I use a spacing tolerance of 0.1f, those words are extracted correctly, but 
in return other words come out wrong, for example:
"UNCENSORED REVOLUTIONARY WAR HISTORY" -> 
"U N C E N S O R E D R E V O L U T I O N A R Y W A R H I S T O R Y"

So I will leave it at 0.3f, which gives much better results overall. I will now 
check the 'Enters' problem in PDFBOX-521.
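
To illustrate the trade-off described above, here is a minimal, self-contained 
sketch. It is a toy model and not PDFBox's actual algorithm: it assumes every 
glyph has the same width and inserts a space whenever the gap between two 
glyphs exceeds (space width * tolerance). The class name, the helper methods, 
and all widths and gaps are made-up values chosen only to mimic a tightly set 
paragraph and a letter-spaced heading.

```java
class SpacingToleranceDemo {
    // Assumed, made-up metrics: every glyph is 5 units wide, a space is 2.5 units.
    static final float GLYPH_WIDTH = 5f;
    static final float SPACE_WIDTH = 2.5f;

    /** Lays out the non-space characters of text and returns their left edges. */
    static float[] layout(String text, float letterGap, float wordGap) {
        float[] xs = new float[text.replace(" ", "").length()];
        float x = 0f;
        int n = 0;
        boolean afterSpace = false;
        for (char c : text.toCharArray()) {
            if (c == ' ') {          // spaces are not drawn, they only widen the gap
                afterSpace = true;
                continue;
            }
            if (n > 0) {
                x += GLYPH_WIDTH + (afterSpace ? wordGap : letterGap);
            }
            xs[n++] = x;
            afterSpace = false;
        }
        return xs;
    }

    /** Rebuilds text from glyph positions, guessing spaces from the gaps. */
    static String extract(String glyphs, float[] xs, float tolerance) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length(); i++) {
            if (i > 0) {
                float gap = xs[i] - (xs[i - 1] + GLYPH_WIDTH);
                if (gap > SPACE_WIDTH * tolerance) {
                    out.append(' ');
                }
            }
            out.append(glyphs.charAt(i));
        }
        return out.toString();
    }

    /** Lays out text with the given gaps, then extracts it again. */
    static String roundTrip(String text, float letterGap, float wordGap, float tolerance) {
        return extract(text.replace(" ", ""), layout(text, letterGap, wordGap), tolerance);
    }

    public static void main(String[] args) {
        // Tightly set paragraph: word gaps of only 0.5 units.
        System.out.println(roundTrip("journey only a", 0f, 0.5f, 0.3f));
        System.out.println(roundTrip("journey only a", 0f, 0.5f, 0.1f));
        // Letter-spaced heading: 0.5 units between letters, 3 between words.
        System.out.println(roundTrip("WAR HISTORY", 0.5f, 3f, 0.3f));
        System.out.println(roundTrip("WAR HISTORY", 0.5f, 3f, 0.1f));
    }
}
```

In this model no single tolerance suits both inputs: a low value recovers the 
narrow word gaps but also splits the letter-spaced heading, while a higher 
value keeps the heading intact and merges the tightly set words.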

      was (Author: hesham):
    Thanks a lot, Mel and Andreas, for the investigation.
The 'PDFTextStripper.setSpacingTolerance(float)' method is very interesting. I 
tested it on 5 PDFs, and the best value for me was 0.3f. It extracts most 
words correctly.

As for the PDF attached to this issue, the missing-space problem is now limited 
to the last words of the paragraph at the lower left, for example:
"able to" -> "ableto"
"in order" -> "inorder"
"But not" -> "Butnot"
"who set" -> "whoset"

I think this is because of the 'Enters' problem. I will check it now in 
PDFBOX-521.
  
> Problem extracting text in newline characters
> ---------------------------------------------
>
>                 Key: PDFBOX-588
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-588
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win XP
>            Reporter: Hesham
>         Attachments: Enters-sample.pdf, PDFBOX588-Enters-sample1.png, 
> PDFTextStripper.patch
>
>
> Hello,
>  
> I have a PDF file with only 1 page. When I try to extract its text using:
> String pageData = stripper.getText( pdfFile );
> it ignores some Enter (newline) characters between lines, so the last word 
> of a line and the first word of the next line appear as one word with no 
> space between them.
> However, if I copy the text manually from the PDF and paste it into a text 
> editor, Enter characters appear after the same lines that caused the problem 
> in PDFBox.
> Please check the attached file as a sample.
>  
> Is there a way to fix this?
>  
> Best regards,

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
