[ 
https://issues.apache.org/jira/browse/PDFBOX-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689932#action_12689932
 ] 

Justin LeFebvre edited comment on PDFBOX-349 at 3/27/09 9:53 AM:
-----------------------------------------------------------------

For this patch I worked with Brian Carrier.

With the attached fix, Brian and I have improved spacing detection for files 
such as this one, where the page was scanned in slightly skewed. PDFBox no 
longer relies solely on the reported width of the space character to decide 
whether a space should be added to the extracted text. It still uses that 
width, but it also keeps a running average of the character widths seen so 
far. To decide whether a space should be added, it compares the two widths, 
picks the smaller one, and adds it to the previous X position to get the X 
position where we expect the next word to start. If the expected X position is 
less than the new X position, then we add a space. 
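
The check described above can be sketched roughly as follows. This is only an 
illustration of the idea, not the code in the attached patch: the class name 
SpacingHeuristic and its methods are hypothetical, "previous X position" is 
assumed to mean the X coordinate where the previous character ended, and no 
scaling factors are applied to either width because the comment does not 
mention any.

```java
// Hypothetical sketch of the spacing check; not PDFBox's actual implementation.
class SpacingHeuristic {
    private double widthSum = 0.0; // sum of the character widths seen so far
    private int widthCount = 0;    // number of characters seen so far

    /** Records a character width so it contributes to the running average. */
    void recordGlyph(double width) {
        widthSum += width;
        widthCount++;
    }

    /** Running average of the recorded widths, or 0 if none were recorded. */
    double averageWidth() {
        return widthCount == 0 ? 0.0 : widthSum / widthCount;
    }

    /**
     * Decides whether a space belongs before a character starting at nextX.
     * Assumption: previousEndX is where the previous character ended.
     */
    boolean needsSpace(double previousEndX, double nextX, double reportedSpaceWidth) {
        // Start from the reported space width, then prefer the running
        // average when it is smaller (only once at least one width is known).
        double threshold = reportedSpaceWidth;
        if (widthCount > 0) {
            threshold = Math.min(reportedSpaceWidth, averageWidth());
        }
        // Where the next word is expected to start if a space was present.
        double expectedX = previousEndX + threshold;
        return expectedX < nextX;
    }
}
```

Taking the smaller of the two widths means that an overly large reported space 
width, which seems to be the problem with skewed scans, no longer hides real 
gaps between words.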

      was (Author: justinl):
    With the attached fix, I have improved spacing detection for files such 
as this one, where the page was scanned in slightly skewed. PDFBox no longer 
relies solely on the reported width of the space character to decide whether a 
space should be added to the extracted text. It still uses that width, but it 
also keeps a running average of the character widths seen so far. To decide 
whether a space should be added, it compares the two widths, picks the smaller 
one, and adds it to the previous X position to get the X position where we 
expect the next word to start. If the expected X position is less than the new 
X position, then we add a space. 
  
> Spaces between words ignored in scanned pdf files
> -------------------------------------------------
>
>                 Key: PDFBOX-349
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-349
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Jukka Zitting
>         Attachments: SpacingFix.zip, UpdatedSpacingRegressionFiles.zip
>
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1922502&group_id=78314&atid=552832
> I am using PDF-Box-0.7.3.dll with C# and have tested extraction on two
> searchable pdfs that I have scanned in from paper. Spaces between words are
> ignored for both files. I have also tested another pdf file (which I
> downloaded from the internet) and it was parsed correctly. Unfortunately,
> the file is 1.2MB and the upload was blocked. Please send me an email
> (gkobz...@hotmail.com) and I will reply back with the file.
> Thanks for looking into this.
> Greg
> [Comment on SourceForge]
> Date: 2008-03-23 21:24
> Sender: gkobzeff
> Logged In: YES 
> user_id=2042611
> Originator: YES
> I have scanned the file into a smaller file size. I have attached the
> file.
> Thanks
> File Added: Advanced Pain Mgmt BW.pdf
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=271548&aid=1922502

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
