[ https://issues.apache.org/jira/browse/PDFBOX-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689932#action_12689932 ]
Justin LeFebvre edited comment on PDFBOX-349 at 3/27/09 9:53 AM:
-----------------------------------------------------------------

For this patch I worked with Brian Carrier. With the attached fix, Brian and I have improved spacing detection for files such as this one, where the page was scanned in slightly skewed. Previously, PDFBox relied only on the reported width of the space character to decide whether a space should be added to the text file. It still uses that width, but it now also keeps a running average of the character widths seen previously. To decide whether a space should be added, it compares the two widths, picks the smaller one, and adds it to our previous X position to get the position where we expect the next word to start. If the expected X position is less than our new X position, then we add a space.

      was (Author: justinl):
    With the attached fix, I have improved spacing detection for files such as this one, where the page was scanned in slightly skewed. Previously, PDFBox relied only on the reported width of the space character to decide whether a space should be added to the text file. It still uses that width, but it now also keeps a running average of the character widths seen previously. To decide whether a space should be added, it compares the two widths, picks the smaller one, and adds it to our previous X position to get the position where we expect the next word to start. If the expected X position is less than our new X position, then we add a space.
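
The heuristic described in the comment can be sketched roughly as follows. This is only an illustration of the idea, not the code from SpacingFix.zip: the class name `SpacingSketch`, its method names, and the choice to treat the "previous X position" as the right edge of the previous glyph are all assumptions made for the example, and the raw widths are used without any tolerance factor.

```java
// Hypothetical sketch of the spacing heuristic; names are not from PDFBox.
class SpacingSketch {
    // Running average of the glyph widths seen so far (0 until one is seen).
    private double averageCharWidth = 0.0;
    private int glyphCount = 0;

    // Fold one more glyph width into the running average.
    void observeGlyph(double width) {
        glyphCount++;
        averageCharWidth += (width - averageCharWidth) / glyphCount;
    }

    double averageWidth() {
        return averageCharWidth;
    }

    // Decide whether a space belongs before the glyph that starts at newX.
    // previousX is assumed to be where the previous glyph ended.
    boolean needsSpace(double previousX, double newX, double reportedSpaceWidth) {
        // Pick the smaller of the reported space width and the running average.
        // Before any glyph has been seen, fall back to the reported width alone.
        double gap = (glyphCount > 0)
                ? Math.min(reportedSpaceWidth, averageCharWidth)
                : reportedSpaceWidth;

        // Where the next word is expected to start.
        double expectedX = previousX + gap;

        // The new glyph starts beyond the expected position: add a space.
        return expectedX < newX;
    }
}
```

Taking the smaller of the two widths means that an overly large reported space width, which a scanned file may well carry, no longer hides real word gaps, while a sensible reported width still applies whenever it is the tighter bound.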

> Spaces between words ignored in scanned pdf files
> -------------------------------------------------
>
>                 Key: PDFBOX-349
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-349
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Jukka Zitting
>         Attachments: SpacingFix.zip, UpdatedSpacingRegressionFiles.zip
>
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1922502&group_id=78314&atid=552832
> I am using PDF-Box-0.7.3.dll with C# and have tested extraction on two
> searchable pdfs that I have scanned in from paper. Spaces between words are
> ignored for both files. I have also tested another pdf file (which I
> downloaded from the internet) and it was parsed correctly. Unfortunately,
> the file is 1.2MB and the upload was blocked. Please send me an email
> (gkobz...@hotmail.com) and I will reply back with the file.
> Thanks for looking into this.
> Greg
> [Comment on SourceForge]
> Date: 2008-03-23 21:24
> Sender: gkobzeff
> Logged In: YES
> user_id=2042611
> Originator: YES
> I have scanned the file into a smaller file size. I have attached the
> file.
> Thanks
> File Added: Advanced Pain Mgmt BW.pdf
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=271548&aid=1922502

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.