[ 
https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956067#comment-14956067
 ] 

Ben McCann edited comment on PDFBOX-3019 at 10/14/15 1:03 AM:
--------------------------------------------------------------

The extraneous spacing problem is common enough in resumes that I was able to 
find the [sample 
resume|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
 I posted earlier just by downloading half a dozen random resumes from Google. 
However, it's not showing up in the digital corpora files that I'm looking at. 
Perhaps this issue only shows up with certain fonts or with documents that were 
converted from Word to PDF?

To help diagnose, I edited the example doc that I found from Google so that now 
all it says is "[email protected]". PDFBox is parsing this as "jb 
[email protected]". Hopefully someone with more PDFBox familiarity than I have 
will be able to track down what the issue is. The [sample 
resume|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
 from the web might be a better test case to verify if the problem is fixed, 
but I wasn't sure if we'd be allowed to commit a random document from the web 
or if it'd need to be appropriately licensed. We should be able to commit the 
edited document for use in a test. I'm wondering if there might be some bug 
beyond just the tolerance settings, since these letters look quite close 
together to the human eye.
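
As a rough way to look for affected files in a larger corpus, the extracted 
text could be checked for runs of single-character tokens. The sketch below is 
only an illustration: the class name, the method names, and the 0.5 threshold 
are hypothetical and not part of PDFBox, and the input is assumed to be the 
string already returned by PDFTextStripper.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of PDFBox): flags text that looks letter-spaced. */
public class SpacingCheck {

    /** Returns the fraction of whitespace-separated tokens that are exactly one character long. */
    public static double singleCharTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        long singles = Arrays.stream(tokens).filter(t -> t.length() == 1).count();
        return (double) singles / tokens.length;
    }

    /** Assumed threshold: more than half of the tokens being single characters is suspicious. */
    public static boolean looksLetterSpaced(String text) {
        return singleCharTokenRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        // Text in the style reported in the issue description.
        System.out.println(looksLetterSpaced("J e a n e t t e  A c o s t a"));
        // The same kind of text without the extra spaces.
        System.out.println(looksLetterSpaced("Jeanette Acosta; Service Manager"));
    }
}
```

This only catches the fully letter-spaced case. A single stray space, as in 
the edited one-line document, would not be flagged by a ratio like this.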


> Optimize tolerance settings
> ---------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>         Attachments: jbl-example-com.pdf
>
>
> From testing on my internal dataset I believe there might be some regression 
> in the effectiveness of PDFTextStripper.
> Here's an [example 
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
>  I found on the web, which converted better in 1.8 than 2.0. Notice that it 
> extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
> M a d  F o x  B r e w i n g  C o m p a n y". It doesn't seem like there's 
> very much space between the letters in the PDF, so it's curious to me that 
> the extracted text has a space after every letter.
> I realize this is an area where we probably can't strive for perfection. Yet, 
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
> believe there's some sort of regression test for PDFToImage which exports a 
> set of PDFs to images at two different commits and looks at what the 
> differences are. Do we have the same sort of thing for PDFTextStripper? If 
> not, can we build one by pulling docs off the public web? I'd be willing to 
> contribute to this endeavor.
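
A minimal sketch of what such a text regression check could look like, 
assuming the text of each corpus file has already been extracted once with 
the baseline build and once with the candidate build. The class and method 
names are hypothetical, and the extraction step itself is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: lists corpus files whose extracted text changed between two builds. */
public class ExtractionDiff {

    /**
     * Both maps go from file name to extracted text (values assumed non-null).
     * Returns, in sorted order, the baseline file names whose text is missing
     * or different in the candidate. Files present only in the candidate are ignored.
     */
    public static List<String> changedFiles(Map<String, String> baseline,
                                            Map<String, String> candidate) {
        List<String> changed = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order for reporting.
        for (Map.Entry<String, String> entry : new TreeMap<>(baseline).entrySet()) {
            String after = candidate.get(entry.getKey());
            if (after == null || !after.equals(entry.getValue())) {
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

An exact string comparison is the simplest choice here. It reports that 
something changed, not whether the change is an improvement, so flagged files 
would still need a manual look.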


