Ben McCann created PDFBOX-3019:
----------------------------------

             Summary: Optimize tolerance settings
                 Key: PDFBOX-3019
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 2.0.0
            Reporter: Ben McCann
             Fix For: 2.0.0


From testing on my internal dataset, I believe there might be some regression 
in the effectiveness of PDFTextStripper.

Here's an [example doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf] 
I found on the web, which converted better in 1.8 than in 2.0. Notice that 2.0 
extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
M a d  F o x  B r e w i n g  C o m p a n y". There doesn't seem to be much 
space between the letters in the PDF, so I'm surprised the extraction handled 
it this poorly.

I realize this is an area where we probably can't strive for perfection. Yet, 
it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
believe there's some sort of regression test for PDFToImage which exports a set 
of pdfs to images at two different commits and looks at what the differences 
are. Do we have the same sort of thing for PDFTextStripper? If not, can we 
build one by pulling docs off the public web? I'd be willing to contribute to 
this endeavor.
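
As a starting point for such a check, here is a minimal sketch of a heuristic 
that scores extracted text by how much of it consists of letter-spaced runs 
like the one above. It is plain Java and makes no PDFBox calls; the class name 
SpacedTextCheck, the method spacedRatio, and the four-character threshold are 
all assumptions made for illustration, not anything that exists in PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of PDFBox.
final class SpacedTextCheck {

    // A run of at least four single non-space characters, each separated
    // from the next by exactly one space, e.g. "J e a n".
    private static final Pattern SPACED_RUN =
            Pattern.compile("(?<!\\S)\\S(?: \\S){3,}(?!\\S)");

    // Returns the fraction (0.0 to 1.0) of non-whitespace characters
    // that sit inside a letter-spaced run.
    static double spacedRatio(String text) {
        int total = text.replaceAll("\\s", "").length();
        if (total == 0) {
            return 0.0;
        }
        int spaced = 0;
        Matcher m = SPACED_RUN.matcher(text);
        while (m.find()) {
            // A run of n characters has n - 1 separating spaces,
            // so its length is 2n - 1 and n = (length + 1) / 2.
            spaced += (m.group().length() + 1) / 2;
        }
        return (double) spaced / total;
    }
}
```

Running the same set of documents through the text extraction of two builds 
and comparing the resulting ratios per document could flag cases like this 
one, although it would not catch other kinds of extraction differences.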



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)