[ 
https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955255#comment-14955255
 ] 

Ben McCann commented on PDFBOX-3019:
------------------------------------

Is there anything that checks the results of running against these files? If I 
want to improve TextStripper is the best course of action to grab a few of 
these files and add to TestTextStripper?

> Optimize tolerance settings
> ---------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>
> From testing on my internal dataset I believe there might be some regression 
> in the effectiveness of PDFTextStripper.
> Here's an [example 
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
>  I found on the web, which converted better in 1.8 than 2.0. Notice that it 
> extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
> M a d  F o x  B r e w i n g  C o m p a n y". It doesn't seem like there's 
> very much space between the letters in the pdf, so it's curious to me that it 
> didn't do too well.
> I realize this is an area where we probably can't strive for perfection. Yet, 
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
> believe there's some sort of regression test for PDFToImage which exports a 
> set of pdfs to images at two different commits and looks at what the 
> differences are. Do we have the same sort of thing for PDFTextStripper? If 
> not, can we build one by pulling docs off the public web? I'd be willing to 
> contribute to this endeavor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to