[jira] [Comment Edited] (PDFBOX-3019) Optimize tolerance settings

Tilman Hausherr (JIRA) Tue, 13 Oct 2015 10:03:10 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955273#comment-14955273
 ]


Tilman Hausherr edited comment on PDFBOX-3019 at 10/13/15 5:01 PM:
-------------------------------------------------------------------

No, not in the build. You would have to choose some files that extract properly 
and are not too big. Those that don't extract properly deliver very variable 
crap, which makes them impossible to test.

What we do have is the tests by Tim Allison. These are not part of the build, 
but were helpful to find bugs by comparing the output to earlier versions.


was (Author: tilman):
No, not in the build. You would have to choose some files that extract 
properly. Those that don't extract properly deliver very variable crap, which 
makes them impossible to test.

What we do have is the tests by Tim Allison. These are not part of the build, 
but were helpful to find bugs by comparing the output to earlier versions.

> Optimize tolerance settings
> ---------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>
> From testing on my internal dataset I believe there might be some regression 
> in the effectiveness of PDFTextStripper.
> Here's an [example 
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
>  I found on the web, which converted better in 1.8 than 2.0. Notice that it 
> extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
> M a d  F o x  B r e w i n g  C o m p a n y". It doesn't seem like there's 
> very much space between the letters in the pdf, so it's curious to me that it 
> didn't do too well.
> I realize this is an area where we probably can't strive for perfection. Yet, 
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
> believe there's some sort of regression test for PDFToImage which exports a 
> set of pdfs to images at two different commits and looks at what the 
> differences are. Do we have the same sort of thing for PDFTextStripper? If 
> not, can we build one by pulling docs off the public web? I'd be willing to 
> contribute to this endeavor.
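
The version-to-version comparison asked about above could be sketched roughly 
as follows. This is only an illustration and not an existing PDFBox test: the 
class TextRegressionSketch, its method names, and the 0.5 threshold are all 
hypothetical, and the "1.8" sample string is an assumed clean extraction, not 
the real 1.8 output. It takes text that was already extracted by two versions 
and flags files where the share of single-letter tokens jumped, which is the 
"J e a n e t t e" symptom from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox. It only compares strings that
// were extracted elsewhere (e.g. by PDFTextStripper at two commits).
class TextRegressionSketch {

    // Fraction of whitespace-separated tokens that consist of one letter.
    static double singleLetterTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int singles = 0;
        for (String token : tokens) {
            if (token.length() == 1 && Character.isLetter(token.charAt(0))) {
                singles++;
            }
        }
        return (double) singles / tokens.length;
    }

    // Names of files whose ratio rose by more than the threshold between
    // the baseline and the candidate output, in sorted order.
    static List<String> findRegressions(Map<String, String> baseline,
                                        Map<String, String> candidate,
                                        double threshold) {
        List<String> regressed = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(baseline).entrySet()) {
            String newText = candidate.get(entry.getKey());
            if (newText == null) {
                continue; // file missing from the candidate run, skip it
            }
            double before = singleLetterTokenRatio(entry.getValue());
            double after = singleLetterTokenRatio(newText);
            if (after - before > threshold) {
                regressed.add(entry.getKey());
            }
        }
        return regressed;
    }

    public static void main(String[] args) {
        Map<String, String> baseline = new HashMap<>();
        Map<String, String> candidate = new HashMap<>();
        // Assumed sample strings; the baseline is not the actual 1.8 output.
        baseline.put("resume.pdf", "Jeanette Acosta; Service Manager");
        candidate.put("resume.pdf",
                "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r");
        System.out.println(findRegressions(baseline, candidate, 0.5));
    }
}
```

A heuristic like this only catches the spurious-space symptom. A fuller 
comparison would also need a text diff between the two outputs, and, as noted 
in the comment above, a set of input files that extract stably.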



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

