[jira] [Commented] (PDFBOX-3019) Optimize tolerance settings

Maruan Sahyoun (JIRA) Sun, 11 Oct 2015 02:05:13 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952224#comment-14952224
 ]


Maruan Sahyoun commented on PDFBOX-3019:
----------------------------------------

There already is such a tool, {{TestTextStripper.java}}, which compares the 
extracted text with known content. It currently covers only a small set of 
documents, but we can always extend that.
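
To illustrate the idea of comparing extracted text with known content, here is 
a minimal sketch. It is not the actual {{TestTextStripper.java}} code: the 
class name {{ExtractionCheck}} and its methods are hypothetical, and the 
"actual" string is hard-coded where a real test would presumably use the 
result of {{PDFTextStripper.getText(document)}}. The sample strings are 
adapted from the example in the issue description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFBox. */
public class ExtractionCheck {

    /** Trim and collapse whitespace runs so layout-only differences are ignored. */
    public static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    /**
     * Compare known-good text with extracted text, line by line.
     * Returns one message per differing line; an empty list means they match.
     */
    public static List<String> diff(String expected, String actual) {
        String[] exp = expected.split("\\R");
        String[] act = actual.split("\\R");
        List<String> problems = new ArrayList<>();
        int n = Math.max(exp.length, act.length);
        for (int i = 0; i < n; i++) {
            // A missing line on either side is treated as an empty line.
            String e = i < exp.length ? normalize(exp[i]) : "";
            String a = i < act.length ? normalize(act[i]) : "";
            if (!e.equals(a)) {
                problems.add("line " + (i + 1) + ": expected \"" + e
                        + "\" but got \"" + a + "\"");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String expected = "Jeanette Acosta; Service Manager\nMad Fox Brewing Company";
        // Assumption: in a real test this would be the text stripper's output.
        String actual = "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r\n"
                + "Mad Fox Brewing Company";
        for (String problem : diff(expected, actual)) {
            System.out.println(problem);
        }
    }
}
```

Because whitespace is normalized before comparing, pure layout changes do not 
fail the check, while spurious spaces inserted between letters still do.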

In addition, Apache Tika runs such a test against a very large testbed, 
normally before we do a new release. [~tilman] and [[email protected]] can 
elaborate a little more on that. The test suite is very good and has helped us 
find regressions in the past. A lot of analysis is done: extraction covers not 
only 'pure' text but also annotations, form fields and metadata. Apache Tika 
also has its own VM to run this, so it is dedicated infrastructure.

> Optimize tolerance settings
> ---------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>
> From testing on my internal dataset I believe there might be some regression 
> in the effectiveness of PDFTextStripper.
> Here's an [example 
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
>  I found on the web, which converted better in 1.8 than 2.0. Notice that it 
> extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
> M a d  F o x  B r e w i n g  C o m p a n y". It doesn't seem like there's 
> very much space between the letters in the pdf, so it's curious to me that it 
> didn't do too well.
> I realize this is an area where we probably can't strive for perfection. Yet, 
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
> believe there's some sort of regression test for PDFToImage which exports a 
> set of pdfs to images at two different commits and looks at what the 
> differences are. Do we have the same sort of thing for PDFTextStripper? If 
> not, can we build one by pulling docs off the public web? I'd be willing to 
> contribute to this endeavor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

