[jira] [Commented] (PDFBOX-3019) Optimize tolerance settings

Tilman Hausherr (JIRA) Mon, 12 Oct 2015 23:24:03 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954479#comment-14954479
 ]


Tilman Hausherr commented on PDFBOX-3019:
-----------------------------------------

http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/

There are about 250,000 PDF files within these zip files. Very few of them are malware, according to this Metascan log:
http://digitalcorpora.org/corp/files/govdocs1/MetascanClientLog_201306281214.txt



> Optimize tolerance settings
> ---------------------------
>
>                 Key: PDFBOX-3019
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3019
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Ben McCann
>             Fix For: 2.0.0
>
>
> From testing on my internal dataset I believe there might be some regression 
> in the effectiveness of PDFTextStripper.
> Here's an [example 
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
>  I found on the web, which converted better in 1.8 than 2.0. Notice that it 
> extracts "J e a n e t t e  A c o s t a ;  S e r v i c e  M a n a g e r  a t  
> M a d  F o x  B r e w i n g  C o m p a n y". It doesn't seem like there's 
> very much space between the letters in the pdf, so it's curious to me that it 
> didn't do too well.
> I realize this is an area where we probably can't strive for perfection. Yet, 
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I 
> believe there's some sort of regression test for PDFToImage which exports a 
> set of pdfs to images at two different commits and looks at what the 
> differences are. Do we have the same sort of thing for PDFTextStripper? If 
> not, can we build one by pulling docs off the public web? I'd be willing to 
> contribute to this endeavor.
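
One way such a text regression check could flag the symptom quoted above (letters separated by spurious spaces) is sketched here. This is only an illustration under stated assumptions, not part of PDFBox: the class LetterSpacingCheck, its methods, and the minRun and margin parameters are hypothetical names, and the heuristic simply measures how many whitespace-separated tokens are single letters or digits sitting in a long run. The two input strings would be the text extracted from the same PDF by two different builds.

```java
// Hypothetical helper, not a PDFBox API: a heuristic for spotting
// letter-spaced extraction output such as "J e a n e t t e".
final class LetterSpacingCheck {
    private LetterSpacingCheck() {}

    /**
     * Returns the fraction of whitespace-separated tokens that are single
     * letters or digits and belong to a run of at least minRun such tokens.
     */
    static double spacedOutRatio(String text, int minRun) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int flagged = 0; // tokens counted as part of a spaced-out run
        int run = 0;     // length of the current run of single-character tokens
        for (String token : tokens) {
            boolean single = token.length() == 1
                    && Character.isLetterOrDigit(token.charAt(0));
            if (single) {
                run++;
            } else {
                if (run >= minRun) {
                    flagged += run;
                }
                run = 0;
            }
        }
        if (run >= minRun) {
            flagged += run; // close a run that reaches the end of the text
        }
        return (double) flagged / tokens.length;
    }

    /**
     * Reports a suspected regression when the new output is spaced out
     * noticeably more than the old output (margin is an assumed threshold).
     */
    static boolean looksRegressed(String oldText, String newText,
                                  int minRun, double margin) {
        return spacedOutRatio(newText, minRun)
                > spacedOutRatio(oldText, minRun) + margin;
    }
}
```

A harness could run both builds over a corpus, call looksRegressed on each pair of outputs, and list the files that trip it for manual review. The thresholds would need tuning, since some documents legitimately contain runs of single-letter tokens.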



