[
https://issues.apache.org/jira/browse/PDFBOX-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954479#comment-14954479
]
Tilman Hausherr commented on PDFBOX-3019:
-----------------------------------------
http://digitalcorpora.org/corp/nps/files/govdocs1/zipfiles/
The zip files there contain about 250,000 PDF files in total. Very few of them are malware:
http://digitalcorpora.org/corp/files/govdocs1/MetascanClientLog_201306281214.txt
> Optimize tolerance settings
> ---------------------------
>
> Key: PDFBOX-3019
> URL: https://issues.apache.org/jira/browse/PDFBOX-3019
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Ben McCann
> Fix For: 2.0.0
>
>
> From testing on my internal dataset I believe there might be some regression
> in the effectiveness of PDFTextStripper.
> Here's an [example
> doc|http://rampages.us/rhodesc1/wp-content/uploads/sites/4737/2014/07/Resume-Connor-Rhodes.pdf]
> I found on the web, which converted better in 1.8 than 2.0. Notice that it
> extracts "J e a n e t t e A c o s t a ; S e r v i c e M a n a g e r a t
> M a d F o x B r e w i n g C o m p a n y". It doesn't seem like there's
> very much space between the letters in the PDF, so it's curious to me that it
> didn't do too well.
> I realize this is an area where we probably can't strive for perfection. Yet,
> it does seem to me that from 1.8 to 2.0 we may have taken a step backwards. I
> believe there's some sort of regression test for PDFToImage which exports a
> set of PDFs to images at two different commits and looks at what the
> differences are. Do we have the same sort of thing for PDFTextStripper? If
> not, can we build one by pulling docs off the public web? I'd be willing to
> contribute to this endeavor.
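
As a rough illustration of what a text-extraction regression check could look for, here is a minimal sketch in plain Java. It is not part of PDFBox: the class SpacedTextCheck, its method names, and the tolerance parameter are all hypothetical. It assumes the text of the same document has already been extracted twice (for example with PDFTextStripper.getText under 1.8 and under 2.0) and only measures how much of each result consists of single characters separated by blanks, which is the symptom described in the issue.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for comparing two extractions; not a PDFBox class. */
public final class SpacedTextCheck {

    // A run of at least four single non-blank characters, each separated
    // from the next by exactly one space, e.g. "J e a n".
    private static final Pattern SPACED_RUN =
            Pattern.compile("(?<!\\S)\\S(?: \\S){3,}(?!\\S)");

    private SpacedTextCheck() {
    }

    /** Fraction of non-whitespace characters that sit inside spaced-out runs. */
    public static double spacedRatio(String text) {
        int total = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                total++;
            }
        }
        if (total == 0) {
            return 0.0;
        }
        int spaced = 0;
        Matcher m = SPACED_RUN.matcher(text);
        while (m.find()) {
            // A run of n characters joined by n-1 spaces has length 2n-1.
            spaced += (m.group().length() + 1) / 2;
        }
        return (double) spaced / total;
    }

    /**
     * True if the newer extraction is noticeably more letter-spaced than the
     * older one. The tolerance is an arbitrary threshold chosen by the caller.
     */
    public static boolean looksRegressed(String before, String after, double tolerance) {
        return spacedRatio(after) - spacedRatio(before) > tolerance;
    }
}
```

A harness along these lines would run both versions over a corpus, call looksRegressed on each pair of outputs, and list the documents that trip the threshold for manual review. The heuristic only catches this one symptom; a fuller comparison would also diff the two outputs directly.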