https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
This run was against the full corpus, not just PDFs. I used a fairly recent nightly build of PDFBox and POI's 3.15-rc1.

The one major new exception that shows up for PDF files was apparently fixed before 2.0.3. So, please ignore that one!

There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit. Looks like ~2 million more "common English words" via Tilman's methodology.

Let me know if you have any questions.

Cheers,
Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, September 12, 2016 12:58 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3?

On 12.09.2016 at 18:47, Allison, Timothy B. wrote:
> Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/
> Tika 1.13).

Yes please, when you have the time, I expect no more changes.

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org