https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip

This run was against the full corpus, not just PDFs.  I used a fairly recent 
nightly build of PDFBox and POI's 3.15-rc1.

The one major new exception that shows up for PDF files was apparently fixed 
before 2.0.3, so please ignore that one!

There are some regressions in content extraction, but overall it looks to 
have improved quite a bit: it looks like ~2 million more "common English 
words" were extracted, as measured by Tilman's methodology.

Let me know if you have any questions.

Cheers,

         Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, September 12, 2016 12:58 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3?

Am 12.09.2016 um 18:47 schrieb Allison, Timothy B.:
> Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/ 
> Tika 1.13).

Yes please, when you have the time, I expect no more changes.

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org


