Re: PDFBox 2.0.3 TIKA comparison

Tilman Hausherr Wed, 14 Sep 2016 09:52:26 -0700

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:

https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip


This run was against the full corpus, not just PDFs.  I used a fairly recent 
nightly build of PDFBox and POI's 3.15-rc1.

The one major new exception for PDF files appears to have been fixed before 
2.0.3.  So, please ignore that one!

There are some regressions in content extraction, but overall, content extraction looks 
to have improved quite a bit.  Looks like ~2 million more "common English 
words" via Tilman's methodology.

Let me know if you have any questions.


I wonder what happened here:
commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM

metadata went from 6766 to 4134.

Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman


