On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip

This run was against the full corpus, not just PDFs.  I used a fairly recent 
nightly build of PDFBox and POI's 3.15-rc1.

The one major new exception for PDF files was apparently fixed before 
2.0.3.  So, please ignore that one!

There are some regressions in content extraction, but overall it looks 
to have improved quite a bit.  Looks like ~2 million more "common English 
words" via Tilman's methodology.

Let me know if you have any questions.

I wonder what happened here:
commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM

Metadata went from 6766 to 4134.

Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman


