On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
This run was against the full corpus, not just PDFs. I used a fairly recent
nightly build of PDFBox and POI's 3.15-rc1.
The one major new exception for PDF files appears to have been fixed before
2.0.3. So, please ignore that one!
There are some regressions in content extraction, but overall it looks
to have improved quite a bit: roughly 2 million more "common English
words" by Tilman's methodology.
Let me know if you have any questions.
I wonder what happened here:
commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM
metadata went from 6766 to 4134.
Is this a Tika thing, or is it because of a change from xmpbox to jempbox?
Tilman