Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=3&rev2=4

  This module is designed to be used to help with Tika, but it could be used to evaluate other tools as well.

  = Background =
+
+ There are many tools for extracting text from various file formats, and even within a single tool there are usually countless parameters that can be tweaked.
+ The goal of 'tika-eval' is to allow integrators to quickly compare the output of:
+  1. Two different tools
+  2. Two versions of the same tool ("Should we upgrade? Or are there problems with the newer version?")
+  3. Two runs with the same tool but with different settings ("Does increasing the DPI for OCR improve extraction? Let's try two runs, one with 200 DPI and one with 300")
+  4. Different tools against a gold standard
+
+ In addition to this "comparison mode", there is also plenty of information one can get from looking at a single run.
+
+ Some basic metrics might include:
+
+  * Exceptions -- how many and of what category? Are these regular catchable exceptions, evil OOMs or permahangs?
+  * Metadata -- how many metadata values did we extract?
+  * Embedded files/attachments -- how many embedded files were found?
+  * Mime detection -- how many of what types of files do we have? Where do we see discrepancies between tools?
+  * Content -- is the content extracted by tool A better than that extracted by tool B? What languages do we have?

  = Quick Start Usage =
