The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=5&rev2=6

  3. Two runs with the same tool but with different settings ("Does increasing the DPI for OCR improve extraction? Let's try two runs, one with 200 DPI and one with 300")
  4. Different tools against a gold standard

- In addition to this "comparison mode", there is also plenty of information one can get from looking at a single run.
+ In addition to this "comparison mode", there is also plenty of information one can get from looking at a profile of a single run.

- Some basic metrics might include:
+ Some basic metrics for both the "comparison" and "profiling" mode might include:
  * Exceptions -- how many and of what category? Are these regular catchable exceptions, evil OOMs or permahangs?
  * Metadata -- how many metadata values did we extract?
  * Embedded files/attachments -- how many embedded files were found
  * Mime detection -- how many of what types of files do we have? Where do we see discrepancies between tools?
- * Content -- is the content extracted by tool A better than that extracted by tool B? What languages do we have
+ * Content -- is the content extracted by tool A better than that extracted by tool B? On which files is there a big difference in extracted content?

  = Quick Start Usage =
