The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=3&rev2=4

  This module is designed primarily to help evaluate Tika, but it could be used 
to evaluate other tools as well.
  
  = Background =
+ 
+ There are many tools for extracting text from various file formats, and even 
within a single tool there are usually countless parameters that can be tweaked.
+ The goal of 'tika-eval' is to allow integrators to quickly compare the output 
of:
+  1. Two different tools
+  2. Two versions of the same tool ("Should we upgrade?  Or are there problems 
with the newer version?")
+  3. Two runs with the same tool but with different settings ("Does increasing 
the DPI for OCR improve extraction? Let's try two runs, one with 200 DPI and 
one with 300 DPI.")
+  4. Different tools against a gold standard
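
As a rough illustration of what comparing the output of two runs can mean for a single file, here is a minimal Python sketch that scores how much extracted text two runs share. It is not the tika-eval implementation: the function names `tokenize` and `token_overlap` and the choice of a Dice coefficient over word counts are assumptions made for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_overlap(text_a, text_b):
    """Dice coefficient over token counts (hypothetical helper).

    Returns 1.0 when both extractions contain the same tokens with the
    same frequencies, and 0.0 when they share no tokens at all.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        # Both extractions are empty, so they trivially agree.
        return 1.0
    # Counter intersection keeps the minimum count of each shared token.
    shared = sum((counts_a & counts_b).values())
    return 2 * shared / total
```

A low score for a file flags it for manual review; it does not say which of the two runs produced the better text.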
+ 
+ In addition to this "comparison mode", there is also plenty of information 
one can get from looking at a single run.
+ 
+ Some basic metrics might include:
+ 
+  * Exceptions -- how many and of what category?  Are these regular catchable 
exceptions, evil OOMs or permahangs?
+  * Metadata -- how many metadata values did we extract?
+  * Embedded files/attachments -- how many embedded files were found?
+  * Mime detection -- how many of what types of files do we have?  Where do we 
see discrepancies between tools?
+  * Content -- is the content extracted by tool A better than that extracted 
by tool B?  What languages do we have?
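
The metrics above could be tallied along the following lines. This is a sketch under stated assumptions, not tika-eval's actual data model: each run is assumed to be a list of per-file dicts with the hypothetical keys `path`, `mime`, `exception`, `metadata_values` and `embedded_files`, and the function names are invented for illustration.

```python
from collections import Counter


def summarize_run(records):
    """Aggregate basic metrics for one run (hypothetical record layout)."""
    # Count exceptions by category, skipping files that parsed cleanly.
    exceptions = Counter(r["exception"] for r in records if r.get("exception"))
    # Count how many files were detected as each mime type.
    mimes = Counter(r["mime"] for r in records)
    return {
        "files": len(records),
        "exceptions": dict(exceptions),
        "mime_counts": dict(mimes),
        "metadata_values": sum(r.get("metadata_values", 0) for r in records),
        "embedded_files": sum(r.get("embedded_files", 0) for r in records),
    }


def mime_discrepancies(run_a, run_b):
    """List (path, mime_a, mime_b) for files the two runs detect differently."""
    mime_b = {r["path"]: r["mime"] for r in run_b}
    return sorted(
        (r["path"], r["mime"], mime_b[r["path"]])
        for r in run_a
        if r["path"] in mime_b and r["mime"] != mime_b[r["path"]]
    )
```

Calling `summarize_run` on a single run covers the single-run view, while `mime_discrepancies` applies only when two runs over the same files are available.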
  
  = Quick Start Usage =
  
