The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=5&rev2=6

   3. Two runs with the same tool but with different settings ("Does increasing 
the DPI for OCR improve extraction? Let's try two runs, one with 200 DPI and 
one with 300")
   4. Different tools against a gold standard
  
- In addition to this "comparison mode", there is also plenty of information 
one can get from looking at a single run.
+ In addition to this "comparison mode", there is also plenty of information 
one can get from looking at a profile of a single run.
  
- Some basic metrics might include:
+ Some basic metrics for both the "comparison" and "profiling" modes might 
include:
  
   * Exceptions -- how many and of what category?  Are these regular catchable 
exceptions, evil OOMs or permahangs?
   * Metadata -- how many metadata values did we extract?
   * Embedded files/attachments -- how many embedded files were found?
   * Mime detection -- how many of what types of files do we have?  Where do we 
see discrepancies between tools?
-  * Content -- is the content extracted by tool A better than that extracted 
by tool B?  What languages do we have
+  * Content -- is the content extracted by tool A better than that extracted 
by tool B?  On which files is there a big difference in extracted content?
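
The metrics listed above can be sketched in a few lines of Python. This is only an illustration, not how tika-eval itself is implemented: the per-file record layout (`exception`, `metadata`, `embedded`, `mime`, `content`), the function names, and the 0.5 threshold are all assumptions made for the example, and the content comparison uses a simple token-overlap (Dice) score as a stand-in for whatever similarity measure a real evaluation would use.

```python
from collections import Counter


def profile(run):
    """Profile a single run (hypothetical record layout, see lead-in).

    `run` maps a file name to a dict with the keys
    exception (str or None), metadata (dict), embedded (int),
    mime (str) and content (str).
    """
    records = list(run.values())
    return {
        # Exceptions: how many, and of what category?
        "exceptions": Counter(r["exception"] for r in records if r["exception"]),
        # Metadata: how many metadata values were extracted?
        "metadata_values": sum(len(r["metadata"]) for r in records),
        # Embedded files/attachments: how many were found?
        "embedded_files": sum(r["embedded"] for r in records),
        # Mime detection: how many files of each type?
        "mime_types": Counter(r["mime"] for r in records),
    }


def token_overlap(text_a, text_b):
    """Dice coefficient over whitespace-separated, lower-cased tokens."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # two empty extractions are treated as identical
    shared = sum((a & b).values())  # multiset intersection
    return 2 * shared / total


def compare(run_a, run_b, threshold=0.5):
    """Compare two runs on the files that both of them processed."""
    common = sorted(run_a.keys() & run_b.keys())
    return {
        # Where do the tools disagree on the detected type?
        "mime_discrepancies": [
            f for f in common if run_a[f]["mime"] != run_b[f]["mime"]
        ],
        # On which files is there a big difference in extracted content?
        "big_content_differences": [
            f for f in common
            if token_overlap(run_a[f]["content"], run_b[f]["content"]) < threshold
        ],
    }
```

Calling `profile(run)` on one run gives the "profiling" view, while `compare(run_a, run_b)` gives the "comparison" view and returns the files that deserve a closer look.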
  
  = Quick Start Usage =
  
