Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=8&rev2=9

  = More detailed usage =
  
  == Evaluating Success via Common Words ==
- In the absence of ground truth, it is often helpful to count the number of 
common words that were extracted.  Tilman Hausherr originally recommended this 
metric.
+ In the absence of ground truth, it is often helpful to count the number of 
common words that were extracted (see TikaEvalMetrics for a discussion of this).
- For our initial collaboration with PDFBox, we found a list of common English 
words and removed those that had fewer than four characters.
- The intuition is that if tool A extracts 500 common words, but tool B extracts 
1,000, there is ''some'' indication that tool B may have done a better job.
  
  "Common words" are specified per language in the "resources/commonwords" 
directory.  
  Each file is named for the language code, e.g. 'en', and each file is a UTF-8 
text file with one word per line.
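
The file layout described above (one UTF-8 file per language code, one word per line) can be illustrated with a minimal sketch. This is not the tika-eval implementation: the function names `load_common_words` and `count_common_words` are hypothetical, and the lowercasing and the `\w+` tokenization are assumptions made only for illustration. The tool's actual tokenization may differ.

```python
import re
from pathlib import Path


def load_common_words(commonwords_dir, lang):
    """Load the common-word list for one language.

    Assumes the layout described above: a file named for the language
    code (e.g. 'en'), encoded as UTF-8, with one word per line.
    Blank lines are skipped; lowercasing is an assumption of this sketch.
    """
    path = Path(commonwords_dir) / lang
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def count_common_words(text, common_words):
    """Count how many tokens of the extracted text are common words.

    Tokenization here is a simple word-character regex, chosen only
    for illustration.
    """
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in common_words)
```

Running `count_common_words` over the text that two tools extract from the same file gives the two counts being compared: a noticeably higher count for one tool suggests, but does not prove, that it extracted more usable text.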
