[Tika Wiki] Update of "TikaEval" by TimothyAllison

Apache Wiki Fri, 10 Feb 2017 09:48:42 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=2&rev2=3

  = More detailed usage =
  
  == Evaluating Success via Common Words ==
- In the absence of ground truth, it is often helpful to count the number of 
common words that were extracted.  
+ In the absence of ground truth, it is often helpful to count the number of 
common words that were extracted.  Tilman Hausherr originally recommended this 
metric.
+ For our initial collaboration with PDFBox, we found a list of common English 
words and removed those that had fewer than four characters.
- If tool A extracts 500, but tool B extracts 1,000, there is ''some'' 
information that tool B did a better job.
+ The intuition is that if tool A extracts 500, but tool B extracts 1,000, 
there is ''some'' information that tool B may have done a better job.
- Tilman Hausherr originally recommended this metric.
  
  "Common words" are specified per language in the "resources/commonwords" 
directory.  
  Each file is named for the language code, e.g. 'en', and each file is a UTF-8 
text file with one word per line.
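
As a rough sketch of the file format described above, the following Java reads a UTF-8 common-words file with one word per line. The class name `CommonWordsLoader` and its methods are hypothetical and are not tika-eval's actual API; skipping blank lines and trimming whitespace are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of tika-eval.
class CommonWordsLoader {

    // Parses file content that holds one word per line.
    // Assumption: blank lines are ignored and surrounding whitespace is trimmed.
    static Set<String> parse(String content) {
        Set<String> words = new HashSet<>();
        for (String line : content.split("\\R")) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Loads the file named for the language code (e.g. "en") from the given
    // directory, decoding it as UTF-8.
    static Set<String> load(Path commonWordsDir, String langCode) throws IOException {
        Path file = commonWordsDir.resolve(langCode);
        return parse(Files.readString(file, StandardCharsets.UTF_8));
    }
}
```

With this sketch, a call such as `CommonWordsLoader.load(Path.of("resources/commonwords"), "en")` would return the English list as a set.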
  
  The token processor runs language identification on the content and then selects the 
appropriate set of common words for its counts.  If there is no common words 
file for the detected language, it falls back to the default list, which is currently 
hardcoded to 'en'.
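
The selection and fallback logic described above could look roughly like the following. This is a minimal sketch, not the token processor's real code: the class `CommonWordCounter` is hypothetical, the detected language code is assumed to come from a separate language identifier, and counting every occurrence of a common word (rather than unique words) is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-language selection with a default fallback.
class CommonWordCounter {

    // The wiki text says the default list is currently hardcoded to 'en'.
    static final String DEFAULT_LANG = "en";

    // Returns the list for the detected language, or the default list
    // when no common-words file exists for that language.
    static Set<String> select(Map<String, Set<String>> wordsByLang, String detectedLang) {
        Set<String> words = wordsByLang.get(detectedLang);
        if (words != null) {
            return words;
        }
        return wordsByLang.getOrDefault(DEFAULT_LANG, Set.of());
    }

    // Counts how many extracted tokens appear in the common-words set.
    // Assumption: every occurrence counts, not just distinct words.
    static int count(List<String> tokens, Set<String> commonWords) {
        int hits = 0;
        for (String token : tokens) {
            if (commonWords.contains(token)) {
                hits++;
            }
        }
        return hits;
    }
}
```

Comparing the counts that two extraction tools produce for the same document then gives the "tool A vs. tool B" signal discussed earlier in this section.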
  
- Make sure that your common words have gone through the same analysis chain as 
specified by the Common Words analyzer in 'analyzers.json'. 
+ Make sure that your common words have gone through the same analysis chain as 
specified by the Common Words analyzer in 'analyzers.json'! 
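
To illustrate why this matters, here is a small sketch in which the word list and the extracted tokens pass through one shared normalization step. The `normalize` method (trim plus lowercase) is only a stand-in assumed for this example; the real chain is whatever the Common Words analyzer in 'analyzers.json' specifies, and the class name `SharedNormalization` is hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: the same normalization is applied to both sides.
class SharedNormalization {

    // Stand-in for the real analysis chain (assumption: trim + lowercase).
    static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    // Normalizes the raw common-words list before it is used for lookups.
    static Set<String> normalizeWordList(List<String> rawWords) {
        Set<String> normalized = new HashSet<>();
        for (String word : rawWords) {
            String n = normalize(word);
            if (!n.isEmpty()) {
                normalized.add(n);
            }
        }
        return normalized;
    }

    // Normalizes each token the same way before checking membership.
    static int countCommon(List<String> rawTokens, Set<String> normalizedWords) {
        int hits = 0;
        for (String token : rawTokens) {
            if (normalizedWords.contains(normalize(token))) {
                hits++;
            }
        }
        return hits;
    }
}
```

If the list were left as 'That' and 'WITH' while tokens were lowercased, no token would match and the common-words count would be misleadingly low.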
  
  == Reading Extracts ==
