[Tika Wiki] Update of "TikaEval" by TimothyAllison

Apache Wiki Fri, 10 Feb 2017 09:46:43 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=1&rev2=2

- = Overview of the 'tika-eval' Module=
+ = Overview of the 'tika-eval' Module =
- While not yet available, this page offers a first draft of the documentation 
for the tika-eval module.
+ The tika-eval module is not yet available; this page offers a first draft of 
its documentation.
  
  The module is intended to enable comparisons between tools or to provide 
insight from a single run.  
  It is designed primarily to help with Tika, but it could be used to evaluate 
other tools as well.
@@ -27, +27 @@

   1. Create two directories of extract files that mirror your input directory. 
These files may be UTF-8 text files with '.txt' appended to the original file's 
name or they may be the !RecursiveParserWrapper's '.json' representation from 
tika-app's '-J -t' option.
   
   2. Compare the extract directory A with extract directory B and write 
results to a local H2 database:
-     `java -jar tika-eval.X.Y.jar Profile -extractDirA tika_1_14 -extractDirB 
tika_1_15 -db comparisondb`
+     `java -jar tika-eval.X.Y.jar Compare -extractDirA tika_1_14 -extractDirB 
tika_1_15 -db comparisondb`
   
   3.#3 Write reports from the database:
      `java -jar tika-eval.X.Y.jar Report -db comparisondb`
@@ -46, +46 @@

  
  = More detailed usage =
  
+ == Evaluating Success via Common Words ==
+ In the absence of ground truth, it is often helpful to count the number of 
common words that were extracted.  
+ If tool A extracts 500 common words but tool B extracts 1,000, there is 
''some'' evidence that tool B did a better job.
+ Tilman Hausherr originally recommended this metric.
  
+ "Common words" are specified per language in the "resources/commonwords" 
directory.  
+ Each file is named for the language code, e.g. 'en', and each file is a UTF-8 
text file with one word per line.
+ 
+ The token processor runs language identification on the content and then 
selects the appropriate set of common words for its counts.  If there is no 
common words file for the detected language, it backs off to the default list, 
which is currently hardcoded to 'en'.
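
As a rough illustration of the counting and back-off behaviour described above, here is a minimal Python sketch. It is not the module's implementation: the function names, the in-memory word lists, and the choice to count every occurrence (rather than unique words) are assumptions made for the example.

```python
DEFAULT_LANG = "en"  # assumption: mirrors the hardcoded default mentioned above


def load_common_words(text):
    """Parse the body of a common words file: one word per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def count_common_words(tokens, lang, common_by_lang):
    """Count tokens found in the list for `lang`, backing off to the default."""
    common = common_by_lang.get(lang, common_by_lang[DEFAULT_LANG])
    return sum(1 for token in tokens if token in common)


# Hypothetical data: a tiny 'en' list and the tokens extracted by two tools.
common_by_lang = {"en": load_common_words("the\nand\nof\n")}
tokens_a = "the cat and the hat".split()
tokens_b = "th c@t nd th h4t".split()

count_a = count_common_words(tokens_a, "en", common_by_lang)
count_b = count_common_words(tokens_b, "en", common_by_lang)
# No list exists for 'xx' here, so the lookup falls back to 'en'.
count_fallback = count_common_words(tokens_a, "xx", common_by_lang)
```

In this toy case the cleaner extract scores higher, which is the kind of signal the metric is meant to surface when no ground truth is available.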
+ 
+ Make sure that your common words have gone through the same analysis chain as 
specified by the Common Words analyzer in 'analyzers.json'. 
+ 
+ == Reading Extracts ==
+ 
+ === alterMetadata ===
+ Let's say you want to compare the output of Tika to another tool that 
extracts text.  You happen to have a directory of .json files for Tika and a 
directory of UTF-8 .txt files from the other tool.
+ 
+  1. If the other tool extracts embedded content, you'd want to concatenate 
all the content within Tika's .json file for a fair comparison:
+     `java -jar tika-eval.X.Y.jar Compare -extractDirA tika_1_14 -extractDirB 
tika_1_15 -db comparisondb -alterMetadata concatenate_content`
+  
+  2.#2 If the other tool does not extract embedded content, you'd only want to 
look at the first metadata object (representing the container file) in the 
.json file:
+     `java -jar tika-eval.X.Y.jar Compare -extractDirA tika_1_14 -extractDirB 
tika_1_15 -db comparisondb -alterMetadata first_only`
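
To make the difference between the two options concrete, the following Python sketch shows what each one conceptually does to a '.json' extract. It assumes the extract is a JSON list of metadata objects whose text is stored under the `X-TIKA:content` key; the helper functions are hypothetical and only approximate the behaviour described above.

```python
import json

CONTENT_KEY = "X-TIKA:content"  # assumed key holding each object's text


def first_only(metadata_list):
    """Text of the first metadata object only (the container file)."""
    if not metadata_list:
        return ""
    return metadata_list[0].get(CONTENT_KEY) or ""


def concatenate_content(metadata_list):
    """Text of the container and all embedded documents, joined in order."""
    return "\n".join(m[CONTENT_KEY] for m in metadata_list if m.get(CONTENT_KEY))


# Hypothetical extract: a container document with one embedded attachment.
extract = json.loads(
    '[{"X-TIKA:content": "container text"},'
    ' {"X-TIKA:content": "attachment text"}]'
)

container_only = first_only(extract)
everything = concatenate_content(extract)
```

Comparing `everything` against a tool that skips attachments would penalise that tool unfairly, which is why the option should match what the other tool extracts.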
  
  == Reports ==
  The tika-eval module comes with a set of reports.  However, you might want 
to generate your own.  Each report is specified by SQL and a few other 
configuration settings in an XML file.  See `comparison-reports.xml` and 
`profile-reports.xml` to get a sense of the syntax.
