[Tika Wiki] Update of "TikaEval" by TimothyAllison

Apache Wiki Tue, 08 Jan 2019 12:46:37 -0800

The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=18&rev2=19

   * Mime detection -- how many of what types of files do we have?  Where do we 
see discrepancies between tools?
   * Content -- is the content extracted by tool A better than that extracted 
by tool B?  On which files is there a big difference in extracted content?
  
+ The tika-eval module was initially developed for text only.  For those 
interested in evaluating structure/style components (e.g. <title/> or <b/> 
elements), see TikaEvalAndStructuralComponents.
+ 
  = Quick Start Usage =
  
  '''NOTE:''' tika-eval will not overwrite the contents of the database you 
specify in Profile or Compare mode.  Add `-drop` to the command line to drop 
the existing tables if you are reusing the database.
@@ -33, +35 @@

  The following assumes that you are using the default in-memory H2 database.  
To connect tika-eval to your own db via jdbc see TikaEvalJdbc.
  
  == Single Output from One Tool (Profile) ==
+ '''NOTE:''' assume the original input files are in a directory named 
`input_docs` and that the text extracts are written to the `extracts` 
directory, with each extract file having the same sub-directory path and same 
file name with '.json' or '.txt' appended to it.
+ 
-  1. Create a directory of extract files that mirrors your input directory. 
These files may be UTF-8 text files with '.txt' appended to the original file's 
name or they may be the !RecursiveParserWrapper's '.json' representation from 
tika-app's '-J -t' option.
+  1. Create a directory of extract files that mirrors your input directory. 
These files may be UTF-8 text files with '.txt' appended to the original file's 
name or they may be the !RecursiveParserWrapper's '.json' representation: `java 
-jar tika-app-X.Y.jar -J -t -i input_docs -o extracts`
   
   2. Profile the directory of extracts and create a local H2 database: 
-     `java -jar tika-eval-X.Y.jar Profile -extracts json -db profiledb`
+     `java -jar tika-eval-X.Y.jar Profile -extracts extracts -db profiledb`
   
-  3.#3 Write reports from the database:
+  3. Write reports from the database:
  
      `java -jar tika-eval-X.Y.jar Report -db profiledb`
  
  You'll have a directory of .xlsx reports under the "reports" directory.
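
Before profiling, it can help to confirm that the extracts directory really mirrors the input directory. The following is a minimal sketch, assuming the layout described in the note above (`input_docs` and `extracts`, with '.json' or '.txt' appended to each original file name). The function name `missing_extracts` is hypothetical and is not part of tika-eval.

```python
from pathlib import Path

# Suffixes that may be appended to the original file name, per the note above.
EXTRACT_SUFFIXES = (".json", ".txt")


def missing_extracts(input_dir, extracts_dir):
    """Return relative paths of input files that have no mirrored extract.

    Hypothetical helper: it only inspects the file system and does not
    call tika-app or tika-eval.
    """
    input_root = Path(input_dir)
    extracts_root = Path(extracts_dir)
    missing = []
    for src in sorted(input_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(input_root)
        # e.g. sub/report.pdf -> sub/report.pdf.json or sub/report.pdf.txt
        candidates = [extracts_root / (str(rel) + suffix) for suffix in EXTRACT_SUFFIXES]
        if not any(c.is_file() for c in candidates):
            missing.append(rel.as_posix())
    return missing
```

Calling `missing_extracts("input_docs", "extracts")` returns an empty list when every input file has a matching extract. Any paths it does return are files for which extraction may have failed or been skipped.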
  
  == Comparing Output from Two Tools/Settings (Compare) ==
+ '''NOTE:''' assume the original input files are in a directory named 
`input_docs` and that the text extracts from tool A are written to the 
`extractsA` directory and the extracts from tool B are written to `extractsB`.
  
-  1. Create two directories of extract files that mirror your input directory. 
These files may be UTF-8 text files with '.txt' appended to the original file's 
name or they may be the !RecursiveParserWrapper's '.json' representation from 
tika-app's '-J -t' option.
+  1. Create two directories of extract files that mirror your input directory. 
These files may be UTF-8 text files with '.txt' appended to the original file's 
name or they may be the !RecursiveParserWrapper's '.json' representation.
   
   2. Compare the extract directory A with extract directory B and write 
results to a local H2 database:
-     `java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB 
tika_1_15 -db comparisondb`
+     `java -jar tika-eval-X.Y.jar Compare -extractsA extractsA -extractsB 
extractsB -db comparisondb`
   
   3.#3 Write reports from the database:
      `java -jar tika-eval-X.Y.jar Report -db comparisondb`
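
Steps 2 and 3 can be scripted together. The sketch below only assembles the two command lines shown above as argument lists, using the flags that appear on this page. The function names are hypothetical, and the jar path stays a placeholder that you must replace with your actual tika-eval jar.

```python
import subprocess


def compare_and_report_commands(jar, extracts_a, extracts_b, db):
    """Build the Compare and Report command lines as argument lists.

    Hypothetical helper: it uses only the flags shown on this page
    (-extractsA, -extractsB, -db) and does not run anything.
    """
    compare = [
        "java", "-jar", jar, "Compare",
        "-extractsA", extracts_a,
        "-extractsB", extracts_b,
        "-db", db,
    ]
    report = ["java", "-jar", jar, "Report", "-db", db]
    return [compare, report]


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_commands(compare_and_report_commands("tika-eval-X.Y.jar", "extractsA", "extractsB", "comparisondb"))` would run the comparison and then write the reports, assuming `java` is on the PATH and the jar name has been filled in.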
@@ -58, +63 @@

  
  == Investigating the Database ==
  
-  1. Fire up the H2 localhost server:
+  1. Launch the H2 localhost server:
      `java -jar tika-eval-X.Y.jar StartDB` -- this calls `java -cp ... 
org.h2.tools.Console -web`
-  2.#2 Navigate a browser to `http://localhost:8082` and enter the jdbc 
connector code followed by the '''full path''' to your db file:
+  2. Navigate a browser to `http://localhost:8082` and enter the jdbc 
connector code followed by the '''full path''' to your db file:
      `jdbc:h2:/C:/users/someone/mystuff/tika-eval/comparisondb`
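
The connection string above is the `jdbc:h2:` prefix followed by the full path to the db file. As a minimal sketch, the hypothetical helper below builds that string from an absolute path. The leading slash before a Windows drive letter is copied from the example above and is not otherwise verified against H2.

```python
from pathlib import Path, PurePath


def h2_jdbc_url(db_path):
    """Return a 'jdbc:h2:' connection string for an absolute db path.

    Hypothetical helper: it mirrors the form of the example on this page
    and does not check that the database exists.
    """
    path = db_path if isinstance(db_path, PurePath) else Path(db_path)
    if not path.is_absolute():
        raise ValueError("the full path to the db file is required: %s" % path)
    posix = path.as_posix()
    # Windows paths render as 'C:/...'; the example on this page uses '/C:/...'.
    if not posix.startswith("/"):
        posix = "/" + posix
    return "jdbc:h2:" + posix
```

A relative path such as `comparisondb` raises a `ValueError`, which reflects the requirement above that the '''full path''' be given.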
  
  If your reaction is: "You call this a database?!", please open tickets and 
contribute to improving the structure.
