Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=18&rev2=19

   * Mime detection -- how many of what types of files do we have? Where do we see discrepancies between tools?
   * Content -- is the content extracted by tool A better than that extracted by tool B? On which files is there a big difference in extracted content?

+ The tika-eval module was initially developed for text only. For those interested in evaluating structure/style components (e.g. <title/> or <b/> elements), see TikaEvalAndStructuralComponents.
+
  = Quick Start Usage =
  '''NOTE:''' tika-eval will not overwrite the contents of the database you specify in Profile or Compare mode. Add `-drop` to the commandline to drop tables if you are reusing the database.

@@ -33, +35 @@

  The following assumes that you are using the default in-memory H2 database. To connect tika-eval to your own db via jdbc see TikaEvalJdbc.

  == Single Output from One Tool (Profile) ==
+ '''NOTE:''' assume the original input files are in a directory named `input_docs` and that the text extracts are written to the `extracts` directory, with each extract file having the same sub-directory path and same file name with '.json' or '.txt' appended to it.
+
-  1. Create a directory of extract files that mirrors your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name or they may be the !RecursiveParserWrapper's '.json' representation from tika-app's '-J -t' option.
+  1. Create a directory of extract files that mirrors your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name or they may be the !RecursiveParserWrapper's '.json' representation: `java -jar tika-app-X.Y.jar -J -t -i input_docs -o extracts`
   2. Profile the directory of extracts and create a local H2 database:
- `java -jar tika-eval-X.Y.jar Profile -extracts json -db profiledb`
+ `java -jar tika-eval-X.Y.jar Profile -extracts extracts -db profiledb`

-  3.#3 Write reports from the database:
+  3. Write reports from the database:
  `java -jar tika-eval-X.Y.jar Report -db profiledb`

  You'll have a directory of .xlsx reports under the "reports" directory.

  == Comparing Output from Two Tools/Settings (Compare) ==
+ '''NOTE:''' assume the original input files are in a directory named `input_docs` and that the text extracts from tool A are written to the `extractsA` directory and the extracts from tool B are written to `extractsB`.

-  1. Create two directories of extract files that mirror your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name or they may be the !RecursiveParserWrapper's '.json' representation from tika-app's '-J -t' option.
+  1. Create two directories of extract files that mirror your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name or they may be the !RecursiveParserWrapper's '.json' representation.
   2. Compare the extract directory A with extract directory B and write results to a local H2 database:
- `java -jar tika-eval-X.Y.jar Compare -extractsA tika_1_14 -extractsB tika_1_15 -db comparisondb`
+ `java -jar tika-eval-X.Y.jar Compare -extractsA extractsA -extractsB extractsB -db comparisondb`

   3.#3 Write reports from the database:
  `java -jar tika-eval-X.Y.jar Report -db comparisondb`

@@ -58, +63 @@

  == Investigating the Database ==

-  1. Fire up the H2 localhost server:
+  1. Launch the H2 localhost server:
  `java -jar tika-eval-X.Y.jar StartDB` -- this calls `java -cp ... org.h2.tools.Console -web`

-  2.#2 Navigate a browser to `http://localhost:8082` and enter the jdbc connector code followed by the '''full path''' to your db file:
+  2. Navigate a browser to `http://localhost:8082` and enter the jdbc connector code followed by the '''full path''' to your db file:
  `jdbc:h2:/C:/users/someone/mystuff/tika-eval/comparisondb`

  If your reaction is: "You call this a database?!", please open tickets and contribute to improving the structure.
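
The Profile and Compare steps both assume that each extract directory mirrors the input directory, with '.json' or '.txt' appended to every file name. A minimal sketch of a pre-flight check for that layout follows. The function name `find_missing_extracts` is hypothetical and not part of tika-eval; the directory names `input_docs` and `extracts` are the ones assumed in the notes above.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of tika-eval): print every input file that has
# no matching extract. An extract "matches" when it sits at the same relative
# path under the extracts directory with '.json' or '.txt' appended.
find_missing_extracts() {
    input_dir=${1%/}      # drop one trailing slash so prefix stripping works
    extracts_dir=${2%/}
    # Walk every regular file under the input directory.
    find "$input_dir" -type f | while IFS= read -r src; do
        rel=${src#"$input_dir"/}   # path relative to the input directory
        if [ ! -f "$extracts_dir/$rel.json" ] && [ ! -f "$extracts_dir/$rel.txt" ]; then
            printf '%s\n' "$rel"
        fi
    done
}

# Example call, guarded so it does nothing when the directories are absent.
if [ -d input_docs ] && [ -d extracts ]; then
    find_missing_extracts input_docs extracts
fi
```

An empty result suggests the extract directory is complete. For Compare mode, the same check could be run once against `extractsA` and once against `extractsB`. File names containing newlines are not handled by this sketch.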
