Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=1&rev2=2

- = Overview of the 'tika-eval' Module=
+ = Overview of the 'tika-eval' Module =
- While not yet available, this page offers a first draft of the documentation for the tika-eval module.
+ While the module is not yet available, this page offers a first draft of the documentation for the tika-eval module.

  The module is intended to enable some comparisons between tools or to gain insight from a single run. This module is designed to be used to help with Tika, but it could be used to evaluate other tools as well.

@@ -27, +27 @@

  1. Create two directories of extract files that mirror your input directory. These files may be UTF-8 text files with '.txt' appended to the original file's name or they may be the !RecursiveParserWrapper's '.json' representation from tika-app's '-J -t' option.

  2. Compare the extract directory A with extract directory B and write results to a local H2 database:
- `java -jar tika-eval.X.Y.jar Profile -extractDirA tika_1_14 -extractDirB tika_1_15 -db comparisondb`
+ `java -jar tika-eval.X.Y.jar Compare -extractDirA tika_1_14 -extractDirB tika_1_15 -db comparisondb`

  3.#3 Write reports from the database:
  `java -jar tika-eval.X.Y.jar Report -db comparisondb`

@@ -46, +46 @@

  = More detailed usage =
+ == Evaluating Success via Common Words ==
+ In the absence of ground truth, it is often helpful to count the number of common words that were extracted.
+ If tool A extracts 500, but tool B extracts 1,000, there is ''some'' information that tool B did a better job.
+ Tilman Hausherr originally recommended this metric.
+ "Common words" are specified per language in the "resources/commonwords" directory.
+ Each file is named for the language code, e.g. 'en', and each file is a UTF-8 text file with one word per line.
+
+ The token processor runs language id against content and then selects the appropriate set of common words for its counts. If there is no common words file for a language, then it backs off to the default list, which is currently hardcoded to 'en'.
+
+ Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'analyzers.json'.
+
+ == Reading Extracts ==
+
+ === alterMetadata ===
+ Let's say you want to compare the output of Tika to another tool that extracts text. You happen to have a directory of .json files for Tika and a directory of UTF-8 .txt files from the other tool.
+
+ 1. If the other tool extracts embedded content, you'd want to concatenate all the content within Tika's .json file for a fair comparison:
+ `java -jar tika-eval.X.Y.jar Compare -extractDirA tika_1_14 -extractDirB tika_1_15 -db comparisondb -alterMetadata concatenate_content`
+
+ 2.#2 If the other tool does not extract embedded content, you'd only want to look at the first metadata object (representing the container file) in the .json file:
+ `java -jar tika-eval.X.Y.jar Compare -extractDirA tika_1_14 -extractDirB tika_1_15 -db comparisondb -alterMetadata first_only`

  == Reports ==
  The module tika-eval comes with a list of reports. However, you might want to generate your own. Each report is specified by sql and a few other configurations in an xml file. See `comparison-reports.xml` and `profile-reports.xml` to get a sense of the syntax.
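
The common-words count described under "Evaluating Success via Common Words" can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not tika-eval's actual code: the function names `load_common_words` and `count_common_words` are hypothetical, the language code is assumed to come from a separate language identifier, and the simple regex tokenizer only stands in for the analysis chain configured in 'analyzers.json'.

```python
import re
from pathlib import Path

# The page says the fallback list is currently hardcoded to 'en'.
DEFAULT_LANG = "en"


def load_common_words(directory):
    """Read one word list per file; each file is named for its language code
    (e.g. 'en') and holds one UTF-8 word per line. Hypothetical helper."""
    lists = {}
    for path in Path(directory).iterdir():
        if path.is_file():
            lines = path.read_text(encoding="utf-8").splitlines()
            lists[path.name] = {w.strip().lower() for w in lines if w.strip()}
    return lists


def count_common_words(text, lang, lists):
    """Count tokens of `text` that appear in the list for `lang`.
    If no list exists for `lang`, back off to the default list."""
    common = lists.get(lang, lists.get(DEFAULT_LANG, set()))
    # Assumption: lowercase + \w+ tokenization; the real analyzer may differ.
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in common)


if __name__ == "__main__":
    # Tiny in-memory lists used only for illustration.
    demo_lists = {"en": {"the", "and", "of"}}
    extract_a = "The cat"
    extract_b = "The cat and the dog of the house"
    print(count_common_words(extract_a, "en", demo_lists))
    print(count_common_words(extract_b, "en", demo_lists))
```

With lists loaded from a directory laid out like "resources/commonwords", running `count_common_words` over the extracts produced by tool A and tool B yields the two totals to compare. A higher total is only a hint that one tool extracted more usable text, in line with the caveat in the section above.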
