The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=2&rev2=3

  = More detailed usage =

  == Evaluating Success via Common Words ==

- In the absence of ground truth, it is often helpful to count the number of common words that were extracted.
+ In the absence of ground truth, it is often helpful to count the number of common words that were extracted. Tilman Hausherr originally recommended this metric.

+ For our initial collaboration with PDFBox, we found a list of common English words and removed those that had fewer than four characters.

- If tool A extracts 500, but tool B extracts 1,000, there is ''some'' information that tool B did a better job.
+ The intuition is that if tool A extracts 500, but tool B extracts 1,000, there is ''some'' information that tool B may have done a better job.

- Tilman Hausherr originally recommended this metric.

  "Common words" are specified per language in the "resources/commonwords" directory. Each file is named for the language code, e.g. 'en', and each file is a UTF-8 text file with one word per line.

  The token processor runs language id against content and then selects the appropriate set of common words for its counts. If there is no common words file for a language, then it backs off to the default list, which is currently hardcoded to 'en'.

- Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'analyzers.json'.
+ Make sure that your common words have gone through the same analysis chain as specified by the Common Words analyzer in 'analyzers.json'!

  == Reading Extracts ==
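The per-language lookup and back-off described under "Evaluating Success via Common Words" can be sketched roughly as below. This is an illustration only, not tika-eval's actual code: the function names `load_common_words` and `count_common_words` are hypothetical, the language code is assumed to come from a separate language-id step, the tokens are assumed to have already passed through the same analysis chain as the word lists, and counting every occurrence (rather than unique words) is an assumption the page does not confirm.

```python
from pathlib import Path
from typing import Dict, Iterable, Set

# The page says the fallback list is currently hardcoded to 'en'.
DEFAULT_LANG = "en"


def load_common_words(directory: Path) -> Dict[str, Set[str]]:
    """Load one word set per file (hypothetical helper).

    Each file is named for its language code (e.g. 'en') and holds
    one word per line in UTF-8, as the page describes.
    """
    words: Dict[str, Set[str]] = {}
    for path in directory.iterdir():
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                # Skip blank lines; strip the trailing newline from each word.
                words[path.name] = {line.strip() for line in handle if line.strip()}
    return words


def count_common_words(
    tokens: Iterable[str], lang: str, common_words: Dict[str, Set[str]]
) -> int:
    """Count tokens found in the list for `lang` (hypothetical helper).

    Backs off to DEFAULT_LANG when no list exists for `lang`.
    Assumption: every occurrence is counted, not just unique words.
    """
    vocabulary = common_words.get(lang, common_words[DEFAULT_LANG])
    return sum(1 for token in tokens if token in vocabulary)
```

Comparing the counts that two extraction tools produce for the same file then gives the rough signal the page describes: a higher count suggests, but does not prove, a better extraction.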
