Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "TikaEval" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEval?action=diff&rev1=8&rev2=9

  = More detailed usage =
  == Evaluating Success via Common Words ==
- In the absence of ground truth, it is often helpful to count the number of common words that were extracted. Tilman Hausherr originally recommended this metric.
+ In the absence of ground truth, it is often helpful to count the number of common words that were extracted (see TikaEvalMetrics for a discussion of this).
- For our initial collaboration with PDFBox, we found a list of common English words and removed those that had fewer than four characters.
- The intuition is that if tool A extracts 500, but tool B extracts 1,000, there is ''some'' information that tool B may have done a better job.
  "Common words" are specified per language in the "resources/commonwords" directory. Each file is named for the language code, e.g. 'en', and each file is a UTF-8 text file with one word per line.
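
To make the file layout described above concrete, here is a minimal Python sketch that reads a per-language common-words file (UTF-8, one word per line, named for the language code) and counts how many tokens of an extracted text appear in it. The function names, the lowercasing, and the simple `\w+` tokenization are assumptions made for illustration only; this is not the tika-eval implementation, which may tokenize and normalize differently.

```python
import re
import tempfile
from pathlib import Path


def load_common_words(directory, lang):
    """Load the common-words file for `lang` (hypothetical helper).

    Assumes the layout described on the page: a UTF-8 text file named
    for the language code, with one word per line.
    """
    path = Path(directory) / lang
    with path.open(encoding="utf-8") as handle:
        # Lowercasing and skipping blank lines are assumptions of this sketch.
        return {line.strip().lower() for line in handle if line.strip()}


def count_common_words(text, common_words):
    """Count tokens of `text` that appear in `common_words`.

    Tokenization via \\w+ is an assumption, not tika-eval's analyzer.
    """
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in common_words)


if __name__ == "__main__":
    # Build a throwaway "commonwords" directory with a tiny 'en' file.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "en").write_text("that\nwith\nfrom\n", encoding="utf-8")
        words = load_common_words(tmp, "en")
        print(count_common_words("That came from here, with that.", words))
```

Running the same count over the output of two extraction tools on the same file gives the kind of comparison the page describes: a higher count is a hint, not proof, that one tool extracted more usable text.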
