The "TikaEvalMetrics" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEvalMetrics

=tika-eval metrics=

=Profiling Metrics=

==Common Words==
For a given language, count the number of "common words" extracted. If your documents generally contain natural language (e.g., not just parts lists or numbers), then the number of common words extracted divided by the number of alphabetic words may offer some insight into how "languagey" the extracted text is.
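The ratio described above can be sketched as follows (a minimal illustration with hypothetical names; tika-eval's actual classes are not shown here):

```java
// Hypothetical helper illustrating the "common words" ratio described above:
// common words extracted / alphabetic words extracted.
public class CommonWordsRatio {
    public static double ratio(long commonWordCount, long alphabeticWordCount) {
        if (alphabeticWordCount == 0) {
            return 0.0; // avoid division by zero on empty/non-textual output
        }
        return (double) commonWordCount / alphabeticWordCount;
    }
}
```

A higher ratio suggests the extracted text looks more like natural language in the detected language.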

Tilman Hausherr originally recommended this metric for comparing the output of different versions of PDFBox. For our initial collaboration with PDFBox, we found a list of common English words and removed those with fewer than four characters.  The intuition is that if tool A extracts 500 common words but tool B extracts 1,000, there is ''some'' evidence that tool B did a better job.

===Implementation Details===
For now, we've set up an Analyzer chain in Lucene that:
 * Filters out tokens that don't contain an alphabetic or ideographic character.
 * Maps URLs to "__url__" and emails to "__email__" (we don't want to penalize 
documents for containing URLs and email addresses).
 * Requires that a token be at least 4 characters long ''unless'' it is 
comprised entirely of CJK characters.
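The filtering rules above can be sketched in plain Java (a simplified, self-contained illustration; the real implementation is a Lucene Analyzer chain, and the URL/email patterns here are hypothetical stand-ins for the analyzer's tokenizers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the token-filtering rules described above (not the actual Lucene chain).
public class CommonWordsTokenFilter {

    // True if the code point belongs to a CJK-related script.
    static boolean isCJK(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
                || s == Character.UnicodeScript.HIRAGANA
                || s == Character.UnicodeScript.KATAKANA
                || s == Character.UnicodeScript.HANGUL;
    }

    static boolean hasAlphabeticOrIdeographic(String t) {
        return t.codePoints().anyMatch(
                cp -> Character.isAlphabetic(cp) || Character.isIdeographic(cp));
    }

    static boolean allCJK(String t) {
        return !t.isEmpty() && t.codePoints().allMatch(CommonWordsTokenFilter::isCJK);
    }

    public static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            // Map URLs and emails to placeholders (simplified regexes for illustration).
            if (lower.matches("https?://\\S+")) { out.add("__url__"); continue; }
            if (lower.matches("\\S+@\\S+\\.\\S+")) { out.add("__email__"); continue; }
            // Drop tokens with no alphabetic or ideographic character.
            if (!hasAlphabeticOrIdeographic(lower)) { continue; }
            // Require length >= 4 unless the token is entirely CJK.
            if (lower.codePointCount(0, lower.length()) < 4 && !allCJK(lower)) { continue; }
            out.add(lower);
        }
        return out;
    }
}
```

For example, filtering `["the", "1234", "apple", "http://a.b", "中文"]` drops "the" (too short) and "1234" (no alphabetic character), maps the URL to its placeholder, and keeps "apple" and the all-CJK token.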

But wait, what's a word for non-whitespace (e.g. Chinese/Japanese) languages?  
We've followed the common practice for non-whitespace languages of tokenizing 
into bigrams...this is linguistically abhorrent, but it is mildly useful, if 
inaccurate, for our purposes.
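Bigram tokenization of a CJK run can be sketched as follows (a minimal illustration, not the actual Lucene CJK tokenizer):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of CJK bigram tokenization: each pair of adjacent characters
// becomes one "word", e.g. a three-character run yields two bigrams.
public class CJKBigrams {
    public static List<String> bigrams(String cjkRun) {
        List<String> out = new ArrayList<>();
        int[] cps = cjkRun.codePoints().toArray();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2)); // two adjacent code points
        }
        return out;
    }
}
```

So "中文字" tokenizes to ["中文", "文字"]; overlapping bigrams are why this over-counts relative to true word segmentation.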

'''Benefits''': Easy to implement.
'''Risks''': 
 * If an OCR engine relies solely on dictionary lookup and does not allow for 
out-of-vocabulary terms, the generated text will contain only known words, and 
the "common words" score will be incorrectly high.  Yes, the text contains 
known words, but they do '''not''' reflect the correct text.
 * If a document contains part numbers or other non-natural language tokens, 
then this metric will not accurately reflect success.
 * Multi-lingual documents can cause challenges for interpretation.  If the 
language id component "detects" English, even though the majority of the 
document is in Chinese, this metric will be misleading.


=Comparison Metrics=
