Why does Nutch use n-grams in analysis?

Teruhiko Kurosaka Wed, 28 Dec 2005 10:17:08 -0800

I thought n-grams were used only for language identification, but
I see they are also used in analysis.


In the source code of CommonGrams and its API doc:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html
I see that (tokens representing) n-grams are "inserted" into the token stream.

Does this mean that a sentence such as "Web Services is cool" is
represented by the token sequence {"Web", "Services", "Web Services",
"cool"} (with "is" dropped as a stop word), assuming "web services"
is a commonly used bi-gram? Or is it something else?
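
To make my guess concrete, here is a rough sketch in plain Java. It is
not the CommonGrams code and does not use any Nutch or Lucene classes;
the class name BigramGuess, the stop word list and the bi-gram list are
all made up to match the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class BigramGuess {
    // Hypothetical lists, chosen only to match the example sentence.
    static final Set<String> STOP_WORDS = Set.of("is");
    static final Set<String> COMMON_BIGRAMS = Set.of("web services");

    // Returns the token sequence I am guessing at: single words with
    // stop words dropped, plus a bi-gram token inserted right after
    // the second word of each listed bi-gram.
    static List<String> analyze(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String word : words) {
            // Keep the single word unless it is a stop word.
            if (!STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                out.add(word);
            }
            // Insert the bi-gram formed with the previous word, if listed.
            if (prev != null) {
                String bigram = prev + " " + word;
                if (COMMON_BIGRAMS.contains(bigram.toLowerCase(Locale.ROOT))) {
                    out.add(bigram);
                }
            }
            prev = word;
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Web, Services, Web Services, cool]
        System.out.println(analyze("Web Services is cool"));
    }
}
```

If the real behavior differs from this (for example, if the bi-grams
are built from the stop words themselves), that is exactly what I would
like to understand.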

Why does Nutch do this?

-kuro
