I thought n-grams were used only for language identification, but I see they are used in another area as well.
In the source code of CommonGrams and its API doc (http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html), I see that tokens representing n-grams are "inserted" into the token stream. Does this mean that a phrase such as "Web Services is cool" is represented by the token sequence {"Web", "Services", "Web Services", "cool"} ("is" being dropped as a stop word), assuming "web services" is a commonly used bi-gram? Or does it mean something else? And why does Nutch do this? -kuro
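
P.S. To make the guess above concrete, here is a minimal Python sketch of the behavior I am imagining. It is not taken from the Nutch source and may not match what CommonGrams really does; the stop-word set, the bi-gram list, and the function name are all made up for illustration.

```python
# Hypothetical sketch of the guessed behavior; NOT Nutch's CommonGrams code.
STOP_WORDS = {"is"}                     # assumed stop-word list
COMMON_BIGRAMS = {("web", "services")}  # assumed list of common bi-grams


def guessed_token_stream(text):
    """Return single-word tokens, with a bi-gram token inserted after
    the second word of every pair found in COMMON_BIGRAMS."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        # Drop stop words, keep everything else as a plain token.
        if word.lower() not in STOP_WORDS:
            tokens.append(word)
        # If the previous word and this one form a known bi-gram,
        # insert the combined token right after the second word.
        if i > 0 and (words[i - 1].lower(), word.lower()) in COMMON_BIGRAMS:
            tokens.append(words[i - 1] + " " + word)
    return tokens


print(guessed_token_stream("Web Services is cool"))
```

If this guess is right, the extra "Web Services" token sits alongside the single-word tokens rather than replacing them. Is that what actually happens?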
