Re: Why does Nutch use n-grams in analysis?

Andrzej Bialecki Wed, 28 Dec 2005 12:55:11 -0800

Teruhiko Kurosaka wrote:

Andrzej,
Thank you for the explanation.
No, in this case, if "web" and "services" were added to common-grams.utf8, the result would look like:
web|web-services, services|services-is, cool

where | marks tokens indexed at the same position in the index.
I guess you meant common-terms.utf8 instead?


Yes, sorry.
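
To make the token layout quoted above concrete, here is a minimal sketch in Python (not Nutch's actual Java code). The function names common_grams and render, the "-" separator, and the sample sentence "web services is cool" are assumptions made for illustration. The sketch keeps every single word and, for each word in the common-terms set, adds a two-word gram at the same position. How Nutch treats the remaining words (such as "is") may differ from this simplification.

```python
def common_grams(tokens, common_terms, sep="-"):
    """Return one list per position; tokens in the same list share that position.

    Simplified illustration: a common term is emitted on its own and also
    joined with the word that follows it.
    """
    positions = []
    for i, tok in enumerate(tokens):
        slot = [tok]
        # Only common terms that have a following word produce a gram.
        if tok in common_terms and i + 1 < len(tokens):
            slot.append(tok + sep + tokens[i + 1])
        positions.append(slot)
    return positions


def render(positions):
    """Format positions with '|' between tokens that share a position."""
    return ", ".join("|".join(slot) for slot in positions)


# Hypothetical common-terms set, mirroring the example in this thread.
common = {"web", "services"}
print(render(common_grams("web services is cool".split(), common)))
```

A phrase query such as "web services" can then be answered by looking up the single term web-services instead of intersecting the position lists of two very frequent terms.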

If so, Lucene indexes pairs of words that include
"a", "and", "for", etc. that are usually regarded as stop words
and simply thrown away by many search engines? That
is amazing.

None of the major search engines just throws away stop words ... Stopwords are needed to match phrase queries. Instead, search engines try to optimize these cases that involve stop words, so that they take less processing.
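
The point about phrase queries can be illustrated with a small positional index. This is only a sketch in Python under stated assumptions: build_index, phrase_match, and the two sample documents are hypothetical, and it does not describe how Nutch, Lucene, or any particular search engine is implemented. It shows that when "the" is indexed, "the cat" and "cat the" match different documents, and when "the" is dropped, the two cannot be told apart.

```python
from collections import defaultdict


def build_index(docs, stop_words=frozenset()):
    """Map term -> {doc_id: [positions]}; terms in stop_words are not indexed."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            if tok not in stop_words:
                index[tok][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    """Return sorted doc ids where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    postings = [index.get(t, {}) for t in terms]
    hits = []
    for doc_id, starts in postings[0].items():
        # The k-th term of the phrase must sit k positions after the first.
        if any(all(start + k in p.get(doc_id, ()) for k, p in enumerate(postings))
               for start in starts):
            hits.append(doc_id)
    return sorted(hits)


docs = {1: "the cat sat", 2: "cat the dog"}  # hypothetical sample documents

full = build_index(docs)
print(phrase_match(full, "the cat"))   # only the document with "the cat"
print(phrase_match(full, "cat the"))   # only the document with "cat the"

# With "the" thrown away, the query can only be run as "cat".
stripped = build_index(docs, stop_words={"the"})
print(phrase_match(stripped, "cat"))   # both documents
```

Keeping the stop word costs extra work because its position list is very long, which is the case that indexing word pairs is meant to make cheaper.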

Please conduct the following experiment: go to Google, and run these queries:


1. a cat
2. the cat
3. cat the
4. "the cat"

And please note the estimated total hits, and also the first couple of hits.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
