Teruhiko Kurosaka wrote:

Andrzej,
Thank you for the explanation.

No, in this case, if "web" and "services" were added to common-grams.utf8, the result would look like:

web|web-services, services|services-is, cool

where | marks tokens indexed at the same position in the index.
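To make the notation concrete, here is a minimal Python sketch of the idea. It is a simplified model, not Nutch's actual CommonGrams code: the input sentence "web services is cool", the function names, and the rule "emit a bigram whenever the leading word is a common term" are all assumptions made for illustration. This sketch keeps every single-word token, so its output may differ in detail from the line quoted above.

```python
def common_grams(tokens, common_terms):
    """Return one list of terms per token position.

    Assumption (illustrative only): a bigram "w1-w2" is added at the
    position of w1 whenever w1 is listed as a common term.
    """
    positions = []
    for i, tok in enumerate(tokens):
        terms = [tok]  # the plain token is always kept in this sketch
        if tok in common_terms and i + 1 < len(tokens):
            terms.append(f"{tok}-{tokens[i + 1]}")
        positions.append(terms)
    return positions


def render(positions):
    """Join terms sharing a position with '|' and positions with ', '."""
    return ", ".join("|".join(terms) for terms in positions)


# Hypothetical input sentence and common-terms list.
tokens = "web services is cool".split()
print(render(common_grams(tokens, {"web", "services"})))
```

A phrase query such as "web services" can then be looked up as the single term web-services instead of intersecting the position lists of two very frequent terms.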

I guess you meant common-terms.utf8 instead?

Yes, sorry.

If so, Lucene indexes pairs of words that include
"a", "and", "for", etc., which are usually regarded as stop words
and are simply thrown away by many search engines? That
is amazing.

None of the major search engines just throws away stop words ... Stop words are needed to match phrase queries. Instead, search engines try to optimize the cases that involve stop words, so that they require less processing.
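A small Python sketch of why phrase matching needs stop words. It is a toy model with a hypothetical stop list and two made-up documents, not a description of how any particular search engine works.

```python
STOP_WORDS = {"a", "the"}  # hypothetical stop list for this example


def tokenize(text, drop_stop_words):
    """Lowercase and split on whitespace, optionally dropping stop words."""
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def phrase_match(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


def search(docs, phrase, drop_stop_words):
    """Return the indexes of documents that contain the phrase."""
    query = tokenize(phrase, drop_stop_words)
    return [i for i, doc in enumerate(docs)
            if phrase_match(tokenize(doc, drop_stop_words), query)]


# Made-up documents for illustration.
docs = ["the cat sat down", "my cat the great"]

for drop in (False, True):
    print("stop words dropped:", drop)
    print('  "the cat" ->', search(docs, "the cat", drop))
    print('  "cat the" ->', search(docs, "cat the", drop))
```

With stop words kept, the two phrases match different documents. Once stop words are dropped, both queries collapse to the single term cat and can no longer be told apart.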

Please conduct the following experiment: go to Google, and run these queries:

1. a cat
2. the cat
3. cat the
4. "the cat"

And please note the estimated total hits, and also the first couple of hits.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

