Teruhiko Kurosaka wrote:
Andrzej,
Thank you for the explanation.
No, in this case, if "web" and "services" were added to
common-grams.utf8, the result would look like:
web|web-services, services|services-is, cool
where | marks tokens indexed at the same position in the index.
I guess you meant common-terms.utf8 instead?
Yes, sorry.
If so, Lucene indexes pairs of words that include
"a", "and", "for", etc., which are usually regarded as stop words
and simply thrown away by many search engines? That
is amazing.
None of the major search engines just throws away stop words ... Stop
words are needed to match phrase queries. Instead, search engines try to
optimize these cases that involve stop words, so that they take less
processing.
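
As a rough illustration of one such optimization, here is a minimal sketch
of the common-grams idea discussed above. It is written in Python for
brevity and is not the actual Nutch or Lucene code. The function names
(common_grams, render), the "-" separator and the rule used here (emit a
bigram only when a common term is followed by another token) are
assumptions made for the example. The output uses the same notation as
above, where | marks terms indexed at the same position.

```python
def common_grams(tokens, common_terms, sep="-"):
    """Return one list of terms per token position.

    Every token is kept at its own position. If the token is a common
    term and has a successor, the bigram "token-next" is added at the
    same position (hypothetical, simplified rule for this sketch).
    """
    positions = []
    for i, tok in enumerate(tokens):
        terms = [tok]
        if tok in common_terms and i + 1 < len(tokens):
            terms.append(tok + sep + tokens[i + 1])
        positions.append(terms)
    return positions


def render(positions):
    """Format positions with | between terms that share a position."""
    return ", ".join("|".join(terms) for terms in positions)


# The stop word "the" is kept, and the pair "the cat" becomes one term.
print(render(common_grams(["the", "cat", "sat"], {"a", "the"})))
# prints: the|the-cat, cat, sat
```

With an index built this way, a phrase query such as "the cat" could in
principle be answered by looking up the single, much rarer term the-cat
instead of intersecting the very long posting list for "the".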
Please conduct the following experiment: go to Google, and run these
queries:
1. a cat
2. the cat
3. cat the
4. "the cat"
And please note the estimated total hits, and also the first couple of hits.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com