On Thursday 13 February 2003 05:06, Mailing Lists Account wrote:
> Doug Cutting wrote:
> > Mailing Lists Account wrote:
..
> > That's because Google and most internet search engines never do any
> > stemming.
> >
> > Doug
>
> I didn't know that. Thanks.
>
> Generally speaking, are there any advantages not to apply the stemmer ?

Yes, I suspect there are. There are two ways to think about this.

First, Google, arguably the best general-purpose search engine in the world at the moment, does not use stemming. That in itself suggests stemming may not be very useful for general indexing and searching, especially for phrase searches.

Second, for internet search engines (or any other search engine with a massive amount of non-domain-specific data), stemming reduces the accuracy of matching, and with huge data sets that is a real problem. Because stemming makes terms more general, they match more often: instead of, say, 100 matches, you get 10000. It becomes like trying to find a needle in a haystack.

Stemming is probably more useful for reducing the size of the index and improving performance that way. This used to matter more, when memory and performance limitations were stricter than they are nowadays. Also, if you want to do semantic mapping and correlation, stemming is very useful (especially combined with an extensive list of stop words), since minimizing the data sets used for correlation is essential for acceptable performance. I think the usefulness of stop words is closely related to the usefulness of stemming (i.e. more useful in some cases than in others).

> Except for certain keywords,I found use of stemmers helpful.

I suspect this depends a lot on the keywords in question. Unifying plurals and singulars is often helpful, but unifying words like "useful" and "useless" is, well, not very helpful (do they get stemmed to "use", as I would guess, or not?). Similarly, dropping stop words like "with", "without", "no"/"not" may cause a dramatic loss in accuracy (i.e. you get matches for pretty much "opposite" phrases when "not" is dropped by the analyzer).

What do others think?

-+ Tatu +-
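
PS. To make the "useful"/"useless" and "not" concerns concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: `toy_stem`, `analyze`, the suffix list and the stop word list are hypothetical, and this is a naive suffix stripper, not the Porter algorithm or any analyzer that ships with Lucene. Whether a real stemmer conflates these particular words depends on its rule set.

```python
# Toy illustration only: a naive suffix stripper, NOT the Porter stemmer
# and NOT what any real Lucene analyzer does.
SUFFIXES = ("less", "ful", "ing", "s")            # hypothetical rule list
STOP_WORDS = {"with", "without", "no", "not"}     # hypothetical stop list


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text, stem=True, drop_stop_words=True):
    """Lowercase, split on whitespace, then optionally filter and stem."""
    terms = text.lower().split()
    if drop_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    if stem:
        terms = [toy_stem(t) for t in terms]
    return terms


if __name__ == "__main__":
    # Plural and singular are unified, which is usually what you want.
    print(analyze("stemmers"), analyze("stemmer"))
    # Opposite meanings end up as the same term list.
    print(analyze("useful"), analyze("useless"), analyze("not useful"))
    # Dropping stop words makes these two phrases indistinguishable.
    print(analyze("search with stemming"), analyze("search without stemming"))
```

With these made-up rules the index can no longer tell "useful" from "useless", or "with stemming" from "without stemming". Calling `analyze(text, stem=False, drop_stop_words=False)` keeps them apart, at the cost of a larger index and no plural matching.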
