I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.

All stemmers both overstem and understem. Understemming means that some
forms of a word won’t be matched. For example, with no stemming at all,
a search for “dogs” would not retrieve documents containing the word “dog”.
Overstemming is the opposite problem: unrelated words are reduced to the
same stem, so a search for one also retrieves documents about the other.
Generally there is a precision/recall trade-off, where reducing understemming
increases overstemming. The problem with aggressive stemmers like the
Porter stemmer is that they overstem.
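
The trade-off above can be sketched with two toy stemmers. This is only an
illustration: light_stem and aggressive_stem are hypothetical suffix
strippers made up for this example, not the Porter or Kstem algorithms, and
the word list is arbitrary. Grouping words by stem shows which terms a
search would treat as the same.

```python
from collections import defaultdict


def light_stem(word):
    """Toy light stemmer (hypothetical): strips only a plural "s"."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def aggressive_stem(word):
    """Toy aggressive stemmer (hypothetical): also strips a few suffixes."""
    word = light_stem(word)
    for suffix in ("ization", "ize", "ic"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflation_classes(stem, words):
    """Group words by the stem they reduce to."""
    classes = defaultdict(list)
    for w in words:
        classes[stem(w)].append(w)
    return {s: sorted(ws) for s, ws in classes.items()}


WORDS = ["dog", "dogs", "organize", "organization", "organic"]

light = conflation_classes(light_stem, WORDS)
aggressive = conflation_classes(aggressive_stem, WORDS)

# Light stemming understems: "organize" and "organization" stay apart.
print(light)
# Aggressive stemming overstems: "organic" joins the "organ" class.
print(aggressive)
```

A table like this, built from the client's own vocabulary and the actual
stemmers under consideration, is one way to produce the list of concrete
examples.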

The original Porter stemmer, for example, would stem “organization” and
“organic” both to “organ”, and “generalization”, “generous” and “generic”
all to “gener”.

For background on the Porter stemmers and lots of examples see these pages:

http://snowball.tartarus.org/algorithms/porter/stemmer.html

http://snowball.tartarus.org/algorithms/english/stemmer.html

This paper on the Kstem stemmer lists cases where the Porter stemmer
understems or overstems and explains the logic of Kstem: "Viewing
Morphology as an Inference Process" (Krovetz, R., Proceedings of the
Sixteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 191-203, 1993).

http://ciir.cs.umass.edu/pubfiles/ir-35.pdf

Tom

http://www.hathitrust.org/blogs/large-scale-search
