Thanks for the suggestions. I think Erick is correct as well. I'll let the
customer decide.
Here's an updated list. FYI: the minStem column was the English Minimal
Stemmer; I changed the label. It's interesting to see where the minimal
stemmer and Porter agree (and KStemmer doesn't). You may also find the "dog"
examples interesting. I also found the "invest*" list entertaining.
original      porter        kstem         EngMinStem
------------  ------------  ------------  ------------
country       countri       country       country
countries     countri       country       country
country's     country'      country's     country'
run           run           run           run
runs          run           runs          run
running       run           running       running
read          read          read          read
reading       read          reading       reading
reader        reader        reader        reader
association   associ        association   association
associate     associ        associate     associate
listing       list          list          listing
water         water         water         water
watered       water         water         watered
sure          sure          sure          sure
surely        sure          surely        surely
invest        invest        invest        invest
investing     invest        invest        investing
investment    invest        investment    investment
investments   invest        investment    investment
invests       invest        invest        invest
investor      investor      invest        investor
invester      invest        invest        invester
investors     investor      invest        investor
investers     invest        invest        invester
organization  organ         organization  organization
organize      organ         organize      organize
organic       organ         organic       organic
generous      gener         generous      generous
generic       gener         generic       generic
dog           dog           dog           dog
dog's         dog'          dog's         dog'
dogs          dog           dogs          dog
dogs'         dog           dogs          dog
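For anyone who wants to see which words each stemmer lumps together, here is a
minimal Python sketch. It does not run any stemmer; it only regroups the
"invest*" rows copied from the table above by their stem. The names ROWS,
COLUMNS, and conflation_classes are made up for this illustration.

```python
from collections import defaultdict

# Column order matches the table: the original word, then one stem per stemmer.
COLUMNS = ("original", "porter", "kstem", "EngMinStem")

# The "invest*" rows, copied by hand from the table (not produced by this code).
ROWS = [
    ("invest",      "invest",   "invest",     "invest"),
    ("investing",   "invest",   "invest",     "investing"),
    ("investment",  "invest",   "investment", "investment"),
    ("investments", "invest",   "investment", "investment"),
    ("invests",     "invest",   "invest",     "invest"),
    ("investor",    "investor", "invest",     "investor"),
    ("invester",    "invest",   "invest",     "invester"),
    ("investors",   "investor", "invest",     "investor"),
    ("investers",   "invest",   "invest",     "invester"),
]


def conflation_classes(rows, stemmer):
    """Map each stem to the list of original words that were reduced to it."""
    col = COLUMNS.index(stemmer)
    classes = defaultdict(list)
    for row in rows:
        classes[row[col]].append(row[0])
    return dict(classes)


if __name__ == "__main__":
    for stemmer in COLUMNS[1:]:
        print(stemmer)
        for stem, words in conflation_classes(ROWS, stemmer).items():
            print(f"  {stem:<12} <- {', '.join(words)}")
```

Words that land in the same group would match each other at search time, so
fewer, larger groups mean a more aggressive stemmer.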
Now, if someone would answer my question on the Solr list ("Custom Solr
Indexer/Search"), my day would be complete ;-).
Thanks for the continued help.
Scott
-----Original Message-----
From: Tom Burton-West [mailto:[email protected]]
Sent: Thursday, November 15, 2012 11:06 AM
To: [email protected]
Subject: Re: Which stemmer?
I agree with Erick that you probably need to give your client a list of
concrete examples, and perhaps to explain the trade-offs.
All stemmers both overstem and understem. Understemming means that some
forms of a word won't be matched. For example, without stemming, a search
for "dogs" would not retrieve documents containing the word "dog".
Overstemming means that unrelated words are reduced to the same stem, so a
search for one also retrieves documents about the other.
Generally there is a precision/recall tradeoff where reducing understemming
increases overstemming. The problem with aggressive stemmers like the Porter
stemmer is that they overstem.
The original Porter stemmer, for example, would stem "organization" and
"organic" both to "organ", and "generalization", "generous", and "generic" to
"gener".
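To make the overstem/understem distinction concrete, here is a rough Python
sketch that counts both kinds of error for a handful of words. The Porter and
KStem stems are copied from the table earlier in this thread. The CONCEPT
mapping is my own assumption about which words "should" match each other, and
stemming_errors is a hypothetical helper, not part of Lucene or Solr.

```python
from itertools import combinations

# Stems copied from the table earlier in this thread (not computed here).
PORTER = {
    "dog": "dog", "dogs": "dog",
    "organization": "organ", "organic": "organ",
    "generous": "gener", "generic": "gener",
}
KSTEM = {
    "dog": "dog", "dogs": "dogs",
    "organization": "organization", "organic": "organic",
    "generous": "generous", "generic": "generic",
}

# Assumption: a hand-made judgement of which words belong together.
CONCEPT = {
    "dog": "dog", "dogs": "dog",
    "organization": "organization", "organic": "organic",
    "generous": "generous", "generic": "generic",
}


def stemming_errors(stems, concept):
    """Return (overstemmed, understemmed) word pairs for one stemmer."""
    overstemmed, understemmed = [], []
    for a, b in combinations(sorted(stems), 2):
        same_stem = stems[a] == stems[b]
        same_concept = concept[a] == concept[b]
        if same_stem and not same_concept:
            overstemmed.append((a, b))   # unrelated words conflated
        elif same_concept and not same_stem:
            understemmed.append((a, b))  # related words kept apart
    return overstemmed, understemmed


if __name__ == "__main__":
    for name, stems in (("porter", PORTER), ("kstem", KSTEM)):
        over, under = stemming_errors(stems, CONCEPT)
        print(name, "overstemmed:", over, "understemmed:", under)
```

On this tiny sample the Porter column has only overstemming errors and the
KStem column has only an understemming one, which is the tradeoff in miniature.
A real evaluation would need a much larger, carefully judged word list.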
For background on the Porter stemmers and lots of examples, see these pages:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
http://snowball.tartarus.org/algorithms/english/stemmer.html
This paper on the Kstem stemmer lists cases where the Porter stemmer understems
or overstems and explains the logic of Kstem: "Viewing Morphology as an
Inference Process" (Krovetz, R., Proceedings of the Sixteenth Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval, 191-203, 1993).
http://ciir.cs.umass.edu/pubfiles/ir-35.pdf
Tom
http://www.hathitrust.org/blogs/large-scale-search