I've been experimenting with the Porter and Snowball stemmers.  It seems to me that 
one of the most valuable benefits these provide is the capability to generalize phrase 
terms.  As a very simple example, without the stemmer, I might need to include three 
phrase terms in my query: "north korea", "north korean", "north koreans".  But with 
the stemmer only one will suffice.  To me, that's a huge advantage.  (For non-phrases, 
the advantage doesn't seem to be so great, because much the same effect can be 
achieved with wildcards.)
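
To make the phrase example concrete, here is a minimal Python sketch. It assumes a made-up three-rule suffix stripper (`toy_stem`) and a naive whitespace tokenizer. Neither is the Porter or Snowball algorithm, and a real stemmer may treat these particular words differently, so it is worth checking its actual output. The helper names and sample documents are hypothetical.

```python
def toy_stem(word):
    """Hypothetical three-rule suffix stripper. NOT the Porter algorithm."""
    word = word.lower()
    if word.endswith("ans"):
        return word[:-2]          # "koreans" -> "korea"
    if word.endswith("an"):
        return word[:-1]          # "korean"  -> "korea"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]          # "talks"   -> "talk"
    return word


def analyze(text):
    """Split on whitespace and stem each token."""
    return [toy_stem(token) for token in text.split()]


def phrase_matches(query, document):
    """True if the stemmed query tokens occur consecutively in the document."""
    q, d = analyze(query), analyze(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Sample documents invented for illustration.
docs = [
    "Talks with North Korea resumed",
    "A North Korean delegation arrived",
    "North Koreans voted",
    "South Korea responded",
]

# One phrase query stands in for all three surface forms.
hits = [doc for doc in docs if phrase_matches("north korea", doc)]
```

Under these toy rules, the single query "north korea" matches the first three documents and skips the fourth, which is the generalization effect described above.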

But there seems to be a price that you also pay, in that discrimination may be 
adversely affected.  If you want to discriminate between two terms that the stemmer 
views as derived from the same root, you're out of luck (I think).  The problem with 
this is that you may start with a set of terms that don't have this problem, but over 
time as new content is added to the index, such problems may gradually get introduced 
- often unpredictably.  And to the best of my (admittedly limited) knowledge, once 
you've indexed using a stemmer, there's no way to override it in specific instances.
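
One commonly suggested way to keep discrimination available is to index each token twice, once stemmed and once as written, and pick the field at query time. The sketch below is only an illustration under assumptions: `toy_stem` is a made-up suffix stripper (not Porter or Snowball), the index is a plain in-memory dictionary, and the field names "exact" and "stemmed" are hypothetical. It uses "universe" and "university", a pair often cited as being conflated by Porter-style stemmers.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical suffix stripper. NOT the Porter algorithm."""
    for suffix in ("ity", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Index every token under both an exact field and a stemmed field."""
    index = {"exact": defaultdict(set), "stemmed": defaultdict(set)}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index["exact"][token].add(doc_id)
            index["stemmed"][toy_stem(token)].add(doc_id)
    return index


def search(index, term, exact=False):
    """Look up a single term in the exact or the stemmed field."""
    term = term.lower()
    if exact:
        return set(index["exact"].get(term, set()))
    return set(index["stemmed"].get(toy_stem(term), set()))


# Sample documents invented for illustration.
docs = {1: "the expanding universe", 2: "the state university"}
index = build_index(docs)

stemmed_hits = search(index, "university")            # both documents
exact_hits = search(index, "university", exact=True)  # only document 2
```

The cost is a larger index, and the exact field has to be populated at indexing time, so content that was indexed with only the stemmed form would need to be reindexed.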

I'd appreciate any comments or thoughts on the above.

Regards,

Terry
 
