I wasn't sure this: Instead add the stopwords to the analyzer that > you pass to MoreLikeThis. That way you can ensure that the analyzer > applies the stopword list before stemming
would work, because I don't want to provide all the variants of the stopword list-- if I do this, only the one provided will be removed, correct? Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMAIL PROTECTED] Mark Miller <[EMAIL PROTECTED]> wrote on 10/15/2007 10:37:22 AM: > Sounds right to me. > > The other option I think you have is to not use the MoreLikeThis > stopword functionality. Instead add the stopwords to the analyzer that > you pass to MoreLikeThis. That way you can ensure that the analyzer > applies the stopword list before stemming (The MoreLikeThis stopword > removal is implemented so that stopwords are removed after stemming). > Then you just have to add 'developer' to the stop list, and you can > forget about handling stemmed forms. > > Your method should also work though. > > - Mark > > Donna L Gresh wrote: > > Could those "in the know" comment on my current understanding of stemming > > and stopwords using the snowball analyzer? > > > > In my application, I am using the MoreLikeThis class to find similar > > documents to an input "text blob". There are words in the input text blob > > which are "uninteresting" for my application, so I create a list of these > > words. These words are "uninteresting" no matter what their tense or > > usage, for example, "develop", "developing", "developed", and "developer" > > are all uninteresting and I do not want them included in the search query > > created by the MoreLikeThis class. > > > > My index documents are stemmed using the Snowball analyzer. I do not use > > any stopwords when the documents are indexed (as I would like the choice > > of stopwords to be under user control at search time). > > > > I would like the user to be able to provide to the search application a > > list of "uninteresting" words, and for obvious reasons would like to force > > them to provide only, say, "developer" and have the application understand > > that all variants should be ignored (and I don't want to force them to try > > to guess what the stemmed version of "developer" is). > > > > My first try was to use MoreLikeThis with the Snowball analyzer and a > > simple list of unstemmed stopwords (MoreLikeThis.setAnalyzer and > > MoreLikeThis.setStopWords). However, it appears that the stopwords > > provided to the MoreLikeThis class are compared in an exact way to the > > token stream output by the Snowball filter (where the words have been > > stemmed), so "developer" will not match anything, and all variants pass > > through. Even if I provide the list of unstemmed stopwords to the snowball > > analyzer instead, they are used "as-is" with no stemming performed, so > > "developer" will not remove "developed". > > > > Apparently the following is necessary for my application: > > Construct a snowball analyzer with no stopwords. Use the unstemmed > > stopword list with the analyzer to construct a stemmed version of the set > > of stopwords. Use this set of stemmed stopwords as the stopwords input to > > the MoreLikeThis class (where the tokens are compared to the stemmed > > versions after been output from the Snowball analyzer). > > > > Is my understanding correct? > > > > Donna > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] >
