On Thursday 13 February 2003 07:49 am, Christoph Kiehl wrote:
> Tatu Saloranta wrote:
> > - Stemming can only be done for prefix queries (what is stem of,
> > say, "h�*er"?), and even then it might not produce stem one would
> > want. For example, for prefix query "men*" might be 'stemmed' to
> > "man*", and user might be perplexed at why documents with
> > words like "meningitis" and "menstrual" did not match (ok, that is
> > a contrived example, but hope you get the idea).
>
> Good point. It's really amazing how different and complex languages are
> ;)

Given that this is the case, I don't think it's possible to come up with a solution that will cover every case. That said, I believe it is still worthwhile to try to do something reasonable that covers most cases.

The company I work for has public text-searchable websites in the following languages: English, Danish, Spanish, French, Dutch, Norwegian, Finnish, and Swedish. The approach we took, as I mentioned in an earlier mail, was to stem only prefix queries and "suffix" queries (of the form *someText). In these cases, we don't pass the wildcard character to the stemmer, and we use the stemmed result only if it is a single word.

We didn't have time to analyze all the stemming possibilities of each language or how our wildcard policy might perform in all cases. Instead, we just threw it out there, had the native speakers run their QA, and saw what happened. It turns out that this wildcard policy works well for us: the users tend to get the results they expect.

Whatever solution falls out of this argument, I just wanted to mention what is working for us. I'm thinking that adding a suffix term notion to QueryParser.jj, parallel to the existing prefix term, creating subclassable methods to handle both, and maybe providing a subclass that performs the imperfect stemming solution described above might be enough to please a lot of users.

DaveB
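
For illustration, here is a minimal sketch of the wildcard policy described in this mail: stem only prefix (someText*) and suffix (*someText) terms, keep the wildcard away from the stemmer, and fall back to the original term unless the stemmer returns a single word. The class WildcardStemPolicy, its rewrite method, and the toyStem stand-in are hypothetical names made up for this example. They are not part of Lucene or QueryParser.jj, and a real setup would plug in a language-specific stemmer instead of toyStem.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper for illustration only; not a Lucene class. */
final class WildcardStemPolicy {

    private WildcardStemPolicy() {}

    /**
     * Rewrites a single query term according to the policy.
     * The stemmer argument is an assumption: any function from word to stem.
     */
    static String rewrite(String term, UnaryOperator<String> stemmer) {
        int first = term.indexOf('*');
        // Require exactly one '*', no '?', and some text besides the wildcard.
        if (first < 0 || first != term.lastIndexOf('*')
                || term.indexOf('?') >= 0 || term.length() < 2) {
            return term;
        }
        boolean prefixQuery = term.endsWith("*");   // someText*
        boolean suffixQuery = term.startsWith("*"); // *someText
        if (!prefixQuery && !suffixQuery) {
            return term; // wildcard in the middle: leave untouched
        }
        // Strip the wildcard so the stemmer never sees it.
        String text = prefixQuery
                ? term.substring(0, term.length() - 1)
                : term.substring(1);
        String stemmed = stemmer.apply(text);
        // Use the stemmed form only if it is a single, non-empty word.
        if (stemmed == null || stemmed.isEmpty() || !isSingleWord(stemmed)) {
            return term;
        }
        return prefixQuery ? stemmed + "*" : "*" + stemmed;
    }

    private static boolean isSingleWord(String s) {
        return s.chars().noneMatch(Character::isWhitespace);
    }

    /** Toy stand-in for a real stemmer: drops one trailing "s". */
    static String toyStem(String word) {
        return word.length() > 1 && word.endsWith("s")
                ? word.substring(0, word.length() - 1)
                : word;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"cars*", "*cars", "c*rs", "cars"}) {
            System.out.println(term + " -> " + rewrite(term, WildcardStemPolicy::toyStem));
        }
    }
}
```

In a QueryParser subclass, something like rewrite would presumably be called from the overridable prefix (and, if added, suffix) handling methods before the query object is built. That wiring is left out here because it depends on how those hooks end up being defined.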
