Hi all,

Maybe we should start using stemming in a different manner and look at it from the perspective of query expansion. If we store the stems in a separate table, we will not have this problem!

So, each token is stored in the index as a term, and each term is stemmed with the appropriate stemmer. Store each stem together with its unstemmed term in a separate index. We could then search using the terms entered: first, find all the terms that match the WildcardQuery. Next, stem the terms that were found. From there, retrieve all the terms related to each of those stems. Finally, search for documents with all the terms retrieved.
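To make that flow a bit more concrete, here is a rough, untested sketch. Everything in it is made up for illustration (the Stemmer interface, the in-memory maps standing in for the separate stem index); in a real implementation the terms would come from Lucene's term dictionary and the expanded set would be OR'ed together, e.g. as TermQuerys inside a BooleanQuery:

import java.util.*;
import java.util.regex.Pattern;

// Rough sketch only: the in-memory maps stand in for the proposed separate
// stem index; nothing here is real Lucene API.
public class StemExpansionSketch {

    /** Hypothetical stemmer hook; in practice the language-specific stemmer. */
    interface Stemmer {
        String stem(String term);
    }

    private final Stemmer stemmer;
    private final Set<String> terms = new TreeSet<>();                    // unstemmed terms, as indexed
    private final Map<String, Set<String>> stemToTerms = new HashMap<>(); // the separate "stem table"

    StemExpansionSketch(Stemmer stemmer) {
        this.stemmer = stemmer;
    }

    /** Index time: store the unstemmed term and remember which stem it belongs to. */
    void addTerm(String term) {
        terms.add(term);
        stemToTerms.computeIfAbsent(stemmer.stem(term), s -> new HashSet<>()).add(term);
    }

    /**
     * Query time: 1) find all indexed terms matching the wildcard pattern,
     * 2) stem each match, 3) collect every term that shares one of those stems.
     */
    Set<String> expand(String wildcardTerm) {
        // Naive wildcard-to-regex translation; good enough for the sketch.
        Pattern p = Pattern.compile(wildcardTerm.replace("*", ".*").replace("?", "."));
        Set<String> expanded = new TreeSet<>();
        for (String term : terms) {
            if (p.matcher(term).matches()) {
                expanded.addAll(stemToTerms.getOrDefault(stemmer.stem(term), Collections.emptySet()));
            }
        }
        return expanded; // to be searched as an OR of TermQuerys
    }

    public static void main(String[] args) {
        // Toy stemmer: pretend "haus..." / "häus..." all stem to "hau".
        StemExpansionSketch index = new StemExpansionSketch(t -> t.replaceAll("^h[aä]us.*", "hau"));
        index.addTerm("haus");
        index.addTerm("häuser");
        System.out.println(index.expand("häus*")); // -> [haus, häuser]
    }
}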
This would give end users an extra option: turning query expansion on or off.

Your thoughts, please.

Kind regards,
Maurits.

----- Original Message -----
From: "Tatu Saloranta" <[EMAIL PROTECTED]>
To: "Lucene Developers List" <[EMAIL PROTECTED]>
Sent: Thursday, February 13, 2003 2:43 AM
Subject: Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()

> On Wednesday 12 February 2003 11:39, Christoph Kiehl wrote:
> > Hi Doug,
> >
> > > Also, I think we should lowercase prefix and wildcard queries by
> ...
> > > wildcard searches. What do others think?
> >
> > For the StandardAnalyzer this might work, but for the GermanAnalyzer, there
>
> Solving this problem should be easier after the refactoring: just override
> getPrefixQuery() and getWildcardQuery() (see below for one possible idea of
> what could be done).
>
> Another possibility would be to have another property for enabling use of the
> same analyzer used for normal terms for wildcard/prefix queries.
>
> However, using typical analyzers is not something one usually wants to do,
> for a couple of reasons:
>
> - Wildcards are discarded by the analyzer, so the wildcard query will get
>   broken (i.e. one needs a wildcard-aware analyzer).
> - Stemming can only be done for prefix queries (what is the stem of, say,
>   "hä*er"?), and even then it might not produce the stem one would want.
>   For example, the prefix query "men*" might be 'stemmed' to "man*", and the
>   user might be perplexed at why documents with words like "meningitis" and
>   "menstrual" did not match (ok, that is a contrived example, but I hope you
>   get the idea). In a way, you could think of the user as doing "manual
>   stemming", using a stem of a word with a prefix query.
>
> In the case of German, if umlaut chars are typically converted, perhaps you
> could create a GermanQueryParser.java that just extends the default query
> parser and does the necessary transformation for wildcard/prefix queries?
> Since there already exist separate language-dependent stemmers, this might
> make sense.
>
> > is also the problem with Umlauts (ä, ö, ü) turned into vowels (a, o, u)
> > while indexing. An example: "Häuser" is the plural of "Haus". If I index
> > "Häuser" it is stemmed to "hau". If I do, for example, a search for "häus*"
> > nothing is
>
> Not "haus"?
>
> > found, because "häus" is not stemmed. If I analyzed "häus*" I should get
> > "hau*". The problem is that now you do not only get "Häuser" but also
> > "Haus" as a result. But I think it is better to get more results than no
> > result. This is perhaps a special problem with the GermanAnalyzer. Maybe
> > there could be an option to use the Analyzer also for wildcard queries, so
> > I can turn it on in my case and it defaults to off.
> > Hope you understand my problem ;)
>
> Yes I do... I don't even dare to think of the problems a Finnish analyzer
> might have with stemming.
> :-)
>
> -+ Tatu +-
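PS: to make Tatu's GermanQueryParser suggestion above a bit more concrete, here is a rough, untested sketch. It assumes the refactored QueryParser ends up exposing protected getWildcardQuery(String field, String termStr) and getPrefixQuery(String field, String termStr) hooks, as discussed in the patch thread; the exact names and signatures may differ once the refactoring is in.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

/**
 * Sketch: apply the same case folding and umlaut substitution the
 * GermanAnalyzer applies at index time, but only to wildcard and prefix
 * terms (no stemming, since the wildcard already acts as a kind of
 * "manual stem").
 *
 * Assumes the refactored QueryParser exposes getWildcardQuery()/getPrefixQuery()
 * hooks with (field, termStr) signatures; adjust to whatever the patch finally uses.
 */
public class GermanQueryParser extends QueryParser {

    public GermanQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        return super.getWildcardQuery(field, normalize(termStr));
    }

    protected Query getPrefixQuery(String field, String termStr) throws ParseException {
        return super.getPrefixQuery(field, normalize(termStr));
    }

    /** Lowercase and fold umlauts the same way the index-time analyzer does. */
    private static String normalize(String termStr) {
        return termStr.toLowerCase()
                .replace('ä', 'a')
                .replace('ö', 'o')
                .replace('ü', 'u');
    }
}

Note that this only folds case and umlauts, it does not stem: "häus*" becomes "haus*", which (if I read the example above correctly) still will not match the stem "hau" that the GermanAnalyzer produces for "Häuser". That remaining gap is exactly what the stem-index expansion at the top of this message is meant to close.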
