On Thursday 13 March 2003 10:13, [EMAIL PROTECTED] wrote:

[ok ok, I'll be replying against the warnings]

> > http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17954
> > no hits when doing wildcard queries with words containing german umlauts ...
>
> ------- Additional Comments From [EMAIL PROTECTED] 2003-03-13 17:13 -------
> Oh, I meant the test case that includes the code.
> Since you sent HTML with umlauts, my guess is that something changes the
> tokens with umlauts on their way into the indexer (e.g. HTML parser, your
> analyzer, something else).
>
> I'm tempted to close this bug as INVALID, so please send a self-enclosed code
> sample that includes the indexing and searching parts and demonstrates the
> problem you are describing.
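For reference, a self-contained reproduction of the kind being asked for could look roughly like the sketch below. This is only an illustration written against a much newer Lucene API than the 1.x version this thread is about; the field name "content" and StandardAnalyzer are placeholders for whatever the reporter actually used, and the exact counts depend on that analyzer.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class UmlautWildcardDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        Analyzer analyzer = new StandardAnalyzer();

        // Index one document; the analyzer lowercases "Mörder" to "mörder".
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("content", "M\u00f6rder", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));

        // Wildcard terms are not run through the analyzer, so the raw term
        // "Mör*" never matches the indexed token "mörder" ...
        int raw = searcher.count(new WildcardQuery(new Term("content", "M\u00f6r*")));
        // ... while the manually lowercased term does.
        int folded = searcher.count(new WildcardQuery(new Term("content", "m\u00f6r*")));

        System.out.println("raw: " + raw + ", folded: " + folded);
    }
}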
Yes, it's very likely the difference between content that gets indexed through the analyzer and prefix/wildcard queries that don't get analyzed at all. Perhaps QueryParser just needs an (optional) secondary Analyzer (or perhaps two, actually, as prefix queries are easier to tokenize than full wildcard queries) that can be set so these terms get analyzed properly. As was previously discussed, using just the standard analyzer is not (and cannot be) 100% reliable, but experience suggests that with simple heuristics it often works well enough.

If anyone wants to work on this, another very useful piece would be a WildcardAnalyzer that does not treat '*' and '?' as stop chars but, say, as normal word characters. Combine that with lowercasing and, in the case of German, umlaut removal, and the reported problem should be solvable? (A rough sketch of that normalization is below.)

-+ Tatu +-
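To make the WildcardAnalyzer idea concrete, here is a rough, self-contained plain-Java sketch (not actual Lucene code; the class name is made up, and whether umlauts should fold to "a"/"o"/"u" or to "ae"/"oe"/"ue" depends entirely on what the indexing analyzer does): lowercase the term, remove umlauts the same way the index side does, and let '*' and '?' pass through as ordinary word characters.

public class WildcardTermNormalizer {

    // Normalizes a prefix/wildcard term the same way a lowercasing,
    // umlaut-removing analyzer would normalize tokens at index time,
    // while leaving '*' and '?' untouched.
    public static String normalize(String term) {
        StringBuffer out = new StringBuffer(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = Character.toLowerCase(term.charAt(i));
            switch (c) {
                case '\u00e4': out.append('a'); break;  // ä -> a
                case '\u00f6': out.append('o'); break;  // ö -> o
                case '\u00fc': out.append('u'); break;  // ü -> u
                default:
                    // '*' and '?' (and everything else) fall through
                    // unchanged, i.e. they are treated as ordinary word
                    // characters instead of being stripped.
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Mör*" becomes "mor*", which should now line up with index
        // terms produced by a lowercasing, umlaut-removing analyzer.
        System.out.println(normalize("M\u00f6r*"));
    }
}

The folded string could then be handed to QueryParser, or used directly to build a PrefixQuery or WildcardQuery against the same field.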
