On Thursday 13 March 2003 10:13, [EMAIL PROTECTED] wrote:

[ok ok, I'll be replying against the warnings]

> > http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17954
> > no hits when doing wildcard queries with words containing german umlauts ...
>
> ------- Additional Comments From [EMAIL PROTECTED] 2003-03-13 17:13 -------
> Oh, I meant the test case that includes the code.
> Since you sent HTML with umlauts, my guess is that something changes the
> tokens with umlauts on their way into the indexer (e.g. HTML parser, your
> analyzer, something else).
>
> I'm tempted to close this bug as INVALID, so please send a self-enclosed code
> sample that includes the indexing and searching parts and demonstrates the
> problem you are describing.
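For reference, a self-contained reproduction of the kind being asked for could look roughly like the sketch below. This is only an illustration written against a much newer Lucene API than the 1.x version this thread is about; the field name "content" and StandardAnalyzer are placeholders for whatever the reporter actually used, and the exact counts depend on that analyzer.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class UmlautWildcardDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        Analyzer analyzer = new StandardAnalyzer();

        // Index one document; the analyzer lowercases "Mörder" to "mörder".
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("content", "M\u00f6rder", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));

        // Wildcard terms are not run through the analyzer, so the raw term
        // "Mör*" never matches the indexed token "mörder" ...
        int raw = searcher.count(new WildcardQuery(new Term("content", "M\u00f6r*")));
        // ... while the manually lowercased term does.
        int folded = searcher.count(new WildcardQuery(new Term("content", "m\u00f6r*")));

        System.out.println("raw: " + raw + ", folded: " + folded);
    }
}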
Yes, it's very likely the difference between content that gets indexed through the analyzer and prefix/wildcard queries that don't get analyzed at all. Perhaps QueryParser just needs an (optional) secondary Analyzer (or perhaps two, actually, as prefix queries are easier to tokenize than full wildcard queries) that can be set so these terms get analyzed properly. As was previously discussed, using just the standard analyzer is not (and cannot be) 100% reliable, but experience suggests that with simple heuristics it often works well enough.

If anyone wants to work on this, another very useful piece would be a WildcardAnalyzer that does not treat '*' and '?' as stop chars but, say, as normal word characters. Combine that with lowercasing and, in the case of German, umlaut removal, and the reported problem should be solvable? (A rough sketch of that normalization is below.)

-+ Tatu +-
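To make the WildcardAnalyzer idea concrete, here is a rough, self-contained plain-Java sketch (not actual Lucene code; the class name is made up, and whether umlauts should fold to "a"/"o"/"u" or to "ae"/"oe"/"ue" depends entirely on what the indexing analyzer does): lowercase the term, remove umlauts the same way the index side does, and let '*' and '?' pass through as ordinary word characters.

public class WildcardTermNormalizer {

    // Normalizes a prefix/wildcard term the same way a lowercasing,
    // umlaut-removing analyzer would normalize tokens at index time,
    // while leaving '*' and '?' untouched.
    public static String normalize(String term) {
        StringBuffer out = new StringBuffer(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = Character.toLowerCase(term.charAt(i));
            switch (c) {
                case '\u00e4': out.append('a'); break;  // ä -> a
                case '\u00f6': out.append('o'); break;  // ö -> o
                case '\u00fc': out.append('u'); break;  // ü -> u
                default:
                    // '*' and '?' (and everything else) fall through
                    // unchanged, i.e. they are treated as ordinary word
                    // characters instead of being stripped.
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Mör*" becomes "mor*", which should now line up with index
        // terms produced by a lowercasing, umlaut-removing analyzer.
        System.out.println(normalize("M\u00f6r*"));
    }
}

The folded string could then be handed to QueryParser, or used directly to build a PrefixQuery or WildcardQuery against the same field.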
