Re: Problem with tokenizing/stemming in GermanAnalyzer

Gerhard Schwarz Mon, 17 Feb 2003 08:43:39 -0800

Christoph Kiehl wrote:

Hi Volker,

I have noticed a strange problem with capitalization. Search for
"computer" results in the token "compu". Search for "Computer",
however, results in "comput". The search is supposed to be
case-insensitive, so this must be a bug, right?

This problem was already mentioned on the developer list. The analyzer tries
to do some noun recognition. But it does a bad job ;)

The analyzer should not do any case recognition. After reading through the mailing list archives of the last weeks and months (I have been busy recently), I found that a very simple unique-discrimination algorithm is what most users need. The original algorithm offers more ways to extend it.

For now you could check out the current lucene version from cvs and just
comment out the following line:

 uppercase = Character.isUpperCase( term.charAt( 0 ) );

Then just run ant to build the jar. This fixes the problem you described.
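
To illustrate why that uppercase check makes the search case-sensitive, here is a minimal sketch. It is not the actual GermanStemmer code: the class name, the method names, and the toy "strip a trailing -er" rule are all made up for the example, and it does not reproduce the exact "compu"/"comput" tokens from the report. It only shows that a stemmer which branches on the case of the first letter returns different stems for "Computer" and "computer", while one that ignores case returns the same stem for both.

```java
import java.util.Locale;

public class CaseInsensitiveStemSketch {

    // Toy suffix rule, purely illustrative: drop a trailing "er" from longer terms.
    // This is NOT the real German stemming algorithm.
    static String stripSuffix(String term) {
        if (term.length() > 4 && term.endsWith("er")) {
            return term.substring(0, term.length() - 2);
        }
        return term;
    }

    // Hypothetical stemmer that guesses "noun" from a capital first letter
    // and then skips the suffix rule for such terms.
    static String stemWithNounGuess(String term) {
        if (term.isEmpty()) {
            return term;
        }
        boolean uppercase = Character.isUpperCase(term.charAt(0));
        String lower = term.toLowerCase(Locale.GERMAN);
        return uppercase ? lower : stripSuffix(lower);
    }

    // Same toy rule, but without looking at the case of the first letter.
    static String stemIgnoringCase(String term) {
        return stripSuffix(term.toLowerCase(Locale.GERMAN));
    }

    public static void main(String[] args) {
        for (String term : new String[] { "computer", "Computer" }) {
            System.out.println(term
                    + " -> with noun guess: " + stemWithNounGuess(term)
                    + ", ignoring case: " + stemIgnoringCase(term));
        }
    }
}
```

With the case check removed, index-time and query-time terms end up with the same stem regardless of how the user capitalizes the query, which is the behavior a case-insensitive search needs.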

I promise I will check the stemmer in the next few days... hm... not before this weekend, I have a martial arts challenge on Sunday. Mentally I'm not prepared to _fix_ anything. :)

There is another problem with the umlaut conversion that should also be checked.

Greets,
Gerhard
