RE: Normalization

Alex Murzaku Tue, 12 Mar 2002 13:10:27 -0800

Hi Rodrigo,

A couple of things that I should have warned you about in our discussion
yesterday.


The rules seem to be applied sequentially, with each rule modifying the
output of the previous one. This is risky, especially if the rule set
grows large, and the author of the rules has to keep it in mind at all
times. For example, there is a rule for "ons$" followed by one for
"ions$". The second one will never match, because by the time it is
tried the first rule has already changed the string. Even though aimons
and aimions should both be reduced to "em", they end up as "em" and
"emi". Maybe this could be solved by trying the longest match first.
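
To make the ordering problem concrete, here is a minimal sketch. The
three rules are my own stand-ins chosen to reproduce the aimons/aimions
behaviour, not the actual rule set or classes from the normalizer
archive, and sorting by pattern length is only a rough approximation of
"longest match first".

```java
import java.util.Arrays;
import java.util.Comparator;

public class RuleOrderDemo {
    // Hypothetical rules as {regex, replacement}; NOT the normalizer's real rules.
    public static final String[][] RULES = {
        {"ons$", ""},
        {"ions$", ""},
        {"ai", "e"},
    };

    // Apply every rule in order; each rule sees the output of the previous one.
    public static String applySequential(String word, String[][] rules) {
        String out = word;
        for (String[] rule : rules) {
            out = out.replaceAll(rule[0], rule[1]);
        }
        return out;
    }

    // Return a copy of the rules with longer patterns first.
    // Pattern length is a crude proxy for specificity (an assumption of this sketch).
    public static String[][] longestFirst(String[][] rules) {
        String[][] sorted = rules.clone();
        Arrays.sort(sorted,
            Comparator.comparingInt((String[] r) -> r[0].length()).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Original order: "ons$" fires before "ions$" can be tried.
        System.out.println(applySequential("aimons", RULES));   // em
        System.out.println(applySequential("aimions", RULES));  // emi

        // Longest pattern first: "ions$" is tried before "ons$".
        String[][] ordered = longestFirst(RULES);
        System.out.println(applySequential("aimons", ordered));  // em
        System.out.println(applySequential("aimions", ordered)); // em
    }
}
```

With the original order, "ons$" strips the ending of aimions and leaves
the "i" behind, so "ions$" has nothing left to match. Reordering fixes
this particular pair, but the rules are still applied one after another.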

The other consequence of sequential application is that the context can
change: an earlier rule may rewrite the characters that a later rule
depends on, so some rules can never be reached. I don't remember how we
got around this.
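
A small sketch of the context problem, again with made-up rules ("c" to
"k", then "ch" to "sh") rather than anything from the real rule set. The
single-pass variant is just one possible way to avoid it, and is not
necessarily the workaround I am failing to remember: it matches every
rule against the original input at each position and takes the longest
match, so no rule ever sees another rule's output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextChangeDemo {
    // Hypothetical rules as {regex, literal replacement}; for illustration only.
    public static final String[][] RULES = {
        {"c", "k"},
        {"ch", "sh"},
    };

    // Sequential application: "ch" is unreachable because "c" was already rewritten.
    public static String applySequential(String word) {
        String out = word;
        for (String[] rule : RULES) {
            out = out.replaceAll(rule[0], rule[1]);
        }
        return out;
    }

    // Single pass over the ORIGINAL input: at each position take the longest
    // matching rule, emit its replacement, and skip past the matched text.
    public static String applySinglePass(String word) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < word.length()) {
            int bestLen = 0;
            String bestRepl = null;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[0]).matcher(word);
                m.region(pos, word.length());
                // lookingAt() only accepts a match that starts at the region start.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRepl = rule[1];
                }
            }
            if (bestRepl != null) {
                out.append(bestRepl);
                pos += bestLen;
            } else {
                out.append(word.charAt(pos)); // no rule applies: copy the character
                pos++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(applySequential("chat"));  // khat
        System.out.println(applySinglePass("chat"));  // shat
    }
}
```

The price of the single pass is that rules can no longer build on each
other's output, which the current rule set may rely on.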

Alex



-----Original Message-----
From: Rodrigo Reyes [mailto:[EMAIL PROTECTED]] 
Sent: Tuesday, March 12, 2002 3:18 PM
To: Lucene Developers List
Subject: Re: Normalization



> Anyway, I'll try to add a few comments in the sourcecode (although
> it's very small, like 8 small classes) and package it so that the
> lucene developers can try it. Should be ready tomorrow.

Ok, please find enclosed the archive of the normalizer. To compile it,
just type "ant". To test the French normalizer, run "ant test-french",
or "ant test-soundex" for the soundex.

Rodrigo




