Hi Rodrigo and Brian,

The power of regex is desirable, especially for left- and right-context matching. As it is, you need to write many small rules, one for every possible combination. A regex would instead allow a single rule to cover most of the combinations. For example, you have a rule that removes "ation(s)" at the end of a word. That creates a stem like "n" for "nation(s)". This kind of problem could be resolved by having a way to define units bigger than a single letter, for example a syllable.
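To make the "nation(s)" case concrete, here is a minimal Python sketch of the idea, not the normalizer's actual rule syntax. The function names and the vowel class are assumptions chosen for illustration: the left context must contain at least one vowel (a rough stand-in for "at least one syllable") before the suffix is stripped.

```python
import re

# Hypothetical character class: letters treated as vowels for this sketch.
VOWELS = "aeiouy"

# Left context: any letters containing at least one vowel, then the suffix.
SUFFIX_RULE = re.compile(rf"(?P<stem>[a-z]*[{VOWELS}][a-z]*)ations?")


def naive_strip(word):
    """Context-free rule: always remove a trailing 'ation' or 'ations'."""
    return re.sub(r"ations?\Z", "", word.lower())


def strip_ation(word):
    """Remove the suffix only when the remaining stem contains a vowel."""
    match = SUFFIX_RULE.fullmatch(word.lower())
    return match.group("stem") if match else word.lower()


if __name__ == "__main__":
    for w in ("nation", "nations", "information", "relations"):
        print(w, "->", naive_strip(w), "|", strip_ation(w))
```

With the context-free rule "nation" collapses to "n", while the rule with a left-context class leaves "nation" untouched and still reduces "information" to "inform". A single pattern covers both the singular and plural suffix.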
The other feature that I have found useful is the ability to define classes of sounds (letters). You can work around it with enumeration, but sometimes it makes sense to be able to define groups of consonants, vowels, etc. But in the end, you are right, regex is too powerful. My point of view is that the people who use this tool, once they have spent the time to learn and understand it, will always aim at covering as many linguistic exceptions as possible. The present limitations could become frustrating.

Just my two lipas.

Alex

-----Original Message-----
From: Rodrigo Reyes [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 13, 2002 2:02 PM
To: Lucene Developers List
Subject: Re: Normalization

Hi Alex,

> Would it make sense to allow a full regex in the matching part? Could
> use regex or oromatcher packages. Don't know how that would affect
> your hashing though...

I'd give an answer not really different from Brian's: you don't really need all that power. Although I don't have significant experience with non-European languages, this is not the first tool of this kind I have written, and to my knowledge you don't really need more power than that. At least, not the kind of additional expressiveness that regexps provide (although, as I mentioned in another mail, you may need a restriction on the size of the input or output string; for example, soundex specifies a 4-letter limit that is not currently addressed by the language). However, I'd be very interested in hearing about counter-examples that would need it. The only counter-example I could find was the annoyance of having to remove sequences of the same letter, which was awkward, so I added an option called "uniquify" to do the job more easily (as you can see in the soundex or French normalizer).

Rodrigo

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
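
For readers following the thread, the two points in the quoted message ("uniquify" and the 4-letter soundex limit) can be sketched in a few lines of Python. This is only an illustration under assumptions: it assumes "uniquify" collapses each run of the same letter into one, and `fit_length` is a hypothetical helper, not part of the normalizer.

```python
from itertools import groupby


def uniquify(text):
    """Collapse each run of identical adjacent characters into one."""
    return "".join(char for char, _run in groupby(text))


def fit_length(code, size=4, pad="0"):
    """Truncate or right-pad a code to a fixed size (soundex uses 4)."""
    return code[:size].ljust(size, pad)


if __name__ == "__main__":
    print(uniquify("balloon"))
    print(fit_length("R2"))
    print(fit_length("A12345"))
```

Neither step needs regex-level power: the first is a single pass over adjacent characters, and the second is a fixed-size constraint on the output string.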
