Re: Bad behaviors of FrenchAnalyzer

Erik Hatcher Tue, 11 Oct 2005 07:14:38 -0700


On Oct 11, 2005, at 9:22 AM, Hugo Lafayette wrote:

- accented characters: The French analyzer keeps accents, which could
be useful, but may also become a nuisance. I just have to add the
ISOLatinFilter.java to correct that, but maybe adding an option to keep
them or not could be useful.
- apostrophe (') characters: The standard analyzer does NOT tokenize on
('), because of words like O'Reilly. But in French, lots of expressions
must be tokenized, like "j'aime" or "l'amour", which contain 2 tokens
each ("je" & "aime", "le" & "amour"). I'm quite surprised that
nobody else found that suspicious behavior before, so maybe I missed
something.

Rather than changing StandardAnalyzer, you could create a custom
Analyzer that is something along the lines of StandardTokenizer ->
custom apostrophe splitting filter -> ISOLatinFilter. You get a
special type for words with interior apostrophes from
StandardTokenizer (look at StandardFilter to see how that works).
You could create a simple TokenFilter that splits apostrophe'd tokens
into two. Maybe it's simple enough also to expand "j" and "l" into
"je" and "le" in the same step too?
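
A rough sketch of the splitting and expansion logic such a filter might
contain, written as plain Java so it runs without Lucene on the classpath.
Everything here is an assumption for illustration: the class and method
names are hypothetical, whitespace splitting stands in for
StandardTokenizer, and java.text.Normalizer stands in for the accent
filter. It also splits only when the prefix is a known elision, so
O'Reilly-style words stay in one piece, which is one possible policy
rather than the only one.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: the per-token logic a custom
// apostrophe-splitting TokenFilter could apply.
final class FrenchElisionSketch {

    // Only the two prefixes mentioned in this thread. Expanding "l" to "le"
    // is lossy, since "l'" can also stand for "la".
    private static final Map<String, String> ELISIONS = Map.of("j", "je", "l", "le");

    // Splits "j'aime" into ["je", "aime"]; leaves other tokens untouched.
    static List<String> splitElision(String token) {
        int idx = token.indexOf('\'');
        if (idx <= 0 || idx == token.length() - 1) {
            return List.of(token); // no interior apostrophe
        }
        String expanded = ELISIONS.get(token.substring(0, idx));
        if (expanded == null) {
            return List.of(token); // unknown prefix, e.g. "o'reilly"
        }
        return List.of(expanded, token.substring(idx + 1));
    }

    // Stand-in for the accent filter: decompose, then drop combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Lowercase, split on whitespace, split elisions, strip accents.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.FRENCH).split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            for (String part : splitElision(raw)) {
                out.add(stripAccents(part));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("j'aime l'\u00e9t\u00e9")); // [je, aime, le, ete]
        System.out.println(analyze("O'Reilly"));               // [o'reilly]
    }
}
```

In a real TokenFilter the same decision would be made per token inside
the filter's token-advancing method, with the second half of a split
token buffered and returned on the following call.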


    Erik


