On Oct 11, 2005, at 9:22 AM, Hugo Lafayette wrote:
- accented characters: The French analyzer keeps accents, which could
be useful, but may also become annoying. I just have to add the
ISOLatinFilter.java to correct that, but maybe adding an option to keep
them or not could be useful.

- apostrophe (') characters: The standard analyzer does NOT tokenize on ('), because of words like O'Reilly. But in French, lots of expressions must be tokenized, like "j'aime" or "l'amour", which each contain 2 tokens ("je" & "aime", "le" & "amour"). I'm quite surprised that
nobody else found that suspicious behavior before, so maybe I missed
something.

Rather than changing StandardAnalyzer, you could create a custom Analyzer that is something along the lines of StandardTokenizer -> custom apostrophe splitting filter -> ISOLatinFilter. You get a special type for words with interior apostrophes from StandardTokenizer (look at StandardFilter to see how that works). You could create a simple TokenFilter that splits apostrophe'd tokens into two. Maybe it's simple enough also to expand "j" and "l" into "je" and "le" in the same step too?
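
As a rough illustration of the splitting step only, here is a minimal sketch in plain Java with no Lucene dependency. The class name ElisionSplitter, the split() method and the two-entry expansion table are all hypothetical; in a real analyzer this logic would sit inside a TokenFilter subclass, whose exact API depends on the Lucene version. The sketch assumes that only known elided prefixes should be split, so names like O'Reilly stay whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: splits a token on an interior
// apostrophe and expands the elided prefix ("j" -> "je", "l" -> "le").
final class ElisionSplitter {

    // Assumed, deliberately short table taken from the examples in this thread.
    // Mapping "l" to "le" ignores the "la" case.
    private static final Map<String, String> EXPANSIONS = Map.of(
            "j", "je",
            "l", "le");

    static List<String> split(String token) {
        List<String> out = new ArrayList<>();
        int apos = token.indexOf('\'');

        // No interior apostrophe: pass the token through unchanged.
        if (apos <= 0 || apos == token.length() - 1) {
            out.add(token);
            return out;
        }

        String head = token.substring(0, apos);
        String tail = token.substring(apos + 1);
        String expanded = EXPANSIONS.get(head.toLowerCase(Locale.ROOT));

        // Unknown prefix (for example O'Reilly): keep the token whole.
        if (expanded == null) {
            out.add(token);
            return out;
        }

        out.add(expanded);
        out.add(tail);
        return out;
    }
}
```

With this sketch, ElisionSplitter.split("j'aime") yields the two tokens "je" and "aime", while "O'Reilly" comes back as a single token. Accent removal would still be a separate step afterwards.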

    Erik

