Hi there,

I just tested the French analyzer, which works well for the most part (the stemmer in particular). But at the moment, I see two unexpected behaviors with the default configuration:
- Accented characters: the French analyzer keeps accents, which can be useful but can also become a nuisance. I just have to add the ISOLatinFilter.java to correct that, but an option to keep them or not might be useful.
- Apostrophe (') characters: the standard analyzer does NOT tokenize on ('), because of words like O'Reilly. But in French, many expressions must be split, such as "j'aime" or "l'amour", which each contain 2 tokens ("je" & "aime", "le" & "amour").

I'm quite surprised that nobody else has noticed this suspicious behavior before, so maybe I missed something. Anyway, I don't know how to proceed, since I have to index both English and French text.

The simple way would be to change the standard analyzer grammar (basically remove the APOSTROPHE rules) to get 2 tokens. But I'm afraid of unexpected side effects.

The other way would be to make the French analyzer further tokenize "j'aime" into 2 sub-tokens (with a token buffer, right?).

Is that the right thing to do? Is this a bug that will be corrected soon? Is there another way around it?

Thanks in advance for your answers, and congrats on your delightful software!

PS: I'm working with the "lucene-1.9-rc1-dev" version from the svn repository.

--
Hugo
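
To make the second option more concrete, here is a minimal sketch in plain Java, with no Lucene dependency, of the two behaviors discussed above: splitting a token on an elided French article or pronoun while leaving words like O'Reilly alone, and stripping accents. The class name ElisionSketch, the methods splitElision and stripAccents, and the ELIDED list are all hypothetical names chosen for illustration; none of them come from Lucene. In a real analyzer, I assume this logic would live in a token filter placed after the standard tokenizer.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ElisionSketch {
    // Hypothetical, incomplete list of elided French articles and pronouns.
    private static final Set<String> ELIDED =
            Set.of("l", "d", "j", "m", "t", "s", "n", "c", "qu");

    // Splits "j'aime" into ["j", "aime"]; any other token is returned unchanged.
    public static List<String> splitElision(String token) {
        int i = token.indexOf('\'');
        boolean hasBothSides = i > 0 && i < token.length() - 1;
        if (hasBothSides
                && ELIDED.contains(token.substring(0, i).toLowerCase(Locale.ROOT))) {
            return List.of(token.substring(0, i), token.substring(i + 1));
        }
        return List.of(token);
    }

    // Removes accents by decomposing characters (NFD) and dropping combining marks.
    public static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(splitElision("j'aime"));         // [j, aime]
        System.out.println(splitElision("O'Reilly"));       // [O'Reilly]
        System.out.println(stripAccents("\u00e9t\u00e9"));  // ete
    }
}
```

Note that this sketch emits "j" rather than "je": mapping an elided form back to its full form would need an extra lookup table. It would also wrongly split proper names that start with one of the listed prefixes, such as D'Artagnan, so the list is a trade-off rather than a complete answer.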