On Oct 11, 2005, at 9:22 AM, Hugo Lafayette wrote:
- accented characters: The French analyzer keeps accents, which could
be useful, but may also become annoying. I just have to add the
ISOLatinFilter.java to correct that, but an option to keep them or
not might be useful.
- apostrophe (') characters: The standard analyzer does NOT tokenize
on ('), because of words like O'Reilly. But in French, lots of
expressions must be tokenized, like "j'aime" or "l'amour", which
contain 2 tokens each ("je" & "aime", "le" & "amour"). I'm quite
surprised that nobody else found that suspicious behavior before, so
maybe I missed something.

Rather than changing StandardAnalyzer, you could create a custom
Analyzer that is something along the lines of StandardTokenizer ->
custom apostrophe splitting filter -> ISOLatinFilter. You get a
special type for words with interior apostrophes from
StandardTokenizer (look at StandardFilter to see how that works).
You could create a simple TokenFilter that splits apostrophe'd tokens
into two. Maybe it's simple enough also to expand "j" and "l" into
"je" and "le" in the same step too?
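
To make the idea concrete, here is a rough, standalone sketch of the
split-and-expand step in plain Java. It is not a Lucene TokenFilter and
uses no Lucene classes; the class name ElisionSketch, its method names,
and the expansion table are all made up for illustration. It also
splits only on prefixes it knows about, so that O'Reilly stays whole,
which is one possible choice rather than the only one. The accent
stripping uses java.text.Normalizer as a stand-in for ISOLatinFilter
and will not behave identically (ligatures, for example, are left
untouched).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ElisionSketch {
    // Hypothetical expansion table. Only "j" and "l" come from this thread.
    // Note that "l'" can also stand for "la", so "le" is a simplification.
    private static final Map<String, String> EXPANSIONS =
            Map.of("j", "je", "l", "le");

    /** Splits a token at its first interior apostrophe if the prefix is a known elision. */
    public static List<String> splitElision(String token) {
        int i = token.indexOf('\'');
        // No apostrophe, or an apostrophe at either end: leave the token alone.
        if (i <= 0 || i == token.length() - 1) {
            return List.of(token);
        }
        String prefix = token.substring(0, i).toLowerCase(Locale.ROOT);
        String expanded = EXPANSIONS.get(prefix);
        if (expanded == null) {
            // Unknown prefix (e.g. "O'Reilly"): keep the token intact.
            return List.of(token);
        }
        return List.of(expanded, token.substring(i + 1));
    }

    /** Removes combining marks after canonical decomposition (assumed stand-in for accent folding). */
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    /** Chains the two steps: split/expand first, then strip accents from each part. */
    public static List<String> analyze(String token) {
        List<String> out = new ArrayList<>();
        for (String part : splitElision(token)) {
            out.add(stripAccents(part));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : new String[] {"j'aime", "l'amour", "O'Reilly"}) {
            System.out.println(t + " -> " + analyze(t));
        }
    }
}
```

In a real analyzer the same logic would live in a TokenFilter placed
after StandardTokenizer, and it would also need to set sensible
offsets and types on the two tokens it emits, which this sketch
ignores.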
Erik