Ahh... you are right about French, and Spanish should work minimally well. German should be fine.
On Sat, Nov 20, 2010 at 5:13 AM, Robert Muir <rcm...@gmail.com> wrote:

> On Fri, Nov 19, 2010 at 6:45 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > The only proviso is that stemming and word segmentation might break if you
> > change characters before stemming. I don't think that would happen in
> > English, French, Spanish, German, the Slavic languages that use Latin
> > characters or the Scandinavian languages. I am not entirely sure about
> > Finnish and Hungarian.
>
> removing accents before stemming *will* break stemmers in basically
> all of these languages, depending upon the stemmer.
> For the snowball stemmers especially, the rules/affix lists are
> sensitive to diacritics. You can see this in the description of the
> rules here (example french):
> http://snowball.tartarus.org/algorithms/french/stemmer.html
>
> I disagree with Hoss on this issue, removing diacritics in a filter is
> not going to "mess up highlighting". The offsets are set by the
> tokenizer. So its no different than stemming or any other process.
> The *only* situation where you should use a CharFilter, is when you
> must change this stuff before the tokenizer.
>
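
To make the two points in the quoted message concrete, here is a rough Python sketch. It is not Lucene or Snowball code: `tokenize`, `fold_accents` and `toy_stem` are hypothetical helpers, and the single suffix rule is a made-up stand-in for a diacritic-sensitive stemmer rule. It only illustrates that offsets recorded at tokenization survive a later accent-folding step, and that folding before a diacritic-sensitive rule can stop the rule from firing.

```python
import re
import unicodedata

# Hypothetical suffix for a toy rule; real Snowball rules are far richer.
SUFFIX = unicodedata.normalize("NFC", "ité")


def tokenize(text):
    """Return (term, start, end) tuples; offsets index into the original text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def fold_accents(term):
    """Drop combining marks after NFD decomposition, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(term):
    """Toy one-rule stemmer (an assumption, not Snowball): strip a trailing 'ité'."""
    return term[: -len(SUFFIX)] if term.endswith(SUFFIX) else term


text = unicodedata.normalize("NFC", "la qualité élevée")
tokens = tokenize(text)

# Folding inside a token filter changes the term but leaves the offsets alone,
# so slicing the original text with them still yields the accented word.
folded = [(fold_accents(term), start, end) for term, start, end in tokens]

# Order matters for the toy rule: it only matches while the accent is present.
stem_then_fold = [fold_accents(toy_stem(term)) for term, _, _ in tokens]
fold_then_stem = [toy_stem(fold_accents(term)) for term, _, _ in tokens]
```

In this sketch the folded terms keep the offsets of the accented originals, while folding first leaves the second word unstemmed because the toy rule no longer matches.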