On Fri, Nov 19, 2010 at 6:45 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> The only proviso is that stemming and word segmentation might break if you
> change characters before stemming. I don't think that would happen in
> English, French, Spanish, German, the Slavic languages that use Latin
> characters or the Scandinavian languages. I am not entirely sure about
> Finnish and Hungarian.

Removing accents before stemming *will* break stemmers in basically all of these languages, depending upon the stemmer. The Snowball stemmers in particular have rules and affix lists that are sensitive to diacritics. You can see this in the description of the rules (French, for example):

http://snowball.tartarus.org/algorithms/french/stemmer.html

I disagree with Hoss on this issue: removing diacritics in a filter is not going to "mess up highlighting". The offsets are set by the tokenizer, so it's no different from stemming or any other process that changes the term text after tokenization.

The *only* situation where you should use a CharFilter is when you must change this stuff before the tokenizer.
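
To make both points concrete, here is a rough Python sketch. It is not Lucene code: tokenize, fold_accents and toy_stem are made-up helpers for illustration, and toy_stem applies a single suffix rule loosely modelled on the "ité" suffix in the French Snowball description, not the real algorithm.

```python
import re
import unicodedata

E = "\u00e9"  # precomposed e-acute, written as an escape to avoid encoding surprises

def tokenize(text):
    """Yield (term, start, end). Offsets are fixed here, against the original text."""
    for m in re.finditer(r"\w+", text):
        yield m.group(), m.start(), m.end()

def fold_accents(term):
    """Drop combining marks after NFD decomposition, so e-acute becomes plain e."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_stem(term):
    """Hypothetical one-rule stemmer: strip a trailing 'ité' or 'ités'."""
    for suffix in ("it" + E + "s", "it" + E):
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term

text = "La qualit" + E + " du caf" + E

# Folding as a token filter: the term text changes, the offsets do not.
folded = [(fold_accents(term), start, end) for term, start, end in tokenize(text)]
term, start, end = folded[1]
highlighted = text[start:end]  # still the accented word from the original text

# Order matters: fold first and the accent-sensitive suffix rule never fires.
word = "qualit" + E
stem_then_fold = fold_accents(toy_stem(word))
fold_then_stem = toy_stem(fold_accents(word))
```

In this sketch the folded term is "qualite" but its offsets still select the accented word in the original text, which is why highlighting is unaffected. Stemming and then folding gives "qual", while folding first leaves "qualite" untouched because the suffix rule no longer matches.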