Hi,
Did you take a look at ISOLatin1AccentFilter?
Patrick
On 11/6/06, hans meiser <[EMAIL PROTECTED]> wrote:
Hi,
I am using Lucene to index documents in three languages
(English, German and French). I want to normalize some
characters such as umlauts, e.g. ä -> ae.
I did it in the following way:
New Analyzer:
public class SpecialCharsAnalyzer extends StandardAnalyzer {

    public SpecialCharsAnalyzer() {
    }

    public SpecialCharsAnalyzer(Set stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(String[] stopWords) {
        super(stopWords);
    }

    public SpecialCharsAnalyzer(File stopwords) throws IOException {
        super(stopwords);
    }

    public SpecialCharsAnalyzer(Reader stopwords) throws IOException {
        super(stopwords);
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = super.tokenStream(fieldName, reader);
        ts = new SpecialCharacterFilter(ts);
        return ts;
    }
}
Is the SpecialCharsAnalyzer::tokenStream implemented correctly?
New Filter:
public class SpecialCharacterFilter extends TokenFilter {

    public SpecialCharacterFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next() throws IOException {
        Token t = input.next();
        if (t == null)
            return null;
        String str = t.termText();
        if (str.indexOf("ä") != -1) {
            str = str.replaceAll("ä", "ae");
            t = new Token(str, t.startOffset(), t.endOffset() + 1);
        }
        return t;
    }
}
Is the SpecialCharacterFilter::next implemented correctly,
in case of the "ä"?
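One point worth checking in the filter above is the `t.endOffset() + 1`. Token offsets in Lucene refer to positions in the original text (they are what highlighting uses), so rewriting the term text should normally leave them unchanged. The constructor used above also does not carry over the token type or position increment. The following is only a sketch of the offset idea in plain Java, without Lucene on the classpath: `SimpleToken` and `OffsetPreservingNormalizer` are hypothetical names made up for illustration, not Lucene classes.

```java
import java.util.Objects;

/** Hypothetical stand-in for a token: term text plus offsets into the ORIGINAL input. */
final class SimpleToken {
    final String text;
    final int startOffset;
    final int endOffset;

    SimpleToken(String text, int startOffset, int endOffset) {
        this.text = Objects.requireNonNull(text);
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

/** Hypothetical helper: rewrites the term text but never touches the offsets. */
final class OffsetPreservingNormalizer {
    private OffsetPreservingNormalizer() {}

    static SimpleToken normalize(SimpleToken t) {
        // '\u00E4' is a-umlaut, written as an escape to stay encoding-independent.
        if (t.text.indexOf('\u00E4') == -1) {
            return t; // nothing to rewrite, pass the token through unchanged
        }
        // String.replace does a literal replacement; no regex is involved.
        String rewritten = t.text.replace("\u00E4", "ae");
        // The term got one character longer, but the offsets still describe
        // where the token sits in the original text, so they are copied as-is.
        return new SimpleToken(rewritten, t.startOffset, t.endOffset);
    }
}
```

With this sketch, a token "Mädchen" at offsets 10 to 17 comes back as "Maedchen" with the same offsets 10 to 17. In a real TokenFilter the same idea would presumably mean passing `t.endOffset()` through unchanged and also copying the type and position increment from the old token.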
Is this the right way to do normalisation?
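For reference, the full German mapping (not just "ä") could be done in a single pass over the term text. This is a minimal sketch under my own assumptions: `GermanUmlautFolder` is a hypothetical helper name, the table covers only the German umlauts and ß, and French accents such as é are left untouched. Note that ISOLatin1AccentFilter, as far as I know, folds ä to a plain "a" rather than "ae", so it is not a drop-in replacement for this mapping.

```java
/** Hypothetical helper: folds German umlauts and sharp s to ASCII digraphs. */
final class GermanUmlautFolder {
    private GermanUmlautFolder() {}

    static String fold(String s) {
        // Output can only grow, so reserve a little extra room up front.
        StringBuilder sb = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00E4': sb.append("ae"); break; // a-umlaut
                case '\u00F6': sb.append("oe"); break; // o-umlaut
                case '\u00FC': sb.append("ue"); break; // u-umlaut
                case '\u00C4': sb.append("Ae"); break; // capital A-umlaut
                case '\u00D6': sb.append("Oe"); break; // capital O-umlaut
                case '\u00DC': sb.append("Ue"); break; // capital U-umlaut
                case '\u00DF': sb.append("ss"); break; // sharp s
                default:       sb.append(c);           // everything else unchanged
            }
        }
        return sb.toString();
    }
}
```

Whatever mapping is chosen, it has to be applied both when indexing and when parsing queries (i.e. the same analyzer on both sides), otherwise a search for "Mädchen" will not match the indexed "Maedchen".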
thx