Hi all,
I'm new to Lucene and have tried to write my own analyzer to support
hyphenated words like wi-fi, jean-pierre, etc.
For our customer it is important that such words can be found as follows:
- wi-fi by wi, fi, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*
The analyzer:
public class SupportHyphenatedWordsAnalyzer extends Analyzer {

    protected NormalizeCharMap charConvertMap;

    // The constructor name must match the class name.
    public SupportHyphenatedWordsAnalyzer() {
        initCharConvertMap();
    }

    // Strip double quotes from the input before tokenizing.
    protected void initCharConvertMap() {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("\"", "");
        charConvertMap = builder.build();
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        final Tokenizer src = new WhitespaceTokenizer();
        // Keep the original token and also emit its parts and the catenated form.
        TokenStream tok = new WordDelimiterFilter(src,
                WordDelimiterFilter.PRESERVE_ORIGINAL
                        | WordDelimiterFilter.GENERATE_WORD_PARTS
                        | WordDelimiterFilter.GENERATE_NUMBER_PARTS
                        | WordDelimiterFilter.CATENATE_WORDS,
                null);
        tok = new LowerCaseFilter(tok);
        tok = new LengthFilter(tok, 1, 255);
        tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new MappingCharFilter(charConvertMap, reader);
    }
}
The analyzer seems to work, except for exact phrase match queries.
For example, the following values are indexed:
FD-A320-REC-SIM-1
FD-A320-REC-SIM-10
FD-A320-REC-SIM-11
MIA-FD-A320-REC-SIM-1
SIN-FD-A320-REC-SIM-1
The (exact) query "FD-A320-REC-SIM-1" returns:
FD-A320-REC-SIM-1
MIA-FD-A320-REC-SIM-1
SIN-FD-A320-REC-SIM-1
For our customer this is wrong, because this exact phrase match
query should return only the single entry FD-A320-REC-SIM-1.
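A simplified, Lucene-free sketch of what may be happening here. It assumes that each value is split into lowercase parts at the hyphens, and that a phrase query only needs its parts to appear consecutively somewhere in the indexed value. The class and method names are made up for illustration. The real WordDelimiterFilter also emits the original and catenated tokens, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseOverlapSketch {

    // Hypothetical stand-in for the analysis chain: split on hyphens, lowercase.
    static List<String> wordParts(String value) {
        List<String> parts = new ArrayList<>();
        for (String part : value.split("-")) {
            if (!part.isEmpty()) {
                parts.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }

    // True if the query's parts occur as a contiguous run inside the indexed parts.
    static boolean matchesAsPhrase(String indexed, String query) {
        return Collections.indexOfSubList(wordParts(indexed), wordParts(query)) >= 0;
    }

    public static void main(String[] args) {
        String query = "FD-A320-REC-SIM-1";
        List<String> indexedValues = Arrays.asList(
                "FD-A320-REC-SIM-1",
                "FD-A320-REC-SIM-10",
                "FD-A320-REC-SIM-11",
                "MIA-FD-A320-REC-SIM-1",
                "SIN-FD-A320-REC-SIM-1");
        for (String indexed : indexedValues) {
            System.out.println(indexed + " -> " + matchesAsPhrase(indexed, query));
        }
    }
}
```

Under these assumptions the sketch reports a match for the same three values listed above: the parts of the query appear unchanged at the end of the MIA- and SIN- values, while the -10 and -11 values differ in their last part.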
Do you have any ideas or tips on how we should change our current
analyzer to support this requirement?
Thanks and kind regards,
Diego