Analyzer for supporting hyphenated words

Diego Socaceti Fri, 17 Jul 2015 01:42:36 -0700

Hi all,

I'm new to Lucene and have tried to write my own analyzer to support
hyphenated words like wi-fi, jean-pierre, etc.
For our customer it is important to find the word
- wi-fi by wi, fi, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*





The analyzer:
public class SupportHyphenatedWordsAnalyzer extends Analyzer {

  protected NormalizeCharMap charConvertMap;

  // Constructor name must match the class name to compile.
  public SupportHyphenatedWordsAnalyzer() {
    initCharConvertMap();
  }

  protected void initCharConvertMap() {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\"", "");
    charConvertMap = builder.build();
  }

  @Override
  protected TokenStreamComponents createComponents(final String fieldName) {

    final Tokenizer src = new WhitespaceTokenizer();

    TokenStream tok = new WordDelimiterFilter(src,
        WordDelimiterFilter.PRESERVE_ORIGINAL
            | WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_WORDS,
        null);
    tok = new LowerCaseFilter(tok);
    tok = new LengthFilter(tok, 1, 255);
    tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

    return new TokenStreamComponents(src, tok);
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    return new MappingCharFilter(charConvertMap, reader);
  }
}





The analyzer seems to work, except for exact phrase match queries.

For example, the following terms are indexed:

FD-A320-REC-SIM-1
FD-A320-REC-SIM-10
FD-A320-REC-SIM-11
MIA-FD-A320-REC-SIM-1
SIN-FD-A320-REC-SIM-1


The (exact) query "FD-A320-REC-SIM-1" returns:
FD-A320-REC-SIM-1
MIA-FD-A320-REC-SIM-1
SIN-FD-A320-REC-SIM-1

For our customer this is wrong, because this exact phrase match
query should return only the single entry FD-A320-REC-SIM-1.
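
To illustrate what seems to be happening, here is a simplified model in
plain Java, without Lucene. It assumes that each term is indexed as its
lowercased parts split on '-', and that a phrase query matches when the
query parts appear as a contiguous run of the indexed parts. The class
and method names (PhraseMatchSketch, parts, phraseMatches, search) are
made up for this sketch, and the real WordDelimiterFilter emits more
tokens than this, so treat it as an approximation only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseMatchSketch {

    // Assumed stand-in for the generated word parts: lowercase, split on '-'.
    static List<String> parts(String term) {
        return Arrays.asList(term.toLowerCase(Locale.ROOT).split("-"));
    }

    // True if the query parts occur as a contiguous run inside the indexed parts.
    static boolean phraseMatches(String indexed, String query) {
        return Collections.indexOfSubList(parts(indexed), parts(query)) >= 0;
    }

    // Returns every indexed term that the modelled phrase query would hit.
    static List<String> search(List<String> indexed, String query) {
        List<String> hits = new ArrayList<>();
        for (String term : indexed) {
            if (phraseMatches(term, query)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
                "FD-A320-REC-SIM-1",
                "FD-A320-REC-SIM-10",
                "FD-A320-REC-SIM-11",
                "MIA-FD-A320-REC-SIM-1",
                "SIN-FD-A320-REC-SIM-1");
        // Prints [FD-A320-REC-SIM-1, MIA-FD-A320-REC-SIM-1, SIN-FD-A320-REC-SIM-1]
        System.out.println(search(indexed, "FD-A320-REC-SIM-1"));
    }
}
```

In this model the parts of the query are a sub-sequence of the parts of
MIA-FD-A320-REC-SIM-1 and SIN-FD-A320-REC-SIM-1, which would explain why
those entries are returned, while the entries ending in 10 and 11 are not.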

Do you have any ideas or tips on how we should change our current
analyzer to support this requirement?


Thanks and kind regards,
Diego
