On 07/15/2013 07:50 PM, Malgorzata Urbanska wrote:
> Hi,
>
> I've been trying to figure out how to use ngrams in Lucene 4.3.0
> I found some examples for earlier versions, but I'm still confused.
> As I understand it, I should:
> 1. create a new analyzer which uses ngrams
> 2. apply it to my indexer
> 3. search using the same analyzer
>
> I found NGramTokenFilter and NGramTokenizer in the documentation, but I
> do not understand the difference between them.
This should be helpful:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Tokenizers
In short: NGramTokenizer works directly on the character stream and emits
n-grams from the raw text, while NGramTokenFilter runs after another
tokenizer and breaks each token it receives into n-grams.
Here is an example of an n-gram analyzer:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public class NGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        // Emit character 3-grams (min = max = 3) straight from the reader.
        Tokenizer src = new NGramTokenizer(reader, 3, 3);
        TokenStream tok = new StandardFilter(Version.LUCENE_43, src);
        tok = new LowerCaseFilter(Version.LUCENE_43, tok);
        return new TokenStreamComponents(src, tok);
    }
}
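For intuition, here is a small plain-Java sketch (no Lucene dependency;
the class and method names are my own, not Lucene API) of the character
n-grams that NGramTokenizer(reader, 3, 3) produces: a window of length 3
slides one character at a time over the raw input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Lucene: illustrates the character
// n-grams a 3..3 NGramTokenizer would emit for a given input string.
public class CharNGrams {
    public static List<String> ngrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            // Slide a window of length n across the whole text.
            for (int i = 0; i + n <= text.length(); i++) {
                out.add(text.substring(i, i + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "lucene" -> [luc, uce, cen, ene]
        System.out.println(ngrams("lucene", 3, 3));
    }
}
```

Note that this works on the raw character stream, so for multi-word input
the grams can span a space -- that is exactly the tokenizer/filter
difference the question asks about.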
If, for example, you want to remove stop words from the document before
breaking it into n-grams, then you would need:
reader(document) -> SomeTokenizer -> StopFilter -> NGramTokenFilter
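That pipeline can be approximated by hand in plain Java (again, a sketch
with made-up names, not Lucene API): split into tokens, drop stop words,
then build n-grams inside each surviving token -- which is what
NGramTokenFilter does, never producing grams that cross token boundaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of: tokenizer -> StopFilter -> NGramTokenFilter.
public class PipelineSketch {
    // Tiny illustrative stop-word set (a real StopFilter uses a fuller list).
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    public static List<String> analyze(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(token)) continue; // StopFilter step
            // NGramTokenFilter step: n-grams within this token only.
            for (int i = 0; i + n <= token.length(); i++) {
                grams.add(token.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // "the" is dropped; grams come from "search" and "engine" separately.
        System.out.println(analyze("the search engine", 3));
    }
}
```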
Regards,
Ivan Krišto
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]