Hi, I am using apache lucene 3.0.2 and searching for an optimal analyzer to search for best matching http user agents. Imagine, that we store following http user agents in a field:
Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30) Now as search query a best matching agent for the following input should be returned: Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0) From my natural view the Mozilla/4.0 is the best fit result. What analyzer do I need to use to store and find it? The text not natural, so I need some kind of n gram search (I guess). My initial setup does not return it at all: String agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"; final static Analyzer analyzer = new NGramAnalyzer(2, 4); final Document doc = new Document(); doc.add(new Field("agent", agent, Field.Store.YES, Field.Index.ANALYZED)); ... final QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer); final Query query = parser.parse("Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)"); final TopScoreDocCollector collector = TopScoreDocCollector.create(50, true); searcher.search(query, collector); NGramAnalyzer is defined as: import java.io.Reader; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.TokenStream; import org.apache.lucene.analysis.ngram.NGramTokenizer; public class NGramAnalyzer extends Analyzer { private final int minGram; private final int maxGram; public NGramAnalyzer(final int minGram, final int maxGram) { this.minGram = minGram; this.maxGram = maxGram; } @Override public TokenStream tokenStream(final String fieldName, final Reader reader) { return new NGramTokenizer(reader, minGram, maxGram); } } Thank you very much for a solution or any other approach. Maciej --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org