Hi,

I am using Apache Lucene 3.0.2 and am looking for the best analyzer for finding
the closest-matching HTTP user agent. Suppose the following HTTP user agents
are stored in a field:

Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)

Given the following input as the search query, the best-matching agent should
be returned:

Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)

To my eye, the Mozilla/4.0 entry is the best fit. Which analyzer do I need to
use to index and find it? The text is not natural language, so I guess I need
some kind of n-gram search. My initial setup does not return it at all:

String agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)";
final static Analyzer analyzer = new NGramAnalyzer(2, 4);
final Document doc = new Document();
doc.add(new Field("agent", agent, Field.Store.YES, Field.Index.ANALYZED));
...
final QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
final Query query = parser.parse("Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)");
final TopScoreDocCollector collector = TopScoreDocCollector.create(50, true);
searcher.search(query, collector);

NGramAnalyzer is defined as:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramAnalyzer extends Analyzer {

        private final int minGram;
        private final int maxGram;

        public NGramAnalyzer(final int minGram, final int maxGram) {
                this.minGram = minGram;
                this.maxGram = maxGram;
        }

        @Override
        public TokenStream tokenStream(final String fieldName, final Reader reader) {
                return new NGramTokenizer(reader, minGram, maxGram);
        }
}
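
To make "best fit" concrete, here is a minimal plain-Java sketch of the ranking
I have in mind. It does not use Lucene at all. The class and method names
(NGramOverlap, grams, similarity, bestMatch) and the choice of Jaccard overlap
as the score are only my own illustration and an assumption about what a
sensible ranking could look like, not how Lucene scores documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: ranks strings by shared character n-grams.
final class NGramOverlap {

    // Collects the distinct character n-grams of every length in [minGram, maxGram].
    static Set<String> grams(final String text, final int minGram, final int maxGram) {
        final Set<String> result = new HashSet<String>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                result.add(text.substring(i, i + n));
            }
        }
        return result;
    }

    // Jaccard similarity: number of shared n-grams divided by number of distinct n-grams overall.
    static double similarity(final String a, final String b, final int minGram, final int maxGram) {
        final Set<String> gramsA = grams(a, minGram, maxGram);
        final Set<String> gramsB = grams(b, minGram, maxGram);
        final Set<String> union = new HashSet<String>(gramsA);
        union.addAll(gramsB);
        if (union.isEmpty()) {
            return 0.0; // both inputs are shorter than minGram
        }
        final Set<String> shared = new HashSet<String>(gramsA);
        shared.retainAll(gramsB);
        return (double) shared.size() / union.size();
    }

    // Returns the candidate with the highest similarity to the query, or null if there are none.
    static String bestMatch(final String query, final List<String> candidates,
                            final int minGram, final int maxGram) {
        String best = null;
        double bestScore = -1.0;
        for (final String candidate : candidates) {
            final double score = similarity(query, candidate, minGram, maxGram);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(final String[] args) {
        final List<String> agents = Arrays.asList(
                "Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6c",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
                "Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)");
        final String query = "Mozilla/4.1 (compatible; MSIE 6.0; Windows NT 5.0)";
        System.out.println(bestMatch(query, agents, 2, 4));
    }
}
```

With gram sizes 2 to 4, as in my analyzer setup, this ranking picks the
Mozilla/4.0 entry for the query above, which is the behaviour I would like to
get out of the index.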


Thank you very much for a solution or any other approach.

Maciej
