Hello

For a source code (git repo) search engine I chose to use an ngram 
analyzer for substring search (something like "git blame").

This worked fine, except it didn't find some strings. I tracked it down 
to the analyzer: when the ngram analyzer has yielded about 1000 terms it 
stops yielding more, apparently at most (1024 - ngram_length) terms. 
When I use StandardAnalyzer it works as expected.
Is this a bug, or did I miss a limit?

Tested with lucene-2.9.1 and 3.0; this is the core routine I use:

public static class NGramAnalyzer5 extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NGramTokenizer(reader, 5, 5);
    }
}

public static String[] analyzeString(Analyzer analyzer,
            String fieldName, String string) throws IOException {
    List<String> output = new ArrayList<String>();
    TokenStream tokenStream = analyzer.tokenStream(fieldName,
            new StringReader(string));
    TermAttribute termAtt = (TermAttribute)tokenStream.addAttribute(
            TermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        output.add(termAtt.term());
    }
    tokenStream.end();
    tokenStream.close();
    return output.toArray(new String[0]);
}  

The complete example is attached. "in.txt" must be in "." and is plain 
ASCII.

Stefan

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;


public class AnalyzeString {

    public static class NGramAnalyzer5 extends Analyzer {

        public NGramAnalyzer5() {
            super();
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new NGramTokenizer(reader, 4, 4);
        }

    }

    public static String[] analyzeString(Analyzer analyzer, String fieldName, String string) throws IOException {
        List<String> output = new ArrayList<String>();
        TokenStream tokenStream = analyzer.tokenStream(fieldName, new StringReader(string));
        TermAttribute termAtt = (TermAttribute)tokenStream.addAttribute(TermAttribute.class);
        // 3.0.0: TermAttribute termAtt = tokenStream.addAttribute(TermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            output.add(termAtt.term());
        }
        tokenStream.end();
        tokenStream.close();
        return output.toArray(new String[0]);
    }  

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(new File("in.txt")));
        StringBuilder b = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            b.append(line);
        }
        in.close();
        String[] result = analyzeString(new NGramAnalyzer5(), "", b.toString());
        for (String s: result) {
            System.out.println(s);
        }
    }

}

Applications that build their search capabilities upon Lucene may support
documents in various formats (HTML, XML, PDF, Word, just to name a few).
Lucene does not care about the parsing of these and other document formats;
it is the responsibility of the application using Lucene to use an
appropriate parser to convert the original format into plain text before
passing that plain text to Lucene.
Plain text passed to Lucene for indexing goes through a process generally
called tokenization, namely breaking the input text into small indexing
elements (tokens). The way input text is broken into tokens very much
dictates further capabilities of search upon that text. For instance,
sentence beginnings and endings can be identified to provide for more
accurate phrase and proximity searches (though sentence identification is
not provided by Lucene).
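
To make this concrete, here is a small illustration reusing the
analyzeString() helper from the code above; the field name, the Version
constant and the expected-output comments are my own additions, assuming
the 2.9/3.0 API (StandardAnalyzer is in org.apache.lucene.analysis.standard,
Version in org.apache.lucene.util):

String[] words = analyzeString(new StandardAnalyzer(Version.LUCENE_30),
        "content", "Plain text passed to Lucene");
// expected: [plain] [text] [passed] [lucene]  (lowercased, stop word "to" dropped)

String[] grams = analyzeString(new NGramAnalyzer5(), "content", "index");
// with NGramTokenizer(reader, 4, 4) this should yield: [inde] [ndex]
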
Analysis is one of the main causes of performance degradation during indexing.
Simply put, the more you analyze the slower the indexing (in most cases).
Perhaps your application would be just fine using the simple
WhitespaceTokenizer combined with a StopFilter. The contrib/benchmark
library can be useful for testing out the speed of the analysis process.
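
For reference, a minimal sketch of that WhitespaceTokenizer + StopFilter
combination against the 2.9/3.0 API (the class name is mine; the stop set
is the stock English one from StopAnalyzer):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class WhitespaceStopAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only, then drop the stock English stop words.
        TokenStream ts = new WhitespaceTokenizer(reader);
        return new StopFilter(true, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }
}
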
 Applications usually do not invoke analysis; Lucene does it for them (a
rough sketch of both cases follows the list):

    * At indexing, as a consequence of addDocument(doc), the Analyzer in effect 
for indexing is invoked for each indexed field of the added document.
    * At search, as a consequence of QueryParser.parse(queryText), the 
QueryParser may invoke the Analyzer in effect. Note that for some queries 
analysis does not take place, e.g. wildcard queries.
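
A rough sketch of both cases against the 2.9/3.0 API; RAMDirectory, the
"content" field name, the query text and the Version constant are
placeholders, and exception handling (IOException, ParseException) is
omitted:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
Directory dir = new RAMDirectory();

// Indexing: addDocument() runs the writer's analyzer over every ANALYZED field.
IndexWriter writer = new IndexWriter(dir, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("content", "some text goes here",
        Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

// Searching: QueryParser.parse() runs the analyzer over the query text
// (except for e.g. wildcard queries, as noted above).
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("some text");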

However, an application might invoke analysis of any text for testing or for
any other purpose, something like:

      Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
      TokenStream ts = analyzer.tokenStream("myfield",
              new StringReader("some text goes here"));
      // (pre-2.9 API shown; with 2.9/3.0 use StandardAnalyzer(Version) and
      // addAttribute()/incrementToken() as in the analyzeString() method above)
      Token t = ts.next();
      while (t != null) {
        System.out.println("token: " + t);
        t = ts.next();
      }

 Selecting the "correct" analyzer is crucial for search quality,
and can also affect indexing and search performance. The "correct" analyzer
differs between applications. Lucene java's wiki page AnalysisParalysis
provides some data on "analyzing your analyzer". Here are some rules of thumb:

   1. Test test test... (did we say test?)
   2. Beware of over-analysis; it might hurt indexing performance.
   3. Start with the same analyzer for indexing and search; otherwise searches 
will not find what they are supposed to...
   4. In some cases a different analyzer is required for indexing and search, 
for instance:
          * Certain searches require more stop words to be filtered. (I.e. more 
than those that were filtered at indexing.)
          * Query expansion by synonyms, acronyms, auto spell correction, etc.
      This might sometimes require a modified analyzer; see the next section on 
how to do that. (A small sketch of the extra-stop-words case follows this list.)
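
For the extra-stop-words case, a minimal sketch against the 2.9/3.0
StandardAnalyzer API (the extra words and the field name are invented for
illustration; the index side keeps the stock stop set):

// Index with the stock stop set...
Analyzer indexAnalyzer = new StandardAnalyzer(Version.LUCENE_30);

// ...but filter a few additional, domain-specific stop words at search time.
Set<Object> searchStops = new HashSet<Object>(StandardAnalyzer.STOP_WORDS_SET);
searchStops.add("todo");
searchStops.add("fixme");
Analyzer searchAnalyzer = new StandardAnalyzer(Version.LUCENE_30, searchStops);

QueryParser parser = new QueryParser(Version.LUCENE_30, "content", searchAnalyzer);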

Implementing your own Analyzer

Creating your own Analyzer is straightforward. It usually involves either
wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer
or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing
this approach, you may find it worthwhile to explore the contrib/analyzers
library and/or ask on the java-user@lucene.apache.org mailing list first to
see if what you need already exists. If you are still committed to creating
your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a
look at the source code of any one of the many samples located in this package.
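
As an example of the wrapping approach, here is a sketch that extends the
NGramAnalyzer5 idea from the start of this mail with a filter chain (the
class name is mine; lowercasing the grams makes the substring search
case-insensitive):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class LowerCaseNGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // The tokenizer produces fixed-length 4-grams ...
        TokenStream ts = new NGramTokenizer(reader, 4, 4);
        // ... and the filter chain normalizes them to lower case.
        return new LowerCaseFilter(ts);
    }
}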

The following sections discuss some aspects of implementing your own analyzer. 
