Hello,

For a source code (git repo) search engine I chose to use an n-gram analyzer for substring search (something like "git blame").
This worked fine except that it did not find some strings. I tracked the problem down to the analyzer: after yielding about 1000 terms the n-gram analyzer stops producing more, apparently at most (1024 - ngram_length) terms. With StandardAnalyzer it works as expected. Is this a bug or did I miss a limit? Tested with lucene-2.9.1 and 3.0; this is the core routine I use:

    public static class NGramAnalyzer5 extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new NGramTokenizer(reader, 5, 5);
        }
    }

    public static String[] analyzeString(Analyzer analyzer, String fieldName,
            String string) throws IOException {
        List<String> output = new ArrayList<String>();
        TokenStream tokenStream = analyzer.tokenStream(fieldName,
                new StringReader(string));
        TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute(
                TermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            output.add(termAtt.term());
        }
        tokenStream.end();
        tokenStream.close();
        return output.toArray(new String[0]);
    }

The complete example is attached. "in.txt" must be in "." and is plain ASCII.

Stefan
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AnalyzeString {

    public static class NGramAnalyzer5 extends Analyzer {
        public NGramAnalyzer5() {
            super();
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new NGramTokenizer(reader, 4, 4);
        }
    }

    public static String[] analyzeString(Analyzer analyzer, String fieldName,
            String string) throws IOException {
        List<String> output = new ArrayList<String>();
        TokenStream tokenStream = analyzer.tokenStream(fieldName,
                new StringReader(string));
        TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute(TermAttribute.class);
        // 3.0.0: TermAttribute termAtt = tokenStream.addAttribute(TermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            output.add(termAtt.term());
        }
        tokenStream.end();
        tokenStream.close();
        return output.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(new File("in.txt")));
        StringBuilder b = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            b.append(line);
        }
        String[] result = analyzeString(new NGramAnalyzer5(), "", b.toString());
        for (String s : result) {
            System.out.println(s);
        }
    }
}
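To make the cut-off easy to see, a small helper could be added to AnalyzeString that builds the expected n-grams directly from the string and compares the count with what the analyzer returns. This is only a diagnostic sketch; expectedNGrams is my own hypothetical helper, not part of Lucene, and it reuses the List/ArrayList imports already present in the attached file:

    // Sketch: build the n-grams we expect the tokenizer to emit, by hand.
    public static List<String> expectedNGrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Possible usage at the end of main(), to compare counts:
    //   List<String> expected = expectedNGrams(b.toString(), 4);
    //   System.out.println("expected " + expected.size()
    //           + " grams, analyzer returned " + result.length);

With an input longer than about 1024 characters the two counts should differ if the limit described above is real.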
Applications that build their search capabilities upon Lucene may support documents in various formats (HTML, XML, PDF, Word, just to name a few). Lucene does not care about the parsing of these and other document formats; it is the responsibility of the application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene.

Plain text passed to Lucene for indexing goes through a process generally called tokenization, namely breaking the input text into small indexing elements, tokens. The way input text is broken into tokens very much dictates further capabilities of search upon that text. For instance, sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).

Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze the slower the indexing (in most cases). Perhaps your application would be just fine using the simple WhitespaceTokenizer combined with a StopFilter. The contrib/benchmark library can be useful for testing out the speed of the analysis process.

Applications usually do not invoke analysis; Lucene does it for them:

* At indexing, as a consequence of addDocument(doc), the Analyzer in effect for indexing is invoked for each indexed field of the added document.
* At search, as a consequence of QueryParser.parse(queryText), the QueryParser may invoke the Analyzer in effect. Note that for some queries analysis does not take place, e.g. wildcard queries.

However an application might invoke analysis of any text for testing or for any other purpose, something like:

    Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
    Token t = ts.next();
    while (t != null) {
        System.out.println("token: " + t);
        t = ts.next();
    }

Selecting the "correct" analyzer is crucial for search quality, and can also affect indexing and search performance. The "correct" analyzer differs between applications. Lucene java's wiki page AnalysisParalysis provides some data on "analyzing your analyzer". Here are some rules of thumb:

1. Test test test... (did we say test?)
2. Beware of over-analysis; it might hurt indexing performance.
3. Start with the same analyzer for indexing and search, otherwise searches would not find what they are supposed to...
4. In some cases a different analyzer is required for indexing and search, for instance:
   * Certain searches require more stop words to be filtered (i.e. more than those that were filtered at indexing).
   * Query expansion by synonyms, acronyms, auto spell correction, etc. This might sometimes require a modified analyzer; see the next section on how to do that.

Implementing your own Analyzer

Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer, or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter), have a look at the source code of any one of the many samples located in this package.
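The snippet above uses the older Token-based API. Against 2.9.1/3.0 the same loop can be written with the attribute-based API already used earlier in this thread; the sketch below assumes a WhitespaceAnalyzer merely as a stand-in, and "myfield" and the sample text are placeholders:

    Analyzer analyzer = new WhitespaceAnalyzer(); // or any other analyzer
    TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
    // TermAttribute carries the text of the current token (2.9/3.0 API).
    TermAttribute termAtt = (TermAttribute) ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println("token: " + termAtt.term());
    }
    ts.end();
    ts.close();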
The following sections discuss some aspects of implementing your own analyzer.
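As a concrete illustration of the wrapping approach mentioned above, here is a minimal sketch of an Analyzer that chains a WhitespaceTokenizer with a LowerCaseFilter, written against the tokenStream(String, Reader) API used throughout this thread; the class name LowercasingWhitespaceAnalyzer is my own:

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Sketch: an Analyzer built by wrapping an existing Tokenizer and TokenFilter.
    public class LowercasingWhitespaceAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Split on whitespace, then lowercase each token.
            TokenStream stream = new WhitespaceTokenizer(reader);
            stream = new LowerCaseFilter(stream);
            return stream;
        }
    }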