Hi,
I am indexing a document that contains three words.
I then run phrase queries against that document, searching for two words at a
time and for all three words at a time.
I use PorterStemFilter for both indexing and searching, and I am getting very
inconsistent results. Am I doing something incorrectly?
The way I use the Porter stemmer is by overriding the tokenStream() method of
StandardAnalyzer and adding PorterStemFilter to the chain.
If I use a plain StandardAnalyzer everything works fine, so I suspect the
problem is in the way I am creating the analyzer.
I printed the position increments, offsets, etc. for both cases and did not
see any difference.
Below are the tests I am running and the full code.
Tests:
Indexed content: "one two three"       search: "one two"              no documents found
Indexed content: "one two three"       search: "one two three"        no documents found
Indexed content: "first second third"  search: "first second"         one document found
Indexed content: "first second third"  search: "first second third"   one document found
Indexed content: "good bad ugly"       search: "good bad"             one document found
Indexed content: "good bad ugly"       search: "good bad ugly"        no documents found
The full code is below:
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import java.io.Reader;
import java.io.IOException;

public class TestPorterStemmer {

    public static void main(String[] args) throws IOException, ParseException {
        // Index a single document with one analyzed field.
        RAMDirectory index = new RAMDirectory();
        IndexWriter writer = new IndexWriter(index, getAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", "good bad ugly", Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        // Parse the phrase queries with an analyzer built the same way.
        IndexSearcher searcher = new IndexSearcher(index);
        QueryParser parser = new QueryParser("content", getAnalyzer());

        // Two-word phrase.
        Query query = parser.parse("\"" + "good bad" + "\"");
        Hits hits = searcher.search(query);
        System.out.println("searched for " + query.toString() + " matched : "
                + hits.length() + " documents ");

        // Three-word phrase.
        query = parser.parse("\"" + "good bad ugly" + "\"");
        hits = searcher.search(query);
        System.out.println("searched for " + query.toString() + " matched : "
                + hits.length() + " documents ");
    }

    // StandardAnalyzer whose tokenStream() output is wrapped in a PorterStemFilter.
    public static StandardAnalyzer getAnalyzer() {
        return new StandardAnalyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream result = super.tokenStream(fieldName, reader);
                return new PorterStemFilter(result);
            }
        };
    }
}
output:
searched for content:"good bad" matched : 1 documents
searched for content:"good bad ugli" matched : 0 documents
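One possible reading of this output, sketched as a toy model in plain Java rather than Lucene code: the query side is clearly stemmed ("ugly" became "ugli"), and an exact phrase query can only match if the index holds exactly the same terms at consecutive positions. The class and method names below are made up for illustration, and the unstemmed index contents are an unverified assumption, not something the output confirms.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of exact phrase matching; hypothetical names, not part of Lucene.
class PhraseMatchSketch {

    // True if queryTerms appear in indexedTerms as consecutive, equal terms
    // in the same order (the requirement of a phrase query with no slop).
    public static boolean matchesPhrase(List<String> indexedTerms, List<String> queryTerms) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(indexedTerms, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the index ended up with unstemmed terms.
        List<String> indexed = Arrays.asList("good", "bad", "ugly");

        // Query terms as printed in the output: "ugly" was stemmed to "ugli".
        System.out.println(matchesPhrase(indexed, Arrays.asList("good", "bad")));         // true
        System.out.println(matchesPhrase(indexed, Arrays.asList("good", "bad", "ugli"))); // false
    }
}
```

Under that assumption the toy model reproduces the two results above: the phrase matches only while none of its words is changed by the stemmer. Dumping the terms actually stored in the index would confirm or rule this out.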
Any help is greatly appreciated...
Thanks
Preetam