Hello,
I'm having an issue creating a custom analyzer that uses the
WordDelimiterFilter. I'm trying to build an index of information
gleaned from JAR manifest files. So if I have "spring-framework", I need the
following tokens indexed: "spring", "springframework", "framework", and
"spring-framework". My understanding is that the WordDelimiterFilter is
perfect for this. However, when I add the filter to the analyzer,
documents no longer seem to be indexed correctly.
Here is the analyzer:
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.util.Version;

public class FieldAnalyzer extends Analyzer {

    private Version version = null;

    public FieldAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(version, reader);
        TokenStream stream = source;
        stream = new WordDelimiterFilter(stream,
                WordDelimiterFilter.CATENATE_WORDS
                & WordDelimiterFilter.GENERATE_WORD_PARTS
                & WordDelimiterFilter.PRESERVE_ORIGINAL
                & WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
                & WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE, null);
        stream = new LowerCaseFilter(version, stream);
        stream = new StopFilter(version, stream,
                StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, stream);
    }
}
//-------------------------------------------------
Performing a very simple test results in zero documents found:
Analyzer analyzer = new FieldAnalyzer(Version.LUCENE_40);
Directory index = new RAMDirectory();

String text = "spring-framework";
String field = "field";

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
        analyzer);
IndexWriter w = new IndexWriter(index, config);
Document doc = new Document();
doc.add(new TextField(field, text, Field.Store.YES));
w.addDocument(doc);
w.close();

String querystr = "spring-framework";
Query q = new AnalyzingQueryParser(Version.LUCENE_40, field,
        analyzer).parse(querystr);

int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector =
        TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " + d.get(field));
}
Any idea what I've done wrong? If I comment out the line that adds the
WordDelimiterFilter, the search works.
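
One thing that may be worth checking is how the configuration flags passed to
WordDelimiterFilter are combined. The sketch below is plain Java with no
Lucene dependency. The constants are hypothetical stand-ins, assumed to be
distinct single-bit values like the real WordDelimiterFilter flags (the exact
numbers here are an assumption, not copied from the library). It only
illustrates that joining single-bit flags with & gives 0, while | gives a
value with every flag set.

```java
public class FlagCombination {
    // Hypothetical stand-ins for the WordDelimiterFilter flags.
    // Assumption: each real flag occupies its own bit, as these do.
    static final int GENERATE_WORD_PARTS = 1;
    static final int CATENATE_WORDS = 4;
    static final int PRESERVE_ORIGINAL = 32;
    static final int SPLIT_ON_CASE_CHANGE = 64;
    static final int STEM_ENGLISH_POSSESSIVE = 256;

    // Bitwise AND keeps only the bits common to all operands.
    // Distinct single-bit flags share no bits, so the result is 0.
    static int combinedWithAnd() {
        return CATENATE_WORDS
                & GENERATE_WORD_PARTS
                & PRESERVE_ORIGINAL
                & SPLIT_ON_CASE_CHANGE
                & STEM_ENGLISH_POSSESSIVE;
    }

    // Bitwise OR sets every bit that appears in any operand,
    // so each flag remains individually testable in the result.
    static int combinedWithOr() {
        return CATENATE_WORDS
                | GENERATE_WORD_PARTS
                | PRESERVE_ORIGINAL
                | SPLIT_ON_CASE_CHANGE
                | STEM_ENGLISH_POSSESSIVE;
    }

    // True when the given flag's bit is set in the combined value.
    static boolean isSet(int combined, int flag) {
        return (combined & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println("AND: " + combinedWithAnd());
        System.out.println("OR:  " + combinedWithOr());
        System.out.println("PRESERVE_ORIGINAL set with OR: "
                + isSet(combinedWithOr(), PRESERVE_ORIGINAL));
    }
}
```

If the real flags behave like these stand-ins, the analyzer above would be
passing 0 to the filter, meaning no options are enabled, which could explain
why nothing useful ends up in the index. I have not confirmed this against the
Lucene source.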
Thanks in advance,
Jeremy