I thought I could use the KeywordTokenizer to prevent tokenizing on spaces,
so that I can treat some fields as a single term. But the text is still being
split on spaces.
In the code below, I'm storing a document with a serial number containing
spaces. I want to treat it as a single term without requiring end users
to make it a phrase query by surrounding it with double quotes. But it
doesn't work as I thought it would. Is there something I need to be doing
differently? Shouldn't the keyword tokenizer treat the entire text as one
token?
------------
This is the custom analyzer class I use.
    private static class LowerCaseKeywordAnalyzer extends Analyzer
    {
        @Override
        protected TokenStreamComponents createComponents(String theFieldName,
                                                         Reader theReader)
        {
            Tokenizer theTokenizer = new KeywordTokenizer(theReader);
            TokenStream theTokenStream =
                new LowerCaseFilter(Version.LUCENE_46, theTokenizer);
            TokenStreamComponents theTokenStreamComponents =
                new TokenStreamComponents(theTokenizer, theTokenStream);
            return theTokenStreamComponents;
        }
    }
The code using the analyzer
    Version theVersion = Version.LUCENE_46;
    Directory theIndex = new RAMDirectory();
    Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();
    IndexWriterConfig theConfig =
        new IndexWriterConfig(theVersion, theAnalyzer);
    IndexWriter theWriter = new IndexWriter(theIndex, theConfig);

    Document theDocument = new Document();
    FieldType theFieldType = new FieldType();
    theFieldType.setStored(true);
    theFieldType.setIndexed(true);
    theFieldType.setTokenized(false);
    theDocument.add(new Field("sn", "1023 4567 8765", theFieldType));
    theWriter.addDocument(theDocument);
    theWriter.close();

    String[] theQueryStrings = new String[]
    {
        "\"1023 4567 8765\"",
        "1023 4567 8765"
    };
    QueryParser theParser = new QueryParser(theVersion, "sn", theAnalyzer);
    IndexReader theIndexReader = DirectoryReader.open(theIndex);
    IndexSearcher theSearcher = new IndexSearcher(theIndexReader);

    for (int i = 0; i < theQueryStrings.length; i++) {
        String currQueryStr = theQueryStrings[i];
        Query currQuery = theParser.parse("sn:" + currQueryStr);
        System.out.println(currQuery.getClass() + ", " + currQuery);

        TopScoreDocCollector currCollector =
            TopScoreDocCollector.create(10, true);
        theSearcher.search(currQuery, currCollector);
        ScoreDoc[] currHits = currCollector.topDocs().scoreDocs;

        String msg = "Number of results found for '" + currQueryStr +
            "': " + currHits.length;
        System.out.println(msg);
    }
The output
class org.apache.lucene.search.TermQuery, sn:1023 4567 8765
Number of results found for '"1023 4567 8765"': 1
class org.apache.lucene.search.BooleanQuery, sn:1023 sn:4567 sn:8765
Number of results found for '1023 4567 8765': 0
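One workaround that might apply here, sketched under two assumptions: that the
classic QueryParser splits the query text on whitespace before the analyzer
ever sees it (which would explain the BooleanQuery above), and that it treats a
backslash-escaped space as part of a single term. The helper below is plain
Java and not part of Lucene; the names WhitespaceEscaper and escapeWhitespace
are hypothetical.

```java
// Hypothetical helper, not a Lucene API: prefixes every whitespace
// character with a backslash so that a query parser honoring backslash
// escapes can keep the whole value together as one term.
// Assumes the input is raw text that has not been escaped already.
final class WhitespaceEscaper {

    private WhitespaceEscaper() {
        // static utility, no instances
    }

    static String escapeWhitespace(String raw) {
        StringBuilder escaped = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (Character.isWhitespace(c)) {
                escaped.append('\\'); // escape the whitespace character
            }
            escaped.append(c);
        }
        return escaped.toString();
    }
}
```

With the parser from the code above, this would be used as
theParser.parse("sn:" + WhitespaceEscaper.escapeWhitespace("1023 4567 8765")).
Whether that produces a single TermQuery is an assumption that has not been
verified against the output shown here.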
--
Regards
Milind