Hi
I am trying to understand why I cannot retrieve documents that I indexed with
a custom shingle analyzer. My setup is as follows.
During indexing I create the writer like this:
    PerFieldAnalyzerWrapper wrapper =
        DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper(Stopwords);
    writer = new IndexWriter(_lucenedir,
        new IndexWriterConfig(Version.LUCENE_32, wrapper));
where DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper returns a
PerFieldAnalyzerWrapper configured as follows:
    public static PerFieldAnalyzerWrapper getDocFieldAnalyzerWrapper(HashSet<String> Stopwords) {
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
        wrapper.addAnalyzer("title", new KeywordAnalyzer());
        wrapper.addAnalyzer("titleSynonyms", new KeywordAnalyzer());
        wrapper.addAnalyzer("date", new KeywordAnalyzer());
        wrapper.addAnalyzer("about", new KeywordAnalyzer());
        wrapper.addAnalyzer("titleAnalyzed",
            new StandardAnalyzer(Version.LUCENE_32, Stopwords));
        wrapper.addAnalyzer("content",
            new LimitTokenCountAnalyzer(
                new StandardAnalyzer(Version.LUCENE_32, Stopwords),
                Integer.MAX_VALUE));
        wrapper.addAnalyzer("contentForSpelling",
            new ShinglesAnalyzer(2, Stopwords));
        return wrapper;
    }
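where the custom ShinglesAnalyzer is defined as follows:
    public class ShinglesAnalyzer extends Analyzer {
        private HashSet<String> Stopwords;
        private Integer shingleSize;

        // Constructor matching the "new ShinglesAnalyzer(2, Stopwords)" calls.
        public ShinglesAnalyzer(int shingleSize, HashSet<String> Stopwords) {
            this.shingleSize = shingleSize;
            this.Stopwords = Stopwords;
        }

        // Chain: tokenize -> StandardFilter -> lowercase -> remove stopwords -> shingle.
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream filter = new ShingleFilter(
                new StopFilter(Version.LUCENE_32,
                    new LowerCaseFilter(Version.LUCENE_32,
                        new StandardFilter(Version.LUCENE_32,
                            new StandardTokenizer(Version.LUCENE_32, reader))),
                    Stopwords),
                shingleSize);
            return filter;
        }
    }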
I then add the fields as follows. Note that every field is set to ANALYZED;
the fields that should not be tokenized are mapped to KeywordAnalyzer in the
wrapper.
    doc.add(new Field("title", title,
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("titleAnalyzed", title,
        Field.Store.YES, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
    doc.add(new Field("titleSynonyms", pageSynonmy.toString(),
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("about", article.getAbout().toString(),
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("date", article.getDateCreated(),
        Field.Store.NO, Field.Index.ANALYZED));
    String content = article.getCleanContent();
    Field contentField = new Field("content", content,
        Field.Store.NO, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS);
    doc.add(contentField);
    Field contentSpellingField = new Field("contentForSpelling", content,
        Field.Store.YES, Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS);
    doc.add(contentSpellingField);
Looking at the index in Luke, the field "contentForSpelling" contains both
unigrams and bigrams (the shingle size is set to 2).
Then during search time given a query q, which is a sentence provided by the
user, I do the following:
    // ShinglesAnalyzer extends Analyzer, so it is declared as Analyzer here.
    Analyzer analyzer = new ShinglesAnalyzer(2, Stopwords);
    QueryParser parser = new QueryParser(Version.LUCENE_32,
        "contentForSpelling", analyzer);
    Query query = parser.parse(q);
    TopDocs hits = searcher.search(query);
This is what the analyzer produces for a sample query:
query: $13 for any of season package at Dallas
ShinglesAnalyzer:
1: [13:1->3:<NUM>] [13 _:1->15:shingle]
2: [_ season:15->21:shingle]
3: [season:15->21:<ALPHANUM>] [season package:15->29:shingle]
4: [package:22->29:<ALPHANUM>] [package _:22->33:shingle]
5: [_ dallas:33->39:shingle]
6: [dallas:33->39:<ALPHANUM>]
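To make the token listing above easier to follow, here is a minimal plain-Java sketch that reproduces the same unigram and bigram terms without using Lucene at all. It is a simplified model, not Lucene's ShingleFilter: the names ShingleSketch, slots and shingles are hypothetical, the stopword set (for, any, of, at) is an assumption inferred from which words are missing in the listing, and collapsing each run of removed stopwords into a single "_" filler is a simplification that happens to match the listing for shingle size 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ShingleSketch {
    static final String FILLER = "_";

    // Lowercase, split on non-alphanumerics, and replace each run of
    // stopwords between two kept tokens with a single filler slot.
    static List<String> slots(String text, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields "" when the text starts with a delimiter
            }
            if (!stopwords.contains(raw)) {
                out.add(raw);
            } else if (!out.isEmpty() && !out.get(out.size() - 1).equals(FILLER)) {
                out.add(FILLER);
            }
        }
        // Assumption of this sketch: a trailing gap produces no filler.
        if (!out.isEmpty() && out.get(out.size() - 1).equals(FILLER)) {
            out.remove(out.size() - 1);
        }
        return out;
    }

    // Emit each real token as a unigram, followed by the bigram that starts at it.
    static List<String> shingles(String text, Set<String> stopwords) {
        List<String> s = slots(text, stopwords);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.size(); i++) {
            if (!s.get(i).equals(FILLER)) {
                out.add(s.get(i));
            }
            if (i + 1 < s.size()) {
                out.add(s.get(i) + " " + s.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stopword set, inferred from the analyzer output.
        Set<String> stop = new HashSet<>(Arrays.asList("for", "any", "of", "at"));
        for (String term : shingles("$13 for any of season package at Dallas", stop)) {
            System.out.println(term);
        }
    }
}
```

Running main prints the nine terms in the same order as the listing, from "13" and "13 _" through "_ dallas" and "dallas".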
However, when I print the parsed query (query.toString()), it looks like this:
analyzed query: contentForSpelling:13 contentForSpelling:season
contentForSpelling:package contentForSpelling:dallas
The parsed query looks wrong to me: it contains only the unigrams and none of
the shingles that the analyzer produces.
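A possible cause worth checking, stated as an assumption rather than a confirmed diagnosis: the classic QueryParser splits unquoted query text on whitespace and passes each chunk to the analyzer separately, so a shingle filter never sees two adjacent words. The plain-Java sketch below models that behaviour without Lucene. The names WhitespaceSplitSketch, analyze and analyzePerChunk are hypothetical, the stopword set is inferred from the output above, and the shingling is a simplified stand-in for ShingleFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WhitespaceSplitSketch {
    static final String FILLER = "_";

    // Simplified analyzer: lowercase, drop stopwords (one filler per gap),
    // then emit unigrams plus bigrams.
    static List<String> analyze(String text, Set<String> stopwords) {
        List<String> slots = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            if (!stopwords.contains(raw)) {
                slots.add(raw);
            } else if (!slots.isEmpty() && !slots.get(slots.size() - 1).equals(FILLER)) {
                slots.add(FILLER);
            }
        }
        if (!slots.isEmpty() && slots.get(slots.size() - 1).equals(FILLER)) {
            slots.remove(slots.size() - 1);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < slots.size(); i++) {
            if (!slots.get(i).equals(FILLER)) {
                out.add(slots.get(i));
            }
            if (i + 1 < slots.size()) {
                out.add(slots.get(i) + " " + slots.get(i + 1));
            }
        }
        return out;
    }

    // Models a parser that analyzes every whitespace-separated chunk on its own.
    static List<String> analyzePerChunk(String query, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk, stopwords));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("for", "any", "of", "at"));
        String q = "$13 for any of season package at Dallas";
        System.out.println("whole text: " + analyze(q, stop));
        System.out.println("per chunk:  " + analyzePerChunk(q, stop));
    }
}
```

In this model the per-chunk result is only 13, season, package and dallas, which matches the four terms in the printed query, while analyzing the whole text at once still yields the bigrams.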
Thank you,
Peyman