ShinglesAnalyzer Question

Peyman Faratin Sun, 09 Oct 2011 09:12:40 -0700

Hi

I am trying to understand why I cannot retrieve documents that I indexed with 
a custom ShinglesAnalyzer. The setup is as follows:



During indexing I do the following:

        PerFieldAnalyzerWrapper wrapper =
                DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper(Stopwords);
        writer = new IndexWriter(_lucenedir,
                new IndexWriterConfig(Version.LUCENE_32, wrapper));

where DocFieldAnalyzerWrapper.getDocFieldAnalyzerWrapper returns a PerFieldAnalyzerWrapper instance:

        public static PerFieldAnalyzerWrapper getDocFieldAnalyzerWrapper(HashSet<String> Stopwords) {
                // Default analyzer: KeywordAnalyzer (the whole field value becomes one token)
                PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
                wrapper.addAnalyzer("title", new KeywordAnalyzer());
                wrapper.addAnalyzer("titleSynonyms", new KeywordAnalyzer());
                wrapper.addAnalyzer("date", new KeywordAnalyzer());
                wrapper.addAnalyzer("about", new KeywordAnalyzer());

                wrapper.addAnalyzer("titleAnalyzed",
                        new StandardAnalyzer(Version.LUCENE_32, Stopwords));
                wrapper.addAnalyzer("content",
                        new LimitTokenCountAnalyzer(
                                new StandardAnalyzer(Version.LUCENE_32, Stopwords),
                                Integer.MAX_VALUE));
                wrapper.addAnalyzer("contentForSpelling",
                        new ShinglesAnalyzer(2, Stopwords));
                return wrapper;
        }

where the custom ShinglesAnalyzer is defined as follows: 

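        public class ShinglesAnalyzer extends Analyzer {
                private final HashSet<String> Stopwords;
                private final int shingleSize;

                // Constructor inferred from the calls new ShinglesAnalyzer(2, Stopwords);
                // it was not shown in the original snippet.
                public ShinglesAnalyzer(int shingleSize, HashSet<String> Stopwords) {
                        this.shingleSize = shingleSize;
                        this.Stopwords = Stopwords;
                }

                @Override
                public TokenStream tokenStream(String fieldName, Reader reader) {
                        // tokenize -> StandardFilter -> lowercase -> drop stop words -> shingle
                        TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);
                        stream = new StandardFilter(Version.LUCENE_32, stream);
                        stream = new LowerCaseFilter(Version.LUCENE_32, stream);
                        stream = new StopFilter(Version.LUCENE_32, stream, Stopwords);
                        return new ShingleFilter(stream, shingleSize);
                }
        }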

I then index as follows (note that all fields are set to ANALYZED; the fields 
that should not be tokenized are mapped to KeywordAnalyzer instead):

        doc.add(new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("titleAnalyzed", title,
                Field.Store.YES, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("titleSynonyms", pageSynonmy.toString(),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("about", article.getAbout().toString(),
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("date", article.getDateCreated(),
                Field.Store.NO, Field.Index.ANALYZED));

        String content = article.getCleanContent();
        Field contentField = new Field("content", content,
                Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS);
        doc.add(contentField);

        Field contentSpellingField = new Field("contentForSpelling", content,
                Field.Store.YES, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS);
        doc.add(contentSpellingField);

Looking at the index with Luke, the field "contentForSpelling" contains both 
unigrams and bigrams (the shingle size is set to 2). 

At search time, given a query q (a sentence provided by the user), I do the 
following:

        // ShinglesAnalyzer extends Analyzer, so it is declared as an Analyzer here
        Analyzer analyzer = new ShinglesAnalyzer(2, Stopwords);
        QueryParser parser = new QueryParser(Version.LUCENE_32,
                "contentForSpelling", analyzer);
        Query query = parser.parse(q);
        // search(Query, int) needs a hit count in Lucene 3.x; 10 is an arbitrary choice
        TopDocs hits = searcher.search(query, 10);


This is the analyzer output for a sample query:

query: $13 for any of season package at Dallas

ShinglesAnalyzer:
    
1: [13:1->3:<NUM>] [13 _:1->15:shingle] 
2: [_ season:15->21:shingle] 
3: [season:15->21:<ALPHANUM>] [season package:15->29:shingle] 
4: [package:22->29:<ALPHANUM>] [package _:22->33:shingle] 
5: [_ dallas:33->39:shingle] 
6: [dallas:33->39:<ALPHANUM>] 
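
To make the listing above easier to read, here is a minimal plain-Java sketch (no Lucene dependency) of how such a unigram-plus-bigram stream can arise. It is a simplification and not Lucene's ShingleFilter: each run of removed stop words is collapsed into a single "_" filler, and the class name ShingleSketch and the stop-word set {for, any, of, at} are assumptions inferred from the tokens missing in the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene or of the indexing code above.
public class ShingleSketch {
    static final String FILLER = "_";

    // Lowercase, split on non-alphanumerics, and replace each run of
    // stop words with a single filler token.
    public static List<String> tokenize(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty()) {
                continue; // leading delimiter such as '$' yields an empty chunk
            }
            if (stopwords.contains(raw)) {
                boolean lastIsFiller = !tokens.isEmpty()
                        && tokens.get(tokens.size() - 1).equals(FILLER);
                if (!lastIsFiller) {
                    tokens.add(FILLER);
                }
            } else {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Emit every real token as a unigram, plus a bigram for each adjacent pair.
    public static List<String> shingles(String text, Set<String> stopwords) {
        List<String> tokens = tokenize(text, stopwords);
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            if (!current.equals(FILLER)) {
                out.add(current);
            }
            if (i + 1 < tokens.size()) {
                out.add(current + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stop words, inferred from the listing above.
        Set<String> stop = new HashSet<String>(Arrays.asList("for", "any", "of", "at"));
        for (String term : shingles("$13 for any of season package at Dallas", stop)) {
            System.out.println(term);
        }
    }
}
```

For the sample query this yields the same terms as the listing (13, "13 _", "_ season", season, "season package", and so on), which is what I would expect the parsed query to contain as well.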

However, when I print the parsed query (query.toString()), it looks like this:

analyzed query: contentForSpelling:13 contentForSpelling:season 
contentForSpelling:package contentForSpelling:dallas

The query looks wrong to me: it contains only the unigrams and none of the 
shingles that the analyzer produces for the same text.

Thank you,

Peyman
