Hi all,
I am quite new to the Lucene world and recently started using its Python
wrapper (PyLucene) in my project.
So far, I have been using token-based querying, which works fine.
Now, however, I want to change the querying approach as follows:
- given a query string,
- extract all its terms (n-grams, n >= 2), and
- for each term, search the index and return the (top k) documents that
contain that specific term.
Let's say the input text is: "search for this !"
I want to search for each of these sub-strings separately:
"search for", "for this", "this !", "search for this", "for this !"
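To make the desired extraction concrete, here is a plain-Python sketch of it, independent of Lucene (the function name is just for illustration):

```python
def extract_ngrams(text, n_min=2, n_max=4):
    """Return all whitespace-token n-grams of length n_min..n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# extract_ngrams("search for this !") yields, among others,
# "search for", "for this", "this !", "search for this", "for this !"
```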
I tried ShingleAnalyzerWrapper, but I couldn't get the desired output. You
can find my code at the end.
I added line 14 to the code to avoid searching for unigrams, which are of
no interest to me.
It is worth mentioning that for creating both the indexer and the searcher
I used the following analyzer settings:
analyzer = ShingleAnalyzerWrapper(WhitespaceAnalyzer(), 2, 4, ' ', True,
False, None)
The point is that if I remove line 14, it does return some documents that
contain the words of the given n-gram, but those words do not necessarily
form a single phrase. For example, if the n-gram is *"search for this"*, it
might return a document like *"please search for that and this"*.
That means it only matched the unigrams, not the whole string.
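To illustrate what I suspect is happening, here is a rough plain-Python imitation of the shingle output (assuming the fifth constructor argument I passed as True is outputUnigrams): with unigrams enabled, the token stream contains the single words alongside the shingles, so a term-level match can hit documents that only share individual words.

```python
def shingle_terms(text, n_min=2, n_max=4, output_unigrams=True):
    """Rough imitation of shingle output: optional unigrams plus n-grams."""
    tokens = text.split()
    # With output_unigrams=True, every single token is also emitted as a term.
    terms = list(tokens) if output_unigrams else []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms

# With output_unigrams=True the stream contains "search", "for", "this", "!"
# as separate terms, which is enough for a document like
# "please search for that and this" to match on the unigrams alone.
```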
Any idea about this issue?
Thanks,
Amin
1 def query(self, queryString):
2     vec = {}
3     idx = 0
4     ret_documents = {}
5     analyzer = ShingleAnalyzerWrapper(WhitespaceAnalyzer(), 2, 4, ' ',
                                        True, False, None)
6     ts = analyzer.tokenStream("source", StringReader(queryString))
7     termAtt = ts.addAttribute(CharTermAttribute.class_)
8     ts.reset()
9
10    all_grams = []
11
12    while ts.incrementToken():
13        ngram = termAtt.toString()
14        if len(ngram.split()) > 1:
15            all_grams.append(ngram)
16    ts.close()
17
18    for ngram in all_grams:
19        query = BooleanQuery.Builder()
20        query.add(TermQuery(Term("source", ngram)),
                  BooleanClause.Occur.MUST)
21        scoreDocs = self.searcher.search(query.build(),
                                           self.max_retriever).scoreDocs
22
23        for scoreDoc in scoreDocs:
24            doc = self.searcher.doc(scoreDoc.doc)
25            ret_documents[idx] = [doc.get("id"), scoreDoc.score,
                                   doc.get("source")]
26            idx += 1
27
28    print("RET_DOCS: ", ret_documents)