Re: Beyond Lucene 2.0 Index Design

Ming Lei Wed, 10 Jan 2007 15:36:07 -0800

The ideas of "impact" and an "impact-sorted posting list"
should work in practice with the boolean model, by
approximation, in the following way:


(1) Index Structure
Inverted-Index : <term, posting-list>*
posting-list: <impact, docID, occurrence*>+   (sorted by impact)
occurrence: position
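
A minimal sketch of the structure above, in Python for brevity. The
names Posting and build_index are hypothetical, and the impact formula
(term frequency divided by document length) is only an assumed
stand-in, since the definition of impact is left open here. Lists are
kept with the highest impact first so that a threshold scan can stop
early.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    impact: float   # precomputed contribution of this term to this doc
    doc_id: int
    positions: list = field(default_factory=list)  # occurrence* = token positions


def build_index(docs):
    """Build {term: [Posting, ...]} from {doc_id: text}, sorted by impact."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        positions = defaultdict(list)
        for pos, tok in enumerate(tokens):
            positions[tok].append(pos)
        for term, pos_list in positions.items():
            # Assumption: impact = term frequency / document length.
            impact = len(pos_list) / len(tokens)
            index[term].append(Posting(impact, doc_id, pos_list))
    for postings in index.values():
        # Highest impact first; docID breaks ties deterministically.
        postings.sort(key=lambda p: (-p.impact, p.doc_id))
    return dict(index)


docs = {1: "a b a", 2: "a c c c", 3: "b b"}
index = build_index(docs)
```

Here index["a"] lists doc 1 before doc 2, because "a" makes up a larger
share of doc 1.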

(2) Retrieval Algorithm for boolean query "a AND b"

set an impact threshold imp = imp0

get the set of docs (each with its impact) from a's posting
list whose impact is larger than imp.
get the set of docs from b's posting list in the same way.
join the two sets by docID, aggregate the impacts of each
doc that appears in both sets, and sort the result set
by impact.
(for a union query, union the two sets and aggregate
impacts by docID.)

if the result set is too small,
  set a lower imp and **redo** the above.
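
The steps above could be sketched roughly as follows, again in Python.
This is an illustration under assumptions, not Lucene code: docs_above
and search are hypothetical names, posting lists are reduced to
(impact, docID) pairs with positive integer impacts sorted highest
first, "aggregate" is taken to mean summing, and the threshold is
lowered by a fixed factor (decay) until it falls under a floor, at
which point one last pass runs with no threshold.

```python
def docs_above(postings, imp):
    """Scan an impact-sorted list; stop at the first impact not above imp."""
    found = {}
    for impact, doc_id in postings:
        if impact <= imp:
            break  # everything after this has an equal or lower impact
        found[doc_id] = impact
    return found


def search(index, a, b, imp0, min_results, conjunctive=True,
           decay=0.5, floor=0.01):
    """Return [(docID, summed impact), ...] for "a AND b" or "a OR b"."""
    imp = imp0
    while True:
        set_a = docs_above(index.get(a, []), imp)
        set_b = docs_above(index.get(b, []), imp)
        if conjunctive:
            ids = set_a.keys() & set_b.keys()   # join by docID
        else:
            ids = set_a.keys() | set_b.keys()   # union by docID
        # Assumption: aggregate impacts by summing them.
        scored = {d: set_a.get(d, 0) + set_b.get(d, 0) for d in ids}
        ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
        if len(ranked) >= min_results or imp <= 0:
            return ranked
        # Result set too small: lower the threshold and redo.
        imp = imp * decay if imp * decay >= floor else 0


index = {
    "a": [(9, 1), (5, 2), (2, 3)],
    "b": [(8, 3), (6, 1), (1, 2)],
}
top1 = search(index, "a", "b", imp0=5, min_results=1)
top2 = search(index, "a", "b", imp0=5, min_results=2)
either = search(index, "a", "b", imp0=5, min_results=1, conjunctive=False)
```

With min_results=1 the first pass is enough. With min_results=2 the
threshold has to drop twice before doc 3 clears it in both lists, which
is the "redo" case.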

Note: If your major concern is relevance rather than
recall (e.g., on the web), you will seldom hit the
"redo" for conjunctive boolean queries.

But Lucene is often used with a small corpus and
semi-structured data.


Michael


 