Re: [jira] Created: (LUCENE-851) Pruning

Ning Li Thu, 29 Mar 2007 18:45:15 -0800

It will be great to support early termination for top-K queries within
the DAAT query processing model in Lucene. There is quite some work
published in related areas.
http://portal.acm.org/citation.cfm?id=956944 is one of them.


Am I getting it right? If a query requires top-K results, isn't it
sufficient to find top-K results in each segment and merge them to
return the overall top-K results? (This is the most straightforward
way. Many optimizations are possible). Early termination happens in
finding top-K results in one segment. Assuming each document has a
static score, document ids are assigned in the same order of their
static scores within a segment. If a top-K query is scored by the same
static score, query processing on a segment can stop as soon as the
first K results are found. If the score function is not just the
static score but still highly related to it (and is monotonic...),
query processing can stop as soon as the upper bound of the scores for
the unprocessed documents are lower than the lowest score in the
current top-K results. This also reveals the weakness of this scheme:
the static score should be significant enough in the score function
used by a query to enable significant early termination. This is true
in many specialized indexes.

As to the indexing side, applications should be able to pick such a
static score? If Lucene score function is used, norm is a good
candidate? (One tricky thing with norm is that it is updatable.)
Documents have to be sorted and id-ed when first level segments are
built. Merge process which builds higher level segments is still
simple enough since each segment is already sorted by static score. So
impact to factored merge policy is minimal.

Cheers,
Ning

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: [jira] Created: (LUCENE-851) Pruning

Reply via email to