It will be great to support early termination for top-K queries within the DAAT query processing model in Lucene. There is quite some work published in related areas. http://portal.acm.org/citation.cfm?id=956944 is one of them.
Am I getting it right? If a query requires top-K results, isn't it sufficient to find top-K results in each segment and merge them to return the overall top-K results? (This is the most straightforward way. Many optimizations are possible). Early termination happens in finding top-K results in one segment. Assuming each document has a static score, document ids are assigned in the same order of their static scores within a segment. If a top-K query is scored by the same static score, query processing on a segment can stop as soon as the first K results are found. If the score function is not just the static score but still highly related to it (and is monotonic...), query processing can stop as soon as the upper bound of the scores for the unprocessed documents are lower than the lowest score in the current top-K results. This also reveals the weakness of this scheme: the static score should be significant enough in the score function used by a query to enable significant early termination. This is true in many specialized indexes. As to the indexing side, applications should be able to pick such a static score? If Lucene score function is used, norm is a good candidate? (One tricky thing with norm is that it is updatable.) Documents have to be sorted and id-ed when first level segments are built. Merge process which builds higher level segments is still simple enough since each segment is already sorted by static score. So impact to factored merge policy is minimal. Cheers, Ning --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]