I want to propose an optimization to early termination that gives nice
speedups for large result sets when searching with multiple threads at the
cost of a small (controllable) probability of collecting documents out of
order: in benchmarks I see +60-70% QPS for tasks like HighTermDayOfYearSort
when topN=500, using 8 threads to search, and in our production system this
optimization cut the latency of our slowest queries substantially.

In a multi-phase ranking scenario a typical pattern is to retrieve a
largish number (N) of matches in a first pass using the index sort,
followed by a second pass that re-ranks and selects a smaller top K using
a more expensive ranking. N is chosen to provide sufficient probability of
finding the desired top K across the whole index, given that the index
sort is some approximation to the desired sort. When ranking by the index
sort, as in TopFieldCollector, we can now early-terminate once a
sufficient number of matches has been found, so that we only need to
retrieve N documents from each segment. In single-threaded mode we can
check against minCompetitiveScore and terminate collection for each
segment appropriately, but when using multiple threads to search
concurrently there is no such coordination and we end up collecting N
documents *per segment*, which are then merged down to N.
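
This is essentially what happens when searching with an executor today:
each slice fills its own top-N collector and the per-slice results are
merged afterwards. A minimal sketch of that merge step, just for
illustration (the class and method names here are mine, not Lucene's):

    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopFieldDocs;

    // Each element of perSliceHits holds up to topN hits collected
    // independently by one search slice; without cross-thread coordination
    // every slice does the full top-N work, and the merge then discards
    // most of it.
    final class SliceMergeSketch {
      static TopFieldDocs mergeSliceHits(Sort sort, int topN,
          TopFieldDocs[] perSliceHits) {
        return TopDocs.merge(sort, topN, perSliceHits);
      }
    }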

We do not need to collect so many documents, though. For any given
segment, let p = (leaf.maxDoc / topLevel.maxDoc) be the proportion of
documents in that segment. Assuming that documents are distributed
randomly among segments, we can expect that on average we will find p*N of
the top N documents in that segment. If we collect only p*N documents per
segment, we will sometimes miss documents we should have collected: we
will keep some less-competitive documents from one segment while failing
to collect all of the competitive documents from another. But how many
documents should we collect per segment in order to make this happen only
very rarely?
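
As a quick illustration (my numbers, just for the arithmetic): a segment
holding 2M of an index's 10M documents has p = 0.2, so for N = 500 we
expect about p*N = 100 of the top hits to land in that segment.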

The worst case is that all top N documents occur in a single segment. Even
for small values of N and small numbers of segments S this probability is
vanishingly small: with S equal-sized segments it is S*(1/S)^N, so for
N=10, S=10 it is 10^(1-N) = 1/10^9. More generally, the distribution of
the top documents among segments is a multinomial distribution, and the
number of top documents falling in any single segment follows a binomial
distribution. The binomial variance in this case (p = probability of a
document being in the segment, N = number of documents) is p*(1-p)*N; we
can use this to compute the number of documents to collect per leaf in
order to bound the probability of a ranking error. I'm using a cutoff of 3
standard deviations, i.e. collecting p*N + 3*sqrt(p*(1-p)*N) documents for
each segment. For N=500, p=0.2, that works out to about 127 documents
instead of 500, at the cost of an error that occurs less than about 3
times in 1000.
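
In code the cutoff might look something like the following sketch, under
the assumptions above (the class and method names are mine, not an
existing Lucene API):

    // Sketch of the proposed per-leaf cutoff: the expected share of the
    // top N for this leaf, plus three standard deviations of the binomial
    // count.
    final class ProRatedCutoff {
      static int leafHitsThreshold(int leafMaxDoc, int topLevelMaxDoc, int topN) {
        double p = (double) leafMaxDoc / topLevelMaxDoc; // share of docs in this leaf
        double mean = p * topN;                          // expected top hits here
        double stdDev = Math.sqrt(p * (1 - p) * topN);   // binomial standard deviation
        return (int) Math.ceil(mean + 3 * stdDev);       // 3-sigma cutoff
      }
    }
    // e.g. ProRatedCutoff.leafHitsThreshold(2_000_000, 10_000_000, 500):
    // p = 0.2, mean = 100, stdDev ~ 8.9 -> 127 hits instead of 500.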

Also note that the kinds of errors we make are typically benign. In most
cases we will still return the correct top N-1 documents, but instead of
returning the Nth-ranked document in position N, we return the (N+1)st.

Implementing this in Lucene requires a small patch to TopFieldCollector to
introduce a leafHitsThreshold comparable to the existing
totalHitsThreshold. Given the possibility of error, it might be good to
have a way to disable this, but my inclination would be to enable it
whenever approximate counts are enabled (i.e. by default), and to disable
it when totalHitsThreshold is MAX_VALUE.
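
For illustration, here is a rough sketch of what the per-leaf check might
look like, written as a wrapper around an existing LeafCollector. The
actual patch would put an equivalent check inside TopFieldCollector's own
leaf collectors; the class and field names here are hypothetical:

    import java.io.IOException;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.FilterLeafCollector;
    import org.apache.lucene.search.LeafCollector;

    // Hypothetical wrapper: stop collecting a segment once it has
    // contributed its pro-rated share of hits (leafHitsThreshold).
    final class LeafHitsThresholdCollector extends FilterLeafCollector {
      private final int leafHitsThreshold;
      private int leafHits;

      LeafHitsThresholdCollector(LeafCollector in, int leafHitsThreshold) {
        super(in);
        this.leafHitsThreshold = leafHitsThreshold;
      }

      @Override
      public void collect(int doc) throws IOException {
        in.collect(doc);
        if (++leafHits >= leafHitsThreshold) {
          // The same mechanism existing early termination uses to stop
          // collecting a segment.
          throw new CollectionTerminatedException();
        }
      }
    }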

What do you think? Shall I open an issue?
