I want to propose an optimization to early termination that gives nice speedups for large result sets when searching with multiple threads, at the cost of a small (controllable) probability of collecting documents out of order. In benchmarks I see +60-70% QPS for tasks like HighTermDayOfYearSort when topN=500, using 8 threads to search, and in our production system this optimization cut the latency of our slowest queries substantially.
In a multi-phase ranking scenario, a typical pattern is to retrieve a largish number (N) of matches in a first pass using the index sort, followed by a second pass that re-ranks and selects a smaller top K using a more expensive ranking. N is chosen to provide a sufficient probability of finding the desired top K across the whole index, given that the index sort is some approximation to the desired sort. When ranking by index sort, as in TopFieldCollector, we can now early-terminate once a sufficient number of matches have been found, so that we only need to retrieve N documents from each segment. In single-threaded mode we can check against minCompetitiveScore and terminate collection for each segment appropriately, but when using multiple threads to search concurrently there is no such coordination and we end up collecting N documents *per segment*, which are then merged down to N.

We do not need to collect so many documents, though. For any given segment, let p = (leaf.maxDoc / topLevel.maxDoc) be the proportion of documents in that segment. Assuming that documents are distributed randomly among segments, we can expect to find, on average, p*N of the top N documents in that segment. If we collect only p*N documents per segment, we will sometimes miss documents that we should have collected: we will collect some less-competitive documents from one segment while failing to collect all the competitive documents from another. So how many should we collect in order to make this happen only very rarely?

The worst case is that all top N documents occur in a single segment. Even for small values of N and small numbers of segments S, this probability is vanishingly small: with N=10 and S=10 equal-sized segments it is S*(1/S)^N = 10^(1-N) = 1/10^9. More generally, the distribution of the top N documents among segments is a multinomial distribution, and the number falling in any single segment follows a binomial distribution. The binomial variance in this case (p = probability of a document being in the segment, N = number of documents) is p*(1-p)*N; we can use this to compute the number of documents to collect per leaf in order to bound the probability of a ranking error. I'm using a cutoff of 3 standard deviations, i.e. collecting p*N + 3*(p*(1-p)*N)^(1/2) documents for each segment (see the sketch at the end of this message). For N=500 and p=0.2 this works out to about 127 documents per segment instead of 500, at the cost of an error that occurs less than 3 times in 1000. Also note that the kind of error we make is typically benign: in most cases we will return the correct top N-1 documents, but instead of returning the Nth-ranked document in position N, we return the (N+1)st.

Implementing this in Lucene requires a small patch to TopFieldCollector to introduce a leafHitsThreshold comparable to the existing totalHitsThreshold. Given the possibility of error, it might be good to have a way to disable this, but my inclination would be to enable it whenever approximate counts are enabled (i.e. by default), and to disable it when totalHitsThreshold is MAX_VALUE. What do you think? Shall I open an issue?
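To make the arithmetic concrete, here is a minimal Java sketch of how the per-leaf bound could be computed; the method name and signature are hypothetical, for illustration only, and are not the actual patch to TopFieldCollector:

    // Sketch: number of hits to collect from one leaf so that, with roughly
    // 3-sigma confidence, we do not miss any of the top N documents that
    // happen to live in that leaf. Assumes documents are distributed
    // randomly among segments, as described above.
    static int perLeafHitsThreshold(int leafMaxDoc, int topLevelMaxDoc, int topN) {
      double p = (double) leafMaxDoc / topLevelMaxDoc;  // proportion of docs in this segment
      double expected = p * topN;                       // mean of the binomial distribution
      double stdDev = Math.sqrt(p * (1 - p) * topN);    // sqrt of binomial variance p*(1-p)*N
      return (int) Math.ceil(expected + 3 * stdDev);    // mean + 3 standard deviations
    }

For example, with p=0.2 and topN=500 this returns ceil(100 + 3*sqrt(80)) = 127, the per-segment figure used above.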