Something makes me curious: queries that can leverage sorted indices should be _very_ fast, for instance in your case they only need to look at 500 documents per segment at most (less in practice since we stop collecting as soon as a non-competitive hit is found), so why do you need to parallelize query execution?
On Fri, Feb 1, 2019 at 3:18 PM Michael Sokolov <msoko...@gmail.com> wrote: > > I want to propose an optimization to early termination that gives nice > speedups for large result sets when searching with multiple threads at the > cost of a small (controllable) probability of collecting documents out of > order: in benchmarks I see +60-70% QPS for tasks like HighTermDayOfYearSort > when topN=500, using 8 threads to search, and in our production system this > optimization cut the latency of our slowest queries substantially. > > In a multi-phase ranking scenario a typical pattern is to retrieve a > largish number of matches in a first pass using indexed sort, followed by a > second pass that re-ranks and selects a smaller top K, using a more > expensive ranking. N is chosen to provide sufficient probabililty of > finding the desired top K across the whole index, given that the index sort > is some approximation to the desired sort. When ranking by indexed sort, as > in TopFieldCollector, we can now early-terminate when a sufficient number > of matches have been found so that we only need retrieve N documents from > each segment. In single-threaded mode we can check against > minCompetitiveScore and terminate collection for each segment > appropriately, but when using multiple threads to search concurrently there > is no such coordination and we end up collecting N documents *per segment*, > which are then merged down to N. > > We do not need to collect so many documents though. For any given segment, > let p=(leaf.maxDoc/topLevel.maxDoc) be the proportion of documents in that > segment. Assuming that documents are distributed randomly among segments, > we can expect that on average we will find p*N of the top N documents in > the given segment. If we only collect p*N documents, we will sometimes miss > some documents that we should have collected, collecting some > less-competitive documents from one segment, while not collecting all the > competitive documents from another. But how many should we collect in order > to make this occur only very rarely? > > The worst case is that all top N documents occur in a single segment. For > even small values of N and small numbers of segments S, this probability is > vanishingly small (N=10, S=10) -> 10^(1-N) = 1/10^9. More generally, this > distribution of documents among segments is a multinomial distribution, and > the variance of the number of documents in a single segment is that of a > binomial distribution. The binomial variance in this case (p=probability of > document in the segment, N number of documents) is p*(1-p)*N; we can use > this to compute the number of documents to collect per leaf in order to > bound the probability of a ranking error. I'm using a cutoff of 3 standard > deviations, i.e. collecting p*N + 3*(p*(1-p)*N)^1/2 documents for each > segment. For N=500, p=0.2, we can collect 67 documents instead of 500 at > the cost of an error that occurs < 3/1000. > > Also note that the kind of errors we make are typically benign. In most > cases we will return the correct top N-1 documents, but instead of > returning the Nth-ranked document in position N, we return the N+1st. > > Implementing this in Lucene requires a small patch to TopFieldCollector to > introduce a leafHitsThreshold comparable to the existing > totalHitsThreshold. Given the possibility of error, it might be good to > have a way to disable this, but my inclination would be to enable it > whenever approximate counts are enabled (ie by default), and disable when > totalHitsThreshold is MAX_VALUE. > > What do you think? Shall I open an issue? -- Adrien --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org