I think this is because our per-hit cost is sometimes very high -- we have "post filters" that are sometimes very restrictive. We are working to get those post-filters out into an inverted index to make them more efficient, but net/net reducing how many hits we must collect for each segment can help latencies and throughput.
Mike McCandless
http://blog.mikemccandless.com

On Fri, Feb 1, 2019 at 11:42 AM Michael Sokolov <msoko...@gmail.com> wrote:

> That's a good question. I don't have a very satisfying answer other than
> to say we saw some improvements, and I would have to dig more now to say
> why. It may be that in our system we have some additional per-document
> costs in a custom collector that were saved by this reduction, and that's
> why we saw a latency reduction. However, I also note these luceneutil
> benchmark results:
>
> ** This is with -topN=500 and a searcher threadpool=8
> Report after iter 19:
>                  Task    QPS baseline   StdDev    QPS candidate   StdDev    Pct diff
> HighTermDayOfYearSort          391.23   (2.4%)           627.92   (4.8%)    60.5% (52% - 69%)
>
> ** This is with -topN=500 and no searcher threadpool
> Report after iter 19:
>                  Task    QPS baseline   StdDev    QPS candidate   StdDev    Pct diff
> HighTermDayOfYearSort          239.52   (3.3%)           232.70   (3.0%)    -2.8% (-8% - 3%)
>
> These show a QPS improvement from using threads even in the baseline case
> (and then an additional improvement from prorating).
>
> On Fri, Feb 1, 2019 at 11:28 AM Adrien Grand <jpou...@gmail.com> wrote:
>
> > Something makes me curious: queries that can leverage sorted indices
> > should be _very_ fast. For instance, in your case they only need to
> > look at 500 documents per segment at most (fewer in practice, since we
> > stop collecting as soon as a non-competitive hit is found), so why do
> > you need to parallelize query execution?
> >
> > On Fri, Feb 1, 2019 at 3:18 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > > I want to propose an optimization to early termination that gives
> > > nice speedups for large result sets when searching with multiple
> > > threads, at the cost of a small (controllable) probability of
> > > collecting documents out of order: in benchmarks I see +60-70% QPS
> > > for tasks like HighTermDayOfYearSort when topN=500, using 8 threads
> > > to search, and in our production system this optimization cut the
> > > latency of our slowest queries substantially.
> > >
> > > In a multi-phase ranking scenario, a typical pattern is to retrieve
> > > a largish number N of matches in a first pass using the indexed
> > > sort, followed by a second pass that re-ranks and selects a smaller
> > > top K using a more expensive ranking. N is chosen to provide
> > > sufficient probability of finding the desired top K across the
> > > whole index, given that the index sort is some approximation of the
> > > desired sort. When ranking by indexed sort, as in TopFieldCollector,
> > > we can now early-terminate when a sufficient number of matches has
> > > been found, so that we only need to retrieve N documents from each
> > > segment. In single-threaded mode we can check against
> > > minCompetitiveScore and terminate collection for each segment
> > > appropriately, but when using multiple threads to search
> > > concurrently there is no such coordination, and we end up
> > > collecting N documents *per segment*, which are then merged down
> > > to N.
> > >
> > > We do not need to collect so many documents, though. For any given
> > > segment, let p = (leaf.maxDoc / topLevel.maxDoc) be the proportion
> > > of documents in that segment. Assuming that documents are
> > > distributed randomly among segments, we can expect that on average
> > > we will find p*N of the top N documents in the given segment.
> > > If we only collect p*N documents per segment, we will sometimes
> > > miss some documents that we should have collected, collecting some
> > > less-competitive documents from one segment while not collecting
> > > all the competitive documents from another. But how many should we
> > > collect in order to make this happen only very rarely?
> > >
> > > The worst case is that all top N documents occur in a single
> > > segment. Even for small values of N and small numbers of segments
> > > S, this probability is vanishingly small: it is S^(1-N), which for
> > > N=10, S=10 is 1/10^9. More generally, the distribution of the top
> > > documents among segments is a multinomial distribution, and the
> > > variance of the number of top documents in a single segment is
> > > that of a binomial distribution. The binomial variance in this
> > > case (p = probability of a document falling in the segment,
> > > N = number of documents) is p*(1-p)*N; we can use this to compute
> > > the number of documents to collect per leaf in order to bound the
> > > probability of a ranking error. I'm using a cutoff of 3 standard
> > > deviations, i.e. collecting p*N + 3*(p*(1-p)*N)^(1/2) documents
> > > for each segment. For N=500 and p=0.2 this works out to about 127
> > > documents instead of 500, at the cost of an error that occurs
> > > fewer than 3 times in 1000.
> > >
> > > Also note that the kinds of errors we make are typically benign:
> > > in most cases we will return the correct top N-1 documents, but
> > > instead of returning the Nth-ranked document in position N, we
> > > return the (N+1)st.
> > >
> > > Implementing this in Lucene requires a small patch to
> > > TopFieldCollector to introduce a leafHitsThreshold comparable to
> > > the existing totalHitsThreshold. Given the possibility of error,
> > > it might be good to have a way to disable this, but my inclination
> > > would be to enable it whenever approximate counts are enabled
> > > (i.e. by default), and disable it when totalHitsThreshold is
> > > MAX_VALUE.
> > >
> > > What do you think? Shall I open an issue?
> >
> > --
> > Adrien
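
For concreteness, here is a minimal sketch of the prorated per-leaf
threshold computation described above. This is untested and not from the
actual patch -- the class name, method name, and the maxDoc numbers in
the example are made up for illustration:

  // Sketch only: how many hits to collect in one segment so that, with
  // high probability, it contributes its full share of the global top N.
  // Hypothetical helper, not existing Lucene API.
  class ProratedEarlyTermination {

    static int leafHitsThreshold(int leafMaxDoc, int topLevelMaxDoc, int topN) {
      double p = (double) leafMaxDoc / topLevelMaxDoc; // segment's share of docs
      double mean = p * topN;                          // mean of Binomial(topN, p)
      double stdDev = Math.sqrt(p * (1 - p) * topN);   // binomial standard deviation
      // Collect mean + 3 standard deviations, so the chance that this
      // segment holds more competitive hits than we collect is < 3/1000.
      return (int) Math.ceil(mean + 3 * stdDev);
    }

    public static void main(String[] args) {
      // The example from the proposal: N=500, p=0.2 -> ~127 docs, not 500.
      System.out.println(leafHitsThreshold(200_000, 1_000_000, 500)); // prints 127
    }
  }

In TopFieldCollector terms, the idea would be for each leaf collector to
stop collecting once it has gathered that many hits rather than the full
topN, when the query sort matches the index sort.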