> Also, I really do not like the large perf hits at highish topN sizes.
> Admittedly it's unusual (though I suspect not THAT rare) to use such
> a large topN, but, I especially don't like that it doesn't degrade
> gracefully - it's surprising. Likewise I'd like to see with highish
> number of segments how things degrade.
Well, the degradation should not be surprising - it's coded into the subleading term of the complexity. Redoing the complexity argument carefully for the average case, probabilistically: the flop count for a single PQ, taking the top T docs out of S segments with H hits in each, is^(1)

  O( S*H + T*log(T)*log(S*H) )

(note that it is a *sum* of the two terms), while inserting into S different queues is^(2)

  O( S*H + T*log(T)*log(H)*S )

This means that multi-PQ's subleading term is a factor of S*log(H) / log(S*H) worse, as a function of T.

To see whether this is "bad", we have to consider the full range of S = numSegments, H = numHitsPerSegment, and T = topNumHitsToReturn. If you fix S and H (which also fixes the dominating factor as a constant), then as T increases, as the tests show, multi-PQ diverges, getting worse as T goes up.

On the other hand, if you fix T and instead let S and H vary, then the flop-count ratio is the following:

  flops_singlePQ     S*H + T*log(T)*log(S*H)
  --------------  =  ------------------------
  flops_multiPQ      S*H + T*log(T)*log(H)*S

                     1 + T*log(T)*log(S*H)/(S*H)
                  =  ---------------------------
                     1 + T*log(T)*log(H)/H

Under the (for us, always true) assumption that S*H is the dominating factor, and using 1/(1+x) =~ 1-x for small x, we have:

  ( 1 + T*log(T)*log(S*H)/(S*H) ) * ( 1 - T*log(T)*log(H)/H )

Using (1+x)*(1-y) =~ 1+x-y for small x, y, and pulling out the common factor of T*log(T)/H, we have:

  1 - ( T*log(T)/H ) * ( log(H) - log(S*H)/S )

What's the end result? If you fix the queue size (T), then as the total number of hits and the number of hits per segment grow, the percentage performance difference should go to *zero*, proportionally to 1/H = 1/numHitsPerSegment. This, to me, is pretty important to note, and should be pretty easy to see in a test run: doing it on a run with 5 million documents hit by the query should work out much better for multiPQ than singlePQ, whereas *decreasing* the hit count, to say 10,000 hits, should penalize multiPQ versus singlePQ.

The average user, I would imagine, doesn't have *tons* of variation in queue size or number of segments (meaning I doubt either often grows above 100, and very rarely up to 1000), but the total number of hits is extremely variable, and as even recent discussions on java-user show, people even look at multi-TB indexes, so performance as a function of index size is probably the most important.

I can run these numbers on my mac desktop, because this analysis is independent of hardware or even java version, since it's a scaling argument, but we should really see what the results are like on reasonable hardware for both the small hit-count case (where the performance might be really bad for multiPQ, bad enough to rule it out altogether) and the very high hit-count case (where the difference should be much more in favor of multiPQ, and could in fact rule out singlePQ if the constant factors (like the cost of the extra string compares that singlePQ has to do) become more important than the complexity, because the asymptotics are the same in this regime).
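To make the two counts above concrete, here's a minimal sketch of the two strategies, on bare score arrays with java.util.PriorityQueue. The class and method names are made up for illustration, and this is nothing like the actual collector code under discussion (which compares sort fields, not raw scores) - it's just the shape of the two flop counts:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class PQStrategies {

  /** Single PQ: every hit from every segment competes in one bounded queue. */
  static List<Float> singlePQ(float[][] segmentHits, int t) {
    PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap: root = weakest of the top T
    for (float[] segment : segmentHits) {            // S segments
      for (float score : segment) {                  // H hits each -> S*H total
        if (pq.size() < t) {
          pq.add(score);
        } else if (score > pq.peek()) {              // beats the queue bottom: T/X chance, per (1)
          pq.poll();
          pq.add(score);
        }
      }
    }
    List<Float> top = new ArrayList<>(pq);
    top.sort(Collections.reverseOrder());
    return top;
  }

  /** Multi PQ: a private bounded queue per segment, merged at the end. */
  static List<Float> multiPQ(float[][] segmentHits, int t) {
    List<Float> all = new ArrayList<>();
    for (float[] segment : segmentHits) {
      PriorityQueue<Float> pq = new PriorityQueue<>();
      for (float score : segment) {                  // only H hits compete here, so the
        if (pq.size() < t) {                         // bottom is weaker and insertions
          pq.add(score);                             // happen more often: T*log(H)
        } else if (score > pq.peek()) {              // insertions per segment, S times over
          pq.poll();
          pq.add(score);
        }
      }
      all.addAll(pq);
    }
    all.sort(Collections.reverseOrder());            // stand-in for the real merge, see (2)
    return all.subList(0, Math.min(t, all.size()));
  }
}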
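And a throwaway check of the formulas - this just plugs numbers into the two flop counts above and prints the ratio as H grows. The natural logs, the unit constant factors, and the particular S=20, T=1000 are all assumptions of the model, not measurements:

public class RatioCheck {
  public static void main(String[] args) {
    int s = 20, t = 1000;   // S and T held fixed (both values picked arbitrarily)
    for (int h = 10_000; h <= 10_000_000; h *= 10) {
      double sub = t * Math.log(t);                          // T*log(T)
      double single = s * (double) h + sub * Math.log(s * (double) h);
      double multi  = s * (double) h + sub * Math.log(h) * s;
      System.out.printf("S=%d H=%,d T=%d  singlePQ/multiPQ = %.3f%n",
          s, h, t, single / multi);
    }
  }
}

On these model numbers, multiPQ does roughly 5x the flops at H = 10,000 but is within about 1% of singlePQ at H = 10,000,000 - exactly the 1/H closing of the gap argued above.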
-jake

------
(1) To see this scaling: if you have a queue of size T and you have currently looked at X hits, then the probability, in the average case, of the next hit beating the bottom element of the queue is T/X. So the total number of insertions into the queue you have to do over a set of H hits is sum_X=1...H [ T/X ]. Replacing the sum with an integral, it's just T*log(H). The cost of an insertion into a PQ of size T is log(T), giving an insertion-based flop count of log(T)*T*log(H) while walking H hits, and log(T)*T*log(S*H) while walking S*H hits.

(2) We're neglecting the final merge for the multiPQ case, because it's sub-sub-leading in complexity: merging S sorted lists of T elements (to get the top T) is O(T*log(S)) on average, which is smaller than the next bigger term in this counting by at least a multiplicative factor of log(T)*log(H).
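For what it's worth, the merge that (2) neglects is just the standard S-way merge. A minimal sketch (hypothetical SegmentMerge/mergeTopT names, operating on bare score arrays rather than real queue entries) showing where the O(T*log(S)) comes from:

import java.util.Comparator;
import java.util.PriorityQueue;

public class SegmentMerge {
  /** Global top T from S per-segment score arrays, each sorted descending. */
  static float[] mergeTopT(float[][] segs, int t) {
    // One cursor per segment: {segment index, offset into that segment}.
    // The heap never holds more than S cursors, ordered so the largest
    // current score is on top, so each poll/re-add is O(log S).
    PriorityQueue<int[]> heads = new PriorityQueue<>(
        Comparator.comparingDouble((int[] c) -> -segs[c[0]][c[1]]));
    for (int s = 0; s < segs.length; s++) {
      if (segs[s].length > 0) heads.add(new int[] {s, 0});
    }
    float[] top = new float[Math.min(t, totalLen(segs))];
    for (int n = 0; n < top.length; n++) {             // T polls * O(log S) each
      int[] c = heads.poll();
      top[n] = segs[c[0]][c[1]];
      if (++c[1] < segs[c[0]].length) heads.add(c);    // advance this segment's cursor
    }
    return top;
  }

  private static int totalLen(float[][] segs) {
    int n = 0;
    for (float[] s : segs) n += s.length;
    return n;
  }
}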