On Wed, Nov 9, 2011 at 1:52 PM, Grant Ingersoll <[email protected]> wrote: > Notice the code in Lucene. It is solving the exact same problem. Return the > top X things by score. Our ScoreDoc is made of a score and an id.
I'm sure it does -- but I thought the question was why this code was the way it is, and why you couldn't just use the queue as-is? I am sure the Lucene equivalent must do something similar. You can't solve this with a max-heap, if that's what you mean. > Yeah, I know. But why all the shuffling around? We have all we need on the > PQ already. Yes, but not in order! What are you suggesting would work, that is simpler? > Multiplied by millions or requests, it adds up. I think Recommenders, > especially, are to the point where it's time to start tightening the screws > on performance. I've been reviewing this stuff quite a bit lately and am > struck by how similar this stuff is to Lucene's core scoring, albeit from a > few years ago (pre Lucene 2.4, by my guess). For instance, our "IDRescorer", > is modeled via a Collector class and bit set filters and it's integrated > directly into the scoring, whereas Mahout sorts after scoring. This is nowhere near the bottleneck, I assure you, but invite you to profile it to have a look. This piece of code has been profiled and scrubbed many times over. I don't understand the comment about sorting after scoring -- you can't, in this context.
