On Nov 8, 2011, at 11:24 PM, Sean Owen wrote: > The PriorityQueue there is a min heap. It's used to keep finding the > smallest among a collection of large values. So, no it can't be used > directly to create a list of items from large to small as it has a > "get smallest" method, not "get largest".
Notice the code in Lucene. It is solving the exact same problem. Return the top X things by score. Our ScoreDoc is made of a score and an id. > > The result of the method is a list of IDs, not a list of item-score > pairs. This is the reason for the last step where it's converted into > just the IDs. Yeah, I know. But why all the shuffling around? We have all we need on the PQ already. > > The size of queue here is typically really small, like 10 items. The > queue implementation probably makes no measurable difference at that > size. Multiplied by millions or requests, it adds up. I think Recommenders, especially, are to the point where it's time to start tightening the screws on performance. I've been reviewing this stuff quite a bit lately and am struck by how similar this stuff is to Lucene's core scoring, albeit from a few years ago (pre Lucene 2.4, by my guess). For instance, our "IDRescorer", is modeled via a Collector class and bit set filters and it's integrated directly into the scoring, whereas Mahout sorts after scoring. > > On Wed, Nov 9, 2011 at 12:46 AM, Grant Ingersoll <[email protected]> wrote: >> I've been reading code and am wondering about TopItems.getTopUsers >> >> Here's my pseudocode of it (lines 96-134 in TopItems) >> Get a Priority Queue (PQ) >> for all users >> estimate the similarity of user i with our current user (the one we >> are generating a rec for) >> put a SimilarUser object on the PQ. >> >> //HERE >> Create a list of SimilarUser objects >> Populate that list with the PQ objects, which also contains SimilarUser >> objects >> Sort that list >> Iterate over that list once more and grab the ids and put it into an array >> //HERE >> return the array >> >> Why all that in between moving stuff marked by //HERE? Isn't that >> traversing the same list 3 times? Can't we just pop from the priority queue >> in reverse order and fill the array from back to front? Am I missing >> something? >> >> Here's the same code in Lucene: >> <snip> >> protected void populateResults(ScoreDoc[] results, int howMany) { >> for (int i = howMany - 1; i >= 0; i--) { >> results[i] = pq.pop(); >> } >> } >> </snip> >> >> Also, FWIW, Lucene's PQ implementation is faster than Java's, from what I >> understand (which is why we wrote our own). >> >> -Grant -------------------------------------------- Grant Ingersoll http://www.lucidimagination.com
