Re: TopItems.getTopUsers()

Grant Ingersoll Wed, 09 Nov 2011 05:52:55 -0800

On Nov 8, 2011, at 11:24 PM, Sean Owen wrote:

> The PriorityQueue there is a min heap. It's used to keep finding the
> smallest among a collection of large values. So, no it can't be used
> directly to create a list of items from large to small as it has a
> "get smallest" method, not "get largest".

Notice the code in Lucene.  It is solving the exact same problem.  Return the 
top X things by score.  Our ScoreDoc is made of a score and an id.

> 
> The result of the method is a list of IDs, not a list of item-score
> pairs. This is the reason for the last step where it's converted into
> just the IDs.

Yeah, I know.  But why all the shuffling around?  We have all we need on the PQ 
already.

> 
> The size of queue here is typically really small, like 10 items. The
> queue implementation probably makes no measurable difference at that
> size.

Multiplied by millions or requests, it adds up.  I think Recommenders, 
especially, are to the point where it's time to start tightening the screws on 
performance.   I've been reviewing this stuff quite a bit lately and am struck 
by how similar this stuff is to Lucene's core scoring, albeit from a few years 
ago (pre Lucene 2.4, by my guess).  For instance, our "IDRescorer", is modeled 
via a Collector class and bit set filters and it's integrated directly into the 
scoring, whereas Mahout sorts after scoring.

> 
> On Wed, Nov 9, 2011 at 12:46 AM, Grant Ingersoll <[email protected]> wrote:
>> I've been reading code and am wondering about TopItems.getTopUsers
>> 
>> Here's my pseudocode of it (lines 96-134 in TopItems)
>> Get a Priority Queue (PQ)
>> for all users
>>        estimate the similarity of user i with our current user (the one we 
>> are generating a rec for)
>>        put a SimilarUser object on the PQ.
>> 
>> //HERE
>> Create a list of SimilarUser objects
>> Populate that list with the PQ objects, which also contains SimilarUser 
>> objects
>> Sort that list
>> Iterate over that list once more and grab the ids and put it into an array
>> //HERE
>> return the array
>> 
>> Why all that in between moving stuff marked by //HERE?  Isn't that 
>> traversing the same list 3 times?  Can't we just pop from the priority queue 
>> in reverse order and fill the array from back to front?  Am I missing 
>> something?
>> 
>> Here's the same code in Lucene:
>> <snip>
>> protected void populateResults(ScoreDoc[] results, int howMany) {
>>    for (int i = howMany - 1; i >= 0; i--) {
>>      results[i] = pq.pop();
>>    }
>>  }
>> </snip>
>> 
>> Also, FWIW, Lucene's PQ implementation is faster than Java's, from what I 
>> understand (which is why we wrote our own).
>> 
>> -Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

Re: TopItems.getTopUsers()

Reply via email to