I'm going ahead and implementing something like this as a simple but real distributed implementation.
One question that maybe Ted already knows the answer to: how would I iterate over A_u* for each A_u*? The mapper would touch only one each. I guess I could use a whole MR step to generate the cartesian product, so to speak.

On Thu, Sep 10, 2009 at 1:44 AM, Ted Dunning <[email protected]> wrote:
> Another implementation is to iterate over users, emitting cooccurrence
> pairs. These can be sorted on the items involved and the counts
> aggregated. This is ideal for a map-reduce implementation.
>
> Roughly, the matrix approach becomes this:
>
> Map:
>   input record is A_u*, all items for a particular user u
>   for each item i in A_u*
>     yield ((i, *), 1)
>     for each item j in A_u*
>       yield ((i, j), 1)
>
> Reduce:
>   input record is (i, j), (k1, k2, ..., kn)
>   yield (i, j), sum k*
>
> or
>
>   input record is (i, *), (k1, k2, ..., kn)
>   yield (i, *), sum k*
>
> This computes the elements of A'A (item cooccurrence counts) and the row
> sums of A'A in one step. You might want to compute the column sums of A
> first and use these for weighting A before using this program.
>
> Then you need another MR step to bring together the overall total, the row
> counts, and the cell counts for the final reduction.
>
> This is inside out compared with the most obvious implementation, and there
> is no place where we really have two entire item vectors. This will make it
> rather harder to have a totally simple pluggable weighting scheme that runs
> as fast.
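For my own reference, here is a minimal local sketch of the map/reduce steps Ted describes, in plain Python rather than Hadoop. The `mapper`, `reducer`, and `run_job` names are mine, and the in-memory "shuffle" just stands in for the sort-and-group a real MR framework would do; it is only meant to check the counting logic, not to be the actual distributed job.

```python
from collections import defaultdict

def mapper(user_items):
    """One input record is A_u*: the list of items for a single user u."""
    for i in user_items:
        # Row-sum marker for item i, keyed (i, '*').
        yield (i, "*"), 1
        for j in user_items:
            # Cooccurrence cell of A'A, keyed (i, j).
            yield (i, j), 1

def reducer(key, counts):
    """Sum the 1s emitted for each (i, j) or (i, '*') key."""
    return key, sum(counts)

def run_job(user_rows):
    # Stand-in shuffle: group mapper output by key, then reduce each group.
    grouped = defaultdict(list)
    for row in user_rows:
        for key, value in mapper(row):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

# Two toy users: one interacted with items a and b, the other with a and c.
counts = run_job([["a", "b"], ["a", "c"]])
# counts[("a", "a")] == 2  (diagonal of A'A)
# counts[("a", "b")] == 1  (a and b cooccur for one user)
# counts[("a", "*")] == 2  (row sum marker for item a)
```

Note that the diagonal entries (i, i) come out equal to the (i, \*) row-sum markers here because the ratings are binary; with weighted entries the two diverge, which is presumably why Ted emits both.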
