Another implementation is to iterate over users, emitting cooccurrence
pairs. These can be sorted on the items involved and the counts
aggregated. This is ideal for a map-reduce implementation.
Roughly, the matrix approach becomes this:
Map:
  input record is A_u*, all items for a particular user u
  for each item i in A_u*
    yield ((i,*), 1)
    for each item j in A_u*
      yield ((i,j), 1)
Reduce:
  input record is (i,j), (k1, k2, ..., kn)
  yield (i,j), sum k*
or
  input record is (i,*), (k1, k2, ..., kn)
  yield (i,*), sum k*
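
For concreteness, a minimal Hadoop sketch of that map/reduce pair might look
like the following. It assumes the input is one text line per user listing
that user's item ids separated by whitespace, and it encodes the (i,j) and
(i,*) keys as plain text; the class names are made up for illustration and
are not existing Mahout code.

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Illustrative only -- not Mahout code.  Input: one line per user, e.g. "i3 i7 i9".
  public class CooccurrenceCounts {

    private static final IntWritable ONE = new IntWritable(1);

    public static class CooccurrenceMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] items = line.toString().trim().split("\\s+");  // A_u*, the items for user u
        if (items.length == 1 && items[0].length() == 0) {
          return;                                               // skip blank lines
        }
        for (String i : items) {
          context.write(new Text(i + ",*"), ONE);               // emitted once per user holding item i
          for (String j : items) {
            context.write(new Text(i + "," + j), ONE);          // one (i,j) cooccurrence for this user
          }
        }
      }
    }

    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;                                            // sum k* for an (i,j) or (i,*) key
        for (IntWritable k : counts) {
          sum += k.get();
        }
        context.write(pair, new IntWritable(sum));
      }
    }
  }

Because the reduce is a plain sum, the same reducer class can also serve as a
combiner to cut down the intermediate data.
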
This computes the elements of A'A (the item cooccurrence counts) and the row
sums of A'A in one step. You might want to compute the column sums of A
first and use those for weighting A before running this program.
Then you need another MR step to bring together the overall total, the row
counts, and the cell counts for the final reduction.
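
The column-sum pass mentioned above can be sketched in the same style, under
the same one-line-per-user input assumption; for binary data it amounts to a
word count over item ids and can reuse ONE and SumReducer from the sketch
above.

  // Illustrative column-sum pass (again, not Mahout code): for a binary A this
  // counts, for each item, how many users have it.  Add to the class above.
  public static class ColumnSumMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String item : line.toString().trim().split("\\s+")) {
        if (item.length() > 0) {
          context.write(new Text(item), ONE);   // one user occurrence of this item column
        }
      }
    }
  }
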
This is inside out compared with the most obvious implementation, and there
is no point at which we ever hold two entire item vectors. That will make it
rather harder to have a totally simple pluggable weighting scheme that runs
as fast.
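
To make that last point concrete, here is a purely hypothetical pair of
interfaces (neither is an existing Taste or Mahout API): the first is what a
vector-at-a-time implementation could let you plug in, the second is all the
information the inside-out job has left for a weighting scheme at the final
reduction.

  // Hypothetical interfaces, only to illustrate the tradeoff -- not Taste/Mahout APIs.

  // Easy to plug into an implementation that holds two complete item vectors:
  interface ItemVectorSimilarity {
    double similarity(double[] itemVectorI, double[] itemVectorJ);
  }

  // The inside-out job never holds two whole item vectors; a weighting scheme
  // there only sees the aggregated counts brought together in the final reduction:
  interface CountBasedWeighting {
    double weight(long cellCountIJ, long rowCountI, long rowCountJ, long overallTotal);
  }
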
On Wed, Sep 9, 2009 at 3:42 PM, Gökhan Çapan <[email protected]> wrote:
> 3. After the change:
> a. Matrix approach becomes:
>
> userSet=Intersect ( nonZeroElements(A' i), nonZeroElements(A' j) );
> weight:=0;
> for each user u in userSet
> weight += rating(u,i)*rating(u,j);
>
>
> b: Taste's implementation becomes:
>
> userSet=Intersect ( nonZeroElements(A' i), nonZeroElements(A' j) );
> weight:=0;
> for each user u in userSet
> updateWeightWithChosenSimilarityMetric( );
>
>
> I think 3rd version is already benefiting from sparsity of data; obviously,
> as the data set gets sparser and/or # of users increases, that trick will
> speed up the algorithm more. So, doesn't it mean 3rd is an algorithm that
> uses sparse data structures that you mentioned?
>
--
Ted Dunning, CTO
DeepDyve