Yes, this makes sense. I do need two passes. The first pass converts the input from "user,item,rating" triples into user vectors. The second pass then builds the co-occurrence product A'A. I agree that it will be faster to take a shortcut than to compute A'A properly.
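To make sure I have the two passes straight, here is a minimal sketch of how I read them, in plain Python rather than as a real map/reduce job. The function names and the toy data are made up for illustration, and a sparse vector is just a dict. If I have it right, a user with k rated items contributes at most k*k entries to the sum, so each outer product is only as dense as the user vector it comes from.

```python
from collections import defaultdict

def user_vectors(triples):
    """Pass 1: group (user, item, rating) triples into one sparse vector per user."""
    vectors = defaultdict(dict)
    for user, item, rating in triples:
        vectors[user][item] = rating
    return dict(vectors)

def cooccurrence(vectors):
    """Pass 2: build A'A as the sum over users of the outer product v v'."""
    ata = defaultdict(lambda: defaultdict(float))
    for v in vectors.values():
        # Only the items this user touched contribute, so the work per user
        # is quadratic in the number of that user's items, not in all items.
        for i, ri in v.items():
            for j, rj in v.items():
                ata[i][j] += ri * rj
    return {i: dict(row) for i, row in ata.items()}

# Hypothetical toy input: two users, three items.
triples = [
    ("u1", "a", 1.0), ("u1", "b", 1.0),
    ("u2", "b", 1.0), ("u2", "c", 1.0),
]
ata = cooccurrence(user_vectors(triples))
```

In this toy case items "a" and "c" never share a user, so the "a" row has no entry for "c" at all.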
(Though I'm curious how this works -- this outer product approach looks deceptively easy. Isn't v cross v potentially huge? Or is it likely to be sparse enough not to matter?)

I understand the final step in principle, which is to compute (A'A)h. But I keep guessing that A'A is too big to fit in memory. So I can side-load the rows of A'A one at a time and compute the product rather manually.

On Thu, Dec 3, 2009 at 8:28 PM, Ted Dunning <[email protected]> wrote:
> I think you can merge my passes into a single pass in which you compute the
> row and column sums at the same time that you compute the product. That is
> more complicated, though, and I hate fancy code. So you are right in
> practice that I have always had two passes. (although pig might be clever
> enough by now to merge them)
>
> There is another pass in which you use all of the sums to do the
> sparsification. I don't know if that could be done in the same pass or not.
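
Coming back to the final step, (A'A)h: a rough sketch of what I mean by side-loading the rows one at a time. It assumes the rows of A'A arrive as (item, sparse row) pairs from some iterator and that h itself is small enough to hold in memory. Where the rows are actually stored is left open, and the names here are hypothetical.

```python
def times_vector(rows, h):
    """Compute (A'A)h from an iterable of (item, sparse_row) pairs.

    Only one row of A'A is held in memory at a time; h is a sparse dict.
    """
    result = {}
    for item, row in rows:
        # Entry `item` of the product is the dot product of that row with h.
        dot = sum(value * h.get(j, 0.0) for j, value in row.items())
        if dot != 0.0:
            result[item] = dot
    return result

# Hypothetical rows of a small co-occurrence matrix, streamed one by one.
rows = iter([
    ("a", {"a": 1.0, "b": 1.0}),
    ("b", {"a": 1.0, "b": 2.0, "c": 1.0}),
    ("c", {"b": 1.0, "c": 1.0}),
])
h = {"a": 1.0}  # a user who has only seen item "a"
scores = times_vector(rows, h)
```

Each output entry needs only its own row plus h, so the rows can be read in any order and dropped as soon as they are used.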
