On Fri, Dec 4, 2009 at 12:33 AM, Sean Owen <[email protected]> wrote:
> Yes, this makes sense. I do need two passes. One pass converts input
> from "user,item,rating" triples into user vectors. Then the second
> step builds the co-occurrence A'A product. I agree that it will be
> faster to take a shortcut than properly compute A'A.
>
You can work directly with the triples as input if the input is relatively
small. The mapper puts the user as key and the item as value (this is
co-occurrence only, so the rating should only be used to filter which items
get output). The reducer takes the list of items and produces the cross
product on output; it doesn't need to keep the full cross product in memory.
The second MR uses (item1, item2) as the key and basically does a word
count. This approach has the problem of a large, uncombinable intermediate
file.

The preferable approach is for the first MR step to group by user as before,
then down-sample the user's items in the reducer if desired and output that
list as a single record. Down-sampling can be done on-line, keeping just the
retained elements in memory (rough sketch at the end of this note). The
second MR would produce the cross product in the mapper and use a combiner
and reducer (also sketched below).

If we wanted to follow up on Jake's idea, the transpose could be done in the
first MR step, but it is hard to down-sample the user's items that way.

The first step of converting triples to vectors should also produce user and
item counts so that the second step can do the sparsification cleanly.

> (Though I'm curious how this works -- looks deceptively easy, this
> outer product approach. Isn't v cross v potentially huge? or likely to
> be sparse enough to not matter)

The cost is dominated by the square of the number of items that the most
prolific user has. If you down-sample that to something reasonable, then you
bound the total cost very tightly. It is very unusual for the massively
prolific users to add any information ... generally they are spammers or
bots anyway.

> I understand the final step in principle, which is to compute (A'A)h.
> But I keep guessing A'A is too big to fit in memory?

Correct.

(A'A)h can be computed in several ways, but it all comes down to the fact
that h is very sparse. Typically you make it even sparser by keeping only
recent history. If you have only 50 non-zeros in h, then you only need 50
columns of (A'A). These can be retrieved in many different ways, but one
cool way is to make each row of A'A a Lucene document. The terms in the
documents are items, and the columns of A'A are the posting vectors in
Lucene. The weighting that Lucene does generally helps but can easily be
defeated if desired.

Another approach is to store each column of A'A in a key-value store. At
recommendation time, you retrieve the columns for the non-zeros of h and
add them (sketched below as well). This is essentially equivalent to the
Lucene approach without Lucene. Because we know a lot about the contents
(they are integers), you can probably write tighter code than Lucene can.
This would be a great use for the fancy concurrent map builder in Google
Collections, for instance.

--
Ted Dunning, CTO DeepDyve
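
P.S. A few rough sketches to make the above concrete; the class names,
method names, and record formats below are illustrative choices, not
anything settled. First, the on-line down-sampling in the first pass
reducer is just a reservoir sample that holds at most maxItems of a user's
items in memory:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Keeps a uniform random sample of at most maxItems elements while
// streaming over a user's items once; only the retained elements are
// ever held in memory.
public class ReservoirSampler {

  public static List<String> sample(Iterable<String> items, int maxItems,
                                    Random random) {
    List<String> kept = new ArrayList<String>(maxItems);
    int seen = 0;
    for (String item : items) {
      seen++;
      if (kept.size() < maxItems) {
        kept.add(item);
      } else {
        // replace a random slot with probability maxItems / seen
        int slot = random.nextInt(seen);
        if (slot < maxItems) {
          kept.set(slot, item);
        }
      }
    }
    return kept;
  }
}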
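
Here is a rough sketch of the second MR pass, assuming the first pass wrote
one record per user of the form user<TAB>item1,item2,... (that exact layout
and the org.apache.hadoop.mapreduce API are assumptions on my part):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Second pass: the mapper emits every co-occurring item pair from one
// user's (already down-sampled) item list, and a combiner/reducer pair
// just sums the counts.
public class CooccurrenceCount {

  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // input line: user \t item1,item2,...,itemN  (assumed format)
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) {
        return;
      }
      String[] items = fields[1].split(",");
      // cross product of the user's items; the cost is bounded because
      // the first pass already down-sampled very prolific users
      for (int i = 0; i < items.length; i++) {
        for (int j = 0; j < items.length; j++) {
          if (i != j) {
            pair.set(items[i] + "\t" + items[j]);
            context.write(pair, ONE);
          }
        }
      }
    }
  }

  // used as both combiner and reducer: a plain word count on the pairs
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts,
                          Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(pair, new IntWritable(sum));
    }
  }
}

Setting SumReducer as both the combiner and the reducer is what keeps the
intermediate data small enough to be practical.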
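
And a rough sketch of computing (A'A)h at recommendation time by adding up
the columns selected by the non-zeros of h; the ColumnStore interface here
is just a stand-in for whatever key-value store (or Lucene index) holds the
columns:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Score = sum of the columns of A'A selected by the non-zero entries of h
// (the user's recent history). Only the retrieved columns are ever held
// in memory.
public class SparseRecommender {

  // stand-in for whatever key-value store holds the columns of A'A;
  // each column maps item id -> co-occurrence count
  public interface ColumnStore {
    Map<Integer, Integer> getColumn(int itemId);
  }

  public static Map<Integer, Double> score(List<Integer> recentItems,
                                           ColumnStore columns) {
    Map<Integer, Double> scores = new HashMap<Integer, Double>();
    for (int item : recentItems) {          // the non-zeros of h
      Map<Integer, Integer> column = columns.getColumn(item);
      if (column == null) {
        continue;
      }
      for (Map.Entry<Integer, Integer> e : column.entrySet()) {
        Double old = scores.get(e.getKey());
        scores.put(e.getKey(), (old == null ? 0.0 : old) + e.getValue());
      }
    }
    // scores now holds (A'A)h; sort by value and drop items already in
    // recentItems to produce the recommendation list
    return scores;
  }
}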
