I really should get my partially finished version of this in there... it seems you guys keep converging closer and closer to my weird matrix-triple-product way of doing it as time goes on. :)
But yes, in general: avoiding MapFiles always helps. Hadoop is designed for
bulk sequential access, and letting it do that allows for maximal throughput;
doing anything else is... fraught with peril.

-jake

On Fri, Apr 23, 2010 at 2:44 AM, Sean Owen <sro...@gmail.com> wrote:
> I thought it might be worth bringing this back to the user list.
>
> Ankur effectively raised issues about the performance of
> org.apache.mahout.cf.taste.hadoop.item by adding
> org.apache.mahout.cf.taste.hadoop.cooccurrence, which is a similar
> recommender job (item co-occurrence-based) but with a different
> implementation. ".item" ultimately does not distribute the matrix-user
> vector multiply, and ".cooccurrence" distributes it heavily.
>
> .item did that multiply by side-loading the co-occurrence matrix into a
> reducer, accessing it from disk as MapFiles. This way of accessing
> columns proved to be very slow.
>
> After much experimentation, I've completely overhauled .item by grafting
> in ideas from .cooccurrence. It is a sort of best-of-both-worlds hybrid
> of the two. It borrows a clever way to join two kinds of input into one
> MapReduce, in order to join the co-occurrence matrix columns with the
> individual elements of each user vector. The partial products are output
> and recombined later. This hybrid retains features of .item such as
> accommodating user ratings.
>
> Letting Hadoop manage the data flow (even though it takes a bit more
> copying), avoiding random-access reads from MapFiles, using features
> like the Combiner, and being smarter about Writables have sped this up
> for me by at least a factor of 10 -- mostly from avoiding MapFiles.
>
> I bring it up since it's interesting, a good development for anyone
> using this implementation, and, I imagine, an area that is ripe for more
> testing and improvement.
>
> Sean
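
For anyone who wants to picture the join concretely, here is a rough sketch of
the per-item step. The class names are invented for illustration -- they are
not the actual classes in org.apache.mahout.cf.taste.hadoop.item -- and it
assumes a tagged Writable that carries either an item's co-occurrence column
or a single (userID, preference) element, so both kinds of input can be
shuffled to one reducer under the item ID as the key:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical tagged value, invented for this sketch: it carries either an
// item's co-occurrence column or one (userID, preference) element, so the
// matrix columns and the user-vector elements can share one reduce key.
class ColumnOrPrefWritable implements Writable {

  private Vector column;   // non-null only for the "column" flavor
  private long userID;
  private float prefValue;

  void setColumn(Vector column) { this.column = column; }
  void setUserPref(long userID, float prefValue) {
    this.column = null;
    this.userID = userID;
    this.prefValue = prefValue;
  }

  boolean hasColumn()  { return column != null; }
  Vector getColumn()   { return column; }
  long getUserID()     { return userID; }
  float getPrefValue() { return prefValue; }

  @Override
  public void write(DataOutput out) throws IOException {
    boolean isColumn = column != null;
    out.writeBoolean(isColumn);
    if (isColumn) {
      new VectorWritable(column).write(out);
    } else {
      out.writeLong(userID);
      out.writeFloat(prefValue);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    if (in.readBoolean()) {
      VectorWritable vw = new VectorWritable();
      vw.readFields(in);
      column = vw.get();   // a fresh Vector on every call
    } else {
      column = null;
      userID = in.readLong();
      prefValue = in.readFloat();
    }
  }
}

// Per-item join reducer, keyed by item ID: the values are the item's
// co-occurrence column plus every (user, preference) pair for that item.
// It emits one partial product vector per user; a later pass sums the
// partial vectors keyed by user ID to form each user's recommendations.
public class PartialProductReducer extends
    Reducer<LongWritable, ColumnOrPrefWritable, LongWritable, VectorWritable> {

  @Override
  protected void reduce(LongWritable itemID,
                        Iterable<ColumnOrPrefWritable> values,
                        Context context) throws IOException, InterruptedException {

    Vector column = null;
    List<Long> userIDs = new ArrayList<Long>();
    List<Float> prefValues = new ArrayList<Float>();

    // Hadoop reuses the value instance across this iteration; holding the
    // column reference is safe here only because readFields above builds a
    // fresh Vector each time it deserializes a column.
    for (ColumnOrPrefWritable value : values) {
      if (value.hasColumn()) {
        column = value.getColumn();
      } else {
        userIDs.add(value.getUserID());
        prefValues.add(value.getPrefValue());
      }
    }

    if (column == null) {
      return;   // item appears in user vectors but not in the co-occurrence matrix
    }

    // preference * co-occurrence column = this item's contribution to the
    // user's recommendation vector.
    for (int i = 0; i < userIDs.size(); i++) {
      Vector partial = column.times(prefValues.get(i));
      context.write(new LongWritable(userIDs.get(i)),
                    new VectorWritable(partial));
    }
  }
}

The follow-up pass is then just a vector sum keyed by user ID, which is also
where the Combiner pays off: partial vectors for the same user can be summed
before they ever hit the network.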