I thought it might be worth bringing this back to the user list.

Ankur effectively raised issues about the performance of
org.apache.mahout.cf.taste.hadoop.item by adding
org.apache.mahout.cf.taste.hadoop.cooccurrence, which is a similar
recommender job (item cooccurrence-based) but with a different
implementation. ".item" ultimately does not distribute the
matrix-by-user-vector multiply, while ".cooccurrence" distributes it
heavily.

.item performed the multiply by side-loading the co-occurrence matrix
into a reducer, reading it from disk as MapFiles. Accessing columns
this way proved to be very slow.

After much experimentation, I've completely overhauled .item by
grafting in ideas from .cooccurrence. It is a sort of
best-of-both-worlds hybrid of the two. It borrows a clever way to join
two kinds of input into one MapReduce, in order to join the
co-occurrence matrix columns and individual elements of each user
vector. The product is output and recombined later. This hybrid
retains features of .item like accommodating user ratings.
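
To make the data flow concrete, here is a rough sketch in plain Python
rather than Hadoop code. It only simulates the idea: key both kinds of
input by item ID so they meet in one group, scale each co-occurrence
column by the user's preference for that item, then sum the partial
vectors per user. The function names and the sample data are made up
for illustration and are not the actual Mahout classes.

```python
from collections import defaultdict


def join_and_multiply(cooccurrence_columns, user_prefs):
    """Simulate the join: group both inputs by item ID, then multiply.

    cooccurrence_columns: {item_id: {other_item_id: count}}
    user_prefs: {user_id: {item_id: preference}}
    Yields (user_id, partial_vector), one per (user, item) preference.
    """
    # "Map" side: both kinds of input are keyed by item ID.
    grouped = defaultdict(lambda: {"column": None, "prefs": []})
    for item_id, column in cooccurrence_columns.items():
        grouped[item_id]["column"] = column
    for user_id, prefs in user_prefs.items():
        for item_id, pref in prefs.items():
            grouped[item_id]["prefs"].append((user_id, pref))

    # "Reduce" side: scale the column by each user's preference.
    for group in grouped.values():
        column = group["column"]
        if column is None:
            continue  # no co-occurrence data for this item
        for user_id, pref in group["prefs"]:
            yield user_id, {other: count * pref
                            for other, count in column.items()}


def recombine(partials):
    """Sum the partial product vectors for each user."""
    totals = defaultdict(lambda: defaultdict(float))
    for user_id, partial in partials:
        for item_id, value in partial.items():
            totals[user_id][item_id] += value
    return {user: dict(vec) for user, vec in totals.items()}


# Hypothetical sample data: a symmetric co-occurrence matrix stored by
# column, and two users with ratings.
cooccurrence = {
    "A": {"A": 2, "B": 1},
    "B": {"A": 1, "B": 3, "C": 1},
    "C": {"B": 1, "C": 1},
}
user_prefs = {"u1": {"A": 5.0, "B": 2.0}, "u2": {"C": 4.0}}

scores = recombine(join_and_multiply(cooccurrence, user_prefs))
print(scores)
```

In a real job the summing in recombine() is the kind of work a
Combiner can start early, since addition of partial vectors is
associative and commutative.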

Letting Hadoop manage the data flow (even though it takes a bit more
copying), avoiding random-access reads from MapFiles, using features
like the Combiner, and being smarter about Writables has sped this up
for me by at least a factor of 10. Most of that gain comes from
avoiding MapFiles.

I bring it up since it's interesting, a good development for anyone
using this implementation, and, I imagine, an area that is ripe for
more testing and improvement.

Sean
