Wow that's a lot of thinking. Maybe we can back up a bit and speak in
generalities.

In this world there's no such thing as cramming the data set into
memory. SparseMatrix is out. There isn't any efficient way to access
arbitrary data at all. You'll think in terms of an algorithm that
proceeds as a series of large, bulk operations (e.g. matrix times
matrix), each of which decomposes into small, independent computations.
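For instance, here's a minimal sketch of how even a simple bulk
operation like matrix-times-vector breaks apart on Hadoop: each map
call handles one row on its own and never sees the rest of the matrix.
(Class names are made up, and I'm hand-waving how the right-hand
vector reaches the tasks -- DistributedCache or similar.)

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// One map call = one independent computation: dot one matrix row with v.
public class RowTimesVectorMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, DoubleWritable> {

  private Vector v; // the right-hand vector, assumed small enough per task

  @Override
  protected void setup(Context ctx) {
    v = loadVector(ctx); // however you ship v to tasks -- not the point here
  }

  @Override
  protected void map(IntWritable rowIndex, VectorWritable row, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(rowIndex, new DoubleWritable(row.get().dot(v)));
  }

  private Vector loadVector(Context ctx) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}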

Simple ideas like "divide everything by the count of all users" can be
surprisingly tricky. You have to run a job to count, write the result
out, then inject that value into a second MapReduce job.
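Roughly, the driver ends up looking something like this (job wiring
elided, class and counter names made up): job 1 produces the count,
the driver pulls it back out, and only then can it be handed to job 2
through its Configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DivideByUserCountDriver {
  public static void main(String[] args) throws Exception {
    // Job 1: count users (its mapper increments a counter, say "stats":"users").
    Configuration countConf = new Configuration();
    Job countJob = new Job(countConf, "count users");
    // ... set mapper/reducer, input/output paths ...
    countJob.waitForCompletion(true);
    long numUsers =
        countJob.getCounters().findCounter("stats", "users").getValue();

    // Job 2: the actual division; it only sees the count via its config.
    Configuration divideConf = new Configuration();
    divideConf.setLong("num.users", numUsers); // read back in the mapper's setup()
    Job divideJob = new Job(divideConf, "divide by user count");
    // ... set a mapper that emits value / numUsers ...
    divideJob.waitForCompletion(true);
  }
}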

You can see my implementation of a simplistic co-occurrence-based
recommender in org.apache.mahout.cf.taste.hadoop. It proceeds mostly
in terms of operations on vectors, and it's a little tangled.
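This is not the actual code in that package, but it gives the flavor:
for binary preference data the co-occurrence matrix is just A'A, and
you can build it a row at a time by emitting each user's preference
vector once per item that user touched, then summing per item. (Exact
Vector method names vary a bit by Mahout version.)

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// For each item a user touched, that user's whole preference vector
// contributes to the item's row of the co-occurrence matrix.
public class CooccurrenceMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  @Override
  protected void map(IntWritable userId, VectorWritable prefs, Context ctx)
      throws IOException, InterruptedException {
    Vector userVector = prefs.get();
    Iterator<Vector.Element> it = userVector.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      ctx.write(new IntWritable(e.index()), new VectorWritable(userVector));
    }
  }
}

// Summing the contributions yields one row of the co-occurrence matrix.
class CooccurrenceReducer
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  @Override
  protected void reduce(IntWritable itemId, Iterable<VectorWritable> partials,
      Context ctx) throws IOException, InterruptedException {
    Vector sum = null;
    for (VectorWritable vw : partials) {
      sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
    }
    ctx.write(itemId, new VectorWritable(sum));
  }
}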

DistributedRowMatrix may streamline a lot of this; I'm not personally
familiar with it (but it sounds like an excellent sort of idea). I
operate in terms of Vectors, and "matrices" as lots of Vectors
together. There's never an operation whose atomic unit is "a whole
matrix".

That may send you back to the drawing board a bit, but this is the
sort of structure, and the sort of data structures, to have in mind on
Hadoop.

Others, I'm sure, have more nuanced ideas here.
