Hi,

I did some experimentation with the spark bindings on a real cluster yesterday, as I had to run some experiments for a paper (unrelated to Mahout) that I'm currently writing. The experiment basically consists of multiplying a sparse data matrix by a super-sparse permutation-like matrix from the left. It took me the whole day to get it working, up to matrices with 500M entries.

I ran into lots of issues that we have to fix asap, unfortunately I don't have much time in the next weeks, so I'm just sharing a list of the issues that I ran into (maybe I'll find some time to create issues for these things on the weekend).

I think the major challenge for us will be to get choice of dense/sparse correct and put lots of work into memory efficiency. This could be a great hook for collaborating with the h20 folks, as they know how to make vector-like data small and computations fast.

Here's the list:

* our matrix serialization in MatrixWritable is seriously flawed, I ran into the following errors

- the type information is stored with every vector although a matrix always only contains vectors of the same type - all entries of a TransposeView (and possibly other views) of a sparse matrix are serialized, resulting in OOM - for sparse row matrices, the vectors are set using assign instead of via constructor injection, this results in huge memory consumption and long creation times, as in some implementations, binary search is used for assignment

* a dense matrix is converted into a SparseRowMatrix with dense row vectors by blockify(), after serialization this becomes a dense matrix in sparse format (triggering OOMs)!

* drmFromHDFS does not have an option to set the number of desired partitions

* SparseRowMatrix with sequential vectors times SparseRowMatrix with sequential vectors is totally broken, it uses three nested loops and uses get(row, col) on the matrices, which internally uses binary search...

* At operator adds the column vectors it creates, this is unnecessary as we don't need the addition, we can just merge the vectors

* we need a dedicated operator for inCoreA %*% drmB, currently this gets rewritten to (drmB.t %*%* inCoreA.t).t which is highly inefficient (I have a prototype of that operator)

Best,
Sebastian


Reply via email to