Hi,
I did some experimentation with the spark bindings on a real cluster
yesterday, as I had to run some experiments for a paper (unrelated to
Mahout) that I'm currently writing. The experiment basically consists of
multiplying a sparse data matrix by a super-sparse permutation-like
matrix from the left. It took me the whole day to get it working, up to
matrices with 500M entries.
I ran into lots of issues that we have to fix asap, unfortunately I
don't have much time in the next weeks, so I'm just sharing a list of
the issues that I ran into (maybe I'll find some time to create issues
for these things on the weekend).
I think the major challenge for us will be to get choice of dense/sparse
correct and put lots of work into memory efficiency. This could be a
great hook for collaborating with the h20 folks, as they know how to
make vector-like data small and computations fast.
Here's the list:
* our matrix serialization in MatrixWritable is seriously flawed, I ran
into the following errors
- the type information is stored with every vector although a matrix
always only contains vectors of the same type
- all entries of a TransposeView (and possibly other views) of a
sparse matrix are serialized, resulting in OOM
- for sparse row matrices, the vectors are set using assign instead
of via constructor injection, this results in huge memory consumption
and long creation times, as in some implementations, binary search is
used for assignment
* a dense matrix is converted into a SparseRowMatrix with dense row
vectors by blockify(), after serialization this becomes a dense matrix
in sparse format (triggering OOMs)!
* drmFromHDFS does not have an option to set the number of desired
partitions
* SparseRowMatrix with sequential vectors times SparseRowMatrix with
sequential vectors is totally broken, it uses three nested loops and
uses get(row, col) on the matrices, which internally uses binary search...
* At operator adds the column vectors it creates, this is unnecessary as
we don't need the addition, we can just merge the vectors
* we need a dedicated operator for inCoreA %*% drmB, currently this gets
rewritten to (drmB.t %*%* inCoreA.t).t which is highly inefficient (I
have a prototype of that operator)
Best,
Sebastian
- SparkBindings on a real cluster Sebastian Schelter
-