SparkBindings on a real cluster

Sebastian Schelter Wed, 04 Jun 2014 01:01:25 -0700

Hi,

I did some experimentation with the spark bindings on a real cluster yesterday, as I had to run some experiments for a paper (unrelated to Mahout) that I'm currently writing. The experiment basically consists of multiplying a sparse data matrix by a super-sparse permutation-like matrix from the left. It took me the whole day to get it working, up to matrices with 500M entries.

I ran into lots of issues that we have to fix asap. Unfortunately, I don't have much time in the next weeks, so I'm just sharing a list of the issues that I ran into (maybe I'll find some time to create issues for these things on the weekend).

I think the major challenge for us will be to get the choice of dense/sparse correct and to put lots of work into memory efficiency. This could be a great hook for collaborating with the h2o folks, as they know how to make vector-like data small and computations fast.


Here's the list:

* our matrix serialization in MatrixWritable is seriously flawed, I ran into the following errors:

  - the type information is stored with every vector although a matrix always only contains vectors of the same type
  - all entries of a TransposeView (and possibly other views) of a sparse matrix are serialized, resulting in OOM
  - for sparse row matrices, the vectors are set using assign instead of via constructor injection, this results in huge memory consumption and long creation times, as in some implementations, binary search is used for assignment

* a dense matrix is converted into a SparseRowMatrix with dense row vectors by blockify(), after serialization this becomes a dense matrix in sparse format (triggering OOMs)!

* drmFromHDFS does not have an option to set the number of desired partitions

* SparseRowMatrix with sequential vectors times SparseRowMatrix with sequential vectors is totally broken, it uses three nested loops and uses get(row, col) on the matrices, which internally uses binary search...

* At operator adds the column vectors it creates, this is unnecessary as we don't need the addition, we can just merge the vectors

* we need a dedicated operator for inCoreA %*% drmB, currently this gets rewritten to (drmB.t %*% inCoreA.t).t which is highly inefficient (I have a prototype of that operator)
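
To illustrate the SparseRowMatrix times SparseRowMatrix item above, here is a minimal sketch of a row-wise sparse product that never calls get(row, col) and so needs no binary search. It is plain Python, not Mahout code. The representation (a list of rows, each row a list of (column, value) pairs sorted by column) and the name multiply_sparse_rows are assumptions made for illustration only; Mahout's actual classes differ.

```python
def multiply_sparse_rows(a_rows, b_rows):
    """Compute C = A * B for row-wise sparse matrices.

    Each matrix is a list of rows; each row is a list of
    (column, value) pairs holding only the non-zero entries.
    This layout is a hypothetical stand-in for a sparse row matrix.
    """
    result = []
    for a_row in a_rows:
        acc = {}  # column index -> accumulated value for this output row
        # Walk only the non-zeros of the current row of A ...
        for k, a_val in a_row:
            # ... and scale-and-add row k of B, again touching only non-zeros.
            for j, b_val in b_rows[k]:
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        # Emit the output row sorted by column, matching the input layout.
        result.append(sorted(acc.items()))
    return result


# A permutation-like matrix applied from the left: it swaps the two rows
# of B and appends an empty third row.
perm = [[(1, 1.0)], [(0, 1.0)], []]
data = [[(0, 2.0), (2, 3.0)], [(1, 4.0)]]
product = multiply_sparse_rows(perm, data)
```

The work done is proportional to the number of non-zero pairs that actually meet, instead of rows times columns times inner dimension with a lookup per cell.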


Best,
Sebastian
