Re: SparkBindings on a real cluster

Dmitriy Lyubimov Wed, 04 Jun 2014 15:26:01 -0700

@Seb: submitted PRs containing simple fixes for the ones having exlamation
marks... assuming they are critical. If you could test/comment, would be
awesome.



On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter <[email protected]> wrote:

> Hi,
>
> I did some experimentation with the spark bindings on a real cluster
> yesterday, as I had to run some experiments for a paper (unrelated to
> Mahout) that I'm currently writing. The experiment basically consists of
> multiplying a sparse data matrix by a super-sparse permutation-like matrix
> from the left. It took me the whole day to get it working, up to matrices
> with 500M entries.
>
> I ran into lots of issues that we have to fix asap, unfortunately I don't
> have much time in the next weeks, so I'm just sharing a list of the issues
> that I ran into (maybe I'll find some time to create issues for these
> things on the weekend).
>
> I think the major challenge for us will be to get choice of dense/sparse
> correct and put lots of work into memory efficiency. This could be a great
> hook for collaborating with the h20 folks, as they know how to make
> vector-like data small and computations fast.
>
> Here's the list:
>
> * our matrix serialization in MatrixWritable is seriously flawed, I ran
> into the following errors
>
>   - the type information is stored with every vector although a matrix
> always only contains vectors of the same type
>   - all entries of a TransposeView (and possibly other views) of a sparse
> matrix are serialized, resulting in OOM
>   - for sparse row matrices, the vectors are set using assign instead of
> via constructor injection, this results in huge memory consumption and long
> creation times, as in some implementations, binary search is used for
> assignment
>
> * a dense matrix is converted into a SparseRowMatrix with dense row
> vectors by blockify(), after serialization this becomes a dense matrix in
> sparse format (triggering OOMs)!
>
> * drmFromHDFS does not have an option to set the number of desired
> partitions
>
> * SparseRowMatrix with sequential vectors times SparseRowMatrix with
> sequential vectors is totally broken, it uses three nested loops and uses
> get(row, col) on the matrices, which internally uses binary search...
>
> * At operator adds the column vectors it creates, this is unnecessary as
> we don't need the addition, we can just merge the vectors
>
> * we need a dedicated operator for inCoreA %*% drmB, currently this gets
> rewritten to (drmB.t %*%* inCoreA.t).t which is highly inefficient (I have
> a prototype of that operator)
>
> Best,
> Sebastian
>
>
>

Re: SparkBindings on a real cluster

Reply via email to