@Seb: submitted PRs containing simple fixes for the ones having exlamation marks... assuming they are critical. If you could test/comment, would be awesome.
On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter <[email protected]> wrote: > Hi, > > I did some experimentation with the spark bindings on a real cluster > yesterday, as I had to run some experiments for a paper (unrelated to > Mahout) that I'm currently writing. The experiment basically consists of > multiplying a sparse data matrix by a super-sparse permutation-like matrix > from the left. It took me the whole day to get it working, up to matrices > with 500M entries. > > I ran into lots of issues that we have to fix asap, unfortunately I don't > have much time in the next weeks, so I'm just sharing a list of the issues > that I ran into (maybe I'll find some time to create issues for these > things on the weekend). > > I think the major challenge for us will be to get choice of dense/sparse > correct and put lots of work into memory efficiency. This could be a great > hook for collaborating with the h20 folks, as they know how to make > vector-like data small and computations fast. > > Here's the list: > > * our matrix serialization in MatrixWritable is seriously flawed, I ran > into the following errors > > - the type information is stored with every vector although a matrix > always only contains vectors of the same type > - all entries of a TransposeView (and possibly other views) of a sparse > matrix are serialized, resulting in OOM > - for sparse row matrices, the vectors are set using assign instead of > via constructor injection, this results in huge memory consumption and long > creation times, as in some implementations, binary search is used for > assignment > > * a dense matrix is converted into a SparseRowMatrix with dense row > vectors by blockify(), after serialization this becomes a dense matrix in > sparse format (triggering OOMs)! > > * drmFromHDFS does not have an option to set the number of desired > partitions > > * SparseRowMatrix with sequential vectors times SparseRowMatrix with > sequential vectors is totally broken, it uses three nested loops and uses > get(row, col) on the matrices, which internally uses binary search... > > * At operator adds the column vectors it creates, this is unnecessary as > we don't need the addition, we can just merge the vectors > > * we need a dedicated operator for inCoreA %*% drmB, currently this gets > rewritten to (drmB.t %*%* inCoreA.t).t which is highly inefficient (I have > a prototype of that operator) > > Best, > Sebastian > > >
