Re: SparkBindings on a real cluster

Dmitriy Lyubimov Wed, 04 Jun 2014 09:44:37 -0700

thanks for identifying these issues. This is great help.

as well expected, in-memory stuff performance is (and will always be as
power law algorithms will stay predominantly cpu-bound) a function of
mahout-math (i.e. in-core math). So i think we can put at rest the "Spark
performance crashed by something" discussions.

Re: in-memory efficiency. I am dubious about h2o involvement with the set
of thier current techniques for a few reasons.

One, their compression techniques do not fit our in-core algebra use cases
(or fit them as corner cases only such as indicator matrix but not say even
ternary matrix). I don't want to extend the argument base here, but can
explain it on a side if it is to pursue any pragmatical merit.

Second, the compression issue itself is not the main issue here. A large
portion of it is cost-based algorithm application on matrix-involved
operations, which we simply don't have for stuff like open-hash based
sparse matrices.

Third, purely from project management perspective, 0xdata has already been
called upon to participate on these issues more than once and chose to stay
away. They've been  clear they are extolling merits for a particular
business interest here. Naturally we can't expect them fixing existing
Mahout stuff beyond that stated business interest. Mahout's Colt-derived
in-core math clearly has not been that scope.

Re: Matrix serialization issues: So it's even worse than i thought then.
This sounds serious.

Re: blockify: i thought there (used to be, at least) an attempt to make an
assessment of incoming data sparsity by looking at first incoming vector
type,  and subsequently treat the block data accordingly. I can look into
it.

Re: drmFromHdfs: this method parititions based on data affinity. There's no
need to mess with this process additionally on this level. There may be an
issue of further re-splitting the data for the purposes of improved task
scheduling (one of the issues identified in my talk), but that is not a
business of loader, and it is not limited to the case of initial loading.
This needs a bit more thought rather than hacking the loader IMO.

On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter <[email protected]> wrote:

> Hi,
>
> * SparseRowMatrix with sequential vectors times SparseRowMatrix with
> sequential vectors is totally broken, it uses three nested loops and uses
> get(row, col) on the matrices, which internally uses binary search...
>

back to my point about cost-based matrix operations... But this is a huge
issue, it has been big for vectors, and is probably even bigger for
matrices. I wonder if I'd rather test it with Breeze, to see if they are
any better by now handling that mixed operand types stuff. Or they pretty
much just restricted themselves to a solid lapack integration. That would
completely de-mahout-ize the whole idea though but may save a lot of grief
dealing with mahout-math issues. I don't preclude that internally i may
have to do just that.

>
> * At operator adds the column vectors it creates, this is unnecessary as
> we don't need the addition, we can just merge the vectors
>

They need to be added to handle cases of matrices  with duplicate row keys.
 It is a bit weird but there's code using this fact (Lloyd iterations of
K-means, not in public code base). Are you sure vector addition cannot be
as efficient as concat, assuming we expect it to be _mostly_ a concat? Not
making much of an argument here, just pointing out existing use-case of
mine which this technique will not cover.

>
> * we need a dedicated operator for inCoreA %*% drmB, currently this gets
> rewritten to (drmB.t %*%* inCoreA.t).t which is highly inefficient (I have
> a prototype of that operator)
>

Well this involves one shuffle task. Outer transposition may still be
rewritten. (I guess in practice it didn't for you).

I did not have a chance to think about it much yet (since i haven't had the
case of left-multiply by in-core operand), but I couldn't think of any
algorithm that wouldn't involve at least one shuffle task. I'd be happy to
look at your prototype, esp. if it avoids shuffle tasks the way
right-multiply version does.

> Best,
> Sebastian
>
>
>

Re: SparkBindings on a real cluster

Reply via email to