thanks for identifying these issues. This is great help. as well expected, in-memory stuff performance is (and will always be as power law algorithms will stay predominantly cpu-bound) a function of mahout-math (i.e. in-core math). So i think we can put at rest the "Spark performance crashed by something" discussions.
Re: in-memory efficiency. I am dubious about h2o involvement with the set of thier current techniques for a few reasons. One, their compression techniques do not fit our in-core algebra use cases (or fit them as corner cases only such as indicator matrix but not say even ternary matrix). I don't want to extend the argument base here, but can explain it on a side if it is to pursue any pragmatical merit. Second, the compression issue itself is not the main issue here. A large portion of it is cost-based algorithm application on matrix-involved operations, which we simply don't have for stuff like open-hash based sparse matrices. Third, purely from project management perspective, 0xdata has already been called upon to participate on these issues more than once and chose to stay away. They've been clear they are extolling merits for a particular business interest here. Naturally we can't expect them fixing existing Mahout stuff beyond that stated business interest. Mahout's Colt-derived in-core math clearly has not been that scope. Re: Matrix serialization issues: So it's even worse than i thought then. This sounds serious. Re: blockify: i thought there (used to be, at least) an attempt to make an assessment of incoming data sparsity by looking at first incoming vector type, and subsequently treat the block data accordingly. I can look into it. Re: drmFromHdfs: this method parititions based on data affinity. There's no need to mess with this process additionally on this level. There may be an issue of further re-splitting the data for the purposes of improved task scheduling (one of the issues identified in my talk), but that is not a business of loader, and it is not limited to the case of initial loading. This needs a bit more thought rather than hacking the loader IMO. On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter <[email protected]> wrote: > Hi, > > * SparseRowMatrix with sequential vectors times SparseRowMatrix with > sequential vectors is totally broken, it uses three nested loops and uses > get(row, col) on the matrices, which internally uses binary search... > back to my point about cost-based matrix operations... But this is a huge issue, it has been big for vectors, and is probably even bigger for matrices. I wonder if I'd rather test it with Breeze, to see if they are any better by now handling that mixed operand types stuff. Or they pretty much just restricted themselves to a solid lapack integration. That would completely de-mahout-ize the whole idea though but may save a lot of grief dealing with mahout-math issues. I don't preclude that internally i may have to do just that. > > * At operator adds the column vectors it creates, this is unnecessary as > we don't need the addition, we can just merge the vectors > They need to be added to handle cases of matrices with duplicate row keys. It is a bit weird but there's code using this fact (Lloyd iterations of K-means, not in public code base). Are you sure vector addition cannot be as efficient as concat, assuming we expect it to be _mostly_ a concat? Not making much of an argument here, just pointing out existing use-case of mine which this technique will not cover. > > * we need a dedicated operator for inCoreA %*% drmB, currently this gets > rewritten to (drmB.t %*%* inCoreA.t).t which is highly inefficient (I have > a prototype of that operator) > Well this involves one shuffle task. Outer transposition may still be rewritten. (I guess in practice it didn't for you). I did not have a chance to think about it much yet (since i haven't had the case of left-multiply by in-core operand), but I couldn't think of any algorithm that wouldn't involve at least one shuffle task. I'd be happy to look at your prototype, esp. if it avoids shuffle tasks the way right-multiply version does. > Best, > Sebastian > > >
