> I think that the proposal under discussion involves adding a dependency
> on a maven-released h2o artifact plus a contribution of Mahout
> translation layers. These layers would give a sub-class of Matrix (and
> Vector) which allows direct control over life span across multiple jobs
> but would otherwise behave like their in-memory counterparts.
Well, I suppose that means they have to live in some processes which are not processes I already have. And they have to be managed. So this is not just an in-core subsystem. Sounds like a new back (i.e. a new backend) to me.

> > In Hadoop, every iteration must be scheduled as a separate job, rereads
> > invariant data and materializes its result to hdfs. Therefore, iterative
> > programs on Hadoop are an order of magnitude slower than on systems
> > that have dedicated support for iterations.
> >
> > Does h2o help here or would we need to incorporate another system for
> > such tasks?
>
> H2o helps here in a couple of different ways.
> The first and foremost is that primitive operations are easy.
> Additionally, data elements can survive a single program's execution.
> This means that programs can be executed one after another to get
> composite effects. This is astonishingly fast ... more along the speeds
> one would expect from a single processor program.

I think the problem here is that the authors keep comparing these techniques to the slowest model available, which is Hadoop. But this is exactly the execution model of Spark: you get stuff repeatedly executed on in-memory partitions, and you get approximately the speed of in-memory iterative execution. I wouldn't describe it as astonishing, though, because it is exactly as fast as you can get things done in memory -- no faster, no slower. That's, for example, the reason why my linalg optimizer does not hesitate to compute exact matrix geometry lazily, if it is not known, for optimization purposes: the answer will be back within 40 to 200 ms, assuming adequate RAM allocation (a minimal sketch of this follows after the quoted text below).

I have been using these paradigms for more than a year now. This is all good stuff. I would not use the word astonishing, but sensible, yes. My main concern is whether the programming model is being sacrificed just to do sensible things here.

> > (2) Efficient join implementations
> >
> > If we look at a lot of Mahout's algorithm implementations with a
> > database hat on, then we see lots of handcoded joins in our codebase,
> > because Hadoop does not bring join primitives. This has lots of
> > drawbacks, e.g. it complicates the codebase and leads to hardcoded join
> > strategies that bake certain assumptions into the code (e.g. ALS uses a
> > broadcast-join which assumes that one side fits into memory on each
> > machine, RecommenderJob uses a repartition-join which is scalable but
> > very slow for small inputs, ...).
>
> +1
> I think that h2o provides this but do not know in detail how. I do know
> that many of the algorithms already coded make use of matrix
> multiplication, which is essentially a join operation.

Essentially a join? The spark module optimizer picks from at least three implementations: zip+combine, block-wise cartesian, and, finally, yes, join+combine (see the second sketch below). Which one it picks depends on the orientation and on the earlier operators in the pipeline. That's exactly my point about the flexibility of the programming model from the optimizer's point of view.

> > Obviously, I'd love to get rid of handcoded joins and implement ML
> > algorithms (which is hard enough on its own). Other systems help with
> > this already. Spark, for example, offers broadcast and repartition-join
> > primitives; Stratosphere has a join primitive and an optimizer that
> > automatically decides which join strategy to use, as well as a highly
> > optimized hybrid hashjoin implementation that can gracefully go
> > out-of-core under memory pressure.
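To ground the lazy-geometry point above, here is a minimal sketch in Spark/Scala of what deferred geometry resolution amounts to. The names are mine for illustration only (this is not the actual spark-bindings code), and it assumes a row-keyed distributed matrix cached in RAM:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch only (hypothetical names): geometry stays unknown until the
// optimizer actually asks for it, at which point a single cheap pass
// over the cached partitions answers.
class DrmSketch(val rdd: RDD[(Int, Array[Double])]) {

  // computed on first access only; against a RAM-resident dataset this
  // is the "40 to 200 ms" class of question
  lazy val nrow: Long = rdd.keys.map(_.toLong).reduce(math.max) + 1
  lazy val ncol: Int  = rdd.values.map(_.length).reduce(math.max)
}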
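And since "matrix multiplication is essentially a join" came up, here is roughly what the join+combine physical variant looks like on Spark, assuming A arrives as rows of A^t (i.e. columns of A) and B arrives row-wise. Again, this is a sketch under my own naming, not the actual optimizer output:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// join+combine sketch of C = A %*% B.
//   atRows: k -> k-th column of A (i.e. k-th row of A^t), length m
//   bRows:  k -> k-th row of B, length n
def joinCombineMultiply(atRows: RDD[(Int, Array[Double])],
                        bRows:  RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] =
  atRows.join(bRows)                      // the join: match operands on the common index k
    .flatMap { case (_, (colA, rowB)) =>  // outer-product contribution of index k
      colA.zipWithIndex.collect {
        case (a, i) if a != 0.0 => (i, rowB.map(_ * a))
      }
    }
    .reduceByKey { (x, y) =>              // the combine: sum partial rows elementwise
      val s = new Array[Double](x.length)
      var i = 0
      while (i < x.length) { s(i) = x(i) + y(i); i += 1 }
      s
    }

The zip+combine variant would instead zip identically partitioned operands, and the block-wise cartesian one pairs up blocks; the point is that the optimizer, not the algorithm author, picks among them based on what it knows about the pipeline.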
> When you get into the realm of things on this level of sophistication, I
> think that you have found the boundary where alternative foundations like
> Spark and Stratosphere are better than h2o. The novelty with h2o is the
> hypothesis that a very large fraction of interesting ML algorithms can be
> implemented without this power. So far, this seems correct.

Again, this is largely along the lines of "let's make a library of a few hand-optimized things". Which is noble, but -- I would argue -- not ambitious enough. Most distributed ML projects do just that. We should perhaps think about what the differentiating factor for us could be. Not that we should not care about performance; it should be, of course, *sensible*. (Our MR code base obviously does not give us that, and, as you said, jumping off the MR wagon is not even a question.)

If you can forgive me for drawing parallels here, it is the difference between something like Weka and R: a collection, vs. a platform _and_ a collection induced by that platform. The platform, of course, also feeds directly into the speed at which the collection grows. When I use R, my code does not consist solely of algorithm calls. Yes, there is off-the-shelf use now and then, but that is far from the only thing going on; 95% of it is simple feature massaging. I place no value in R for providing GLM for me -- gosh, this particular offering is virtually hanging from everywhere these days. But I do place value in it for custom feature prep and for, for example, being able to get 100 grad students to try their own k-means implementations in seconds.

Why does that matter? There has been a lot of talk here about building community, contributions, etc. A platform is what builds those, most directly and amazingly. I will go out on a limb here and say that Spark and MLlib are experiencing explosive growth of contributions not because they can do things with in-memory datasets (which is important but, like I said, has long been viewed as no more than sensible), but because of the clarity of their programming model. I think we have seen very solid evidence that the clarity and richness of the programming model is what attracts communities.

If we grade roughly (very roughly!) what we have today, I can easily argue that acceptance levels follow the programming model very closely. E.g., if I try to sort projects with distributed programming models by (my subjectively perceived) popularity, from bottom to top:

********

Hadoop MapReduce -- ok, I don't even know how to organize the critique here, the list is too long. Almost nobody (but Mahout) does things this way today. Certainly, neither of my last 2 employers did.

Hive -- SQL-like, with severely constrained general programming language capabilities; not conducive to batches. Pretty much limited to ad-hoc exploration.

Pig -- a bit better: you can write batches, but extra functionality mixins (UDFs) are still a royal pain.

Cascading -- even easier: rich primitives, easy batches, some manual optimization of physical plan elements. One of the big cons is the limitation of a rigid dataset tuple structure.

FlumeJava (Crunch in the Apache world) -- even better, but Java closures are just plain ugly and there is zero "scriptability". Its community was hurt a little by arriving a bit late to the show compared to others (e.g. Cascading), but it leveled off quickly.

Scala bindings for Cascading (Scalding) and FlumeJava -- better; hell, well better on the closure and FP front!
But still, not being native to Scala from the get-go creates some miniature problems there.

Spark -- I think it is fair to say the current community "king", above all of those. All the aforementioned platform-model pains are eliminated, although on the performance side I think there are still some pockets for improvement around cost-based optimization. Stratosphere might be more interesting in this department, but I am not sure at this point whether that will necessarily translate into performance benefits for ML.

********

The first few of these use the same computing model underneath and have essentially the same performance. Yet there is clear variation in community and acceptance. In the ML world, we are seeing approximately the same thing: the clearer the programming model and the easier the integration into one's process, the wider the acceptance.

I can probably argue quite successfully that the most performant ML "thing" as it stands today is GraphLab. And it is pretty comprehensive in problem coverage (I think it covers, e.g., recommender concerns better than h2o and Mahout together). But I can also argue quite successfully that it gets rejected a lot of the time for being just a collection (which, in addition, is hard to call from the JVM -- i.e., integration again). It is actually so bad that people in my company would rather go back to 20 snow-wired R servers than even entertain an architecture that includes a GraphLab component. (Yes, the variance of this sample is as high as it gets; I am just saying what I hear.)

So, as a general guideline for solving the current ills, it would stand to reason to adopt platform priority, with the algorithm collection a function of that platform, rather than a collection that is a function of a few dedicated efforts. Yes -- it has to be *sensibly* performant -- but this does not have to be a concern of the code in this project directly, for the most part. Rather, it has to be a concern of the backs (i.e. the dependencies) and of our in-core support. Our pathological fear of being a performance scapegoat totally obscures the fact that performance is mostly a function of the back, and that we were riding the wrong back for a long time. As long as we do not cling to one particular back, it should not be a problem. Which would one rather accept: being initially 5x slower than GraphLab (but on par with MLlib) while beating them on community support, or being on par but anemic in community?

If the 0xdata platform feels that performance has been important enough to sacrifice the programming model for, why do they feel the need to join an Apache project? After all, they have been an open project for a long time already and have built their own community, big or small. Spark has only just become a top-level Apache project, joined the Apache incubator a mere 2 months ago, and had no trouble at all attracting a community outside Apache. Stratosphere is not even in Apache. Similarly, did being in Apache help Mahout get anywhere close to these in community measurements? So this thoroughly refutes the argument that one has to be an Apache project to get one's exclusive qualities highlighted. Perhaps, in the end, it is more about how important those qualities are to the community, and about the quality of the contributions.

A lot of this platform and programming-model priority is probably easier said than done, but some of the linalg and data frame things are ridiculously easy in terms of the amount of effort.
If I could do a linalg optimizer with bindings for Spark in 2 nights a month, the same can be done for multiple backs, and for data frames, in a jiffy. Well, the back should have a clear programming model, of course, as a prerequisite. Which brings us back to the issue of the richness of distributed primitives.
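To make "ridiculously easy" a bit more concrete: the essence of such a binding layer is operators that merely record a logical plan, plus a small rewriter that maps that plan onto the back's physical primitives. A toy sketch follows; every name in it is made up, and it is deliberately far simpler than the real thing:

// Toy logical-plan DSL for distributed linalg (all names hypothetical).
object DslSketch {

  sealed trait Plan
  case class Load(path: String)         extends Plan
  case class Transpose(a: Plan)         extends Plan
  case class Multiply(a: Plan, b: Plan) extends Plan
  case class SelfSquare(a: Plan)        extends Plan // physical op for A.t %*% A

  // operators only build the plan; nothing touches the cluster yet
  implicit class Ops(a: Plan) {
    def t: Plan = Transpose(a)
    def %*%(b: Plan): Plan = Multiply(a, b)
  }

  // one optimizer rule: rewrite A.t %*% A into a dedicated self-join op
  def optimize(p: Plan): Plan = p match {
    case Multiply(Transpose(a), b) if a == b => SelfSquare(optimize(a))
    case Multiply(a, b)                      => Multiply(optimize(a), optimize(b))
    case Transpose(a)                        => Transpose(optimize(a))
    case leaf                                => leaf
  }

  def main(args: Array[String]): Unit = {
    val a = Load("hdfs://namenode/A")
    println(optimize(a.t %*% a)) // prints SelfSquare(Load(hdfs://namenode/A))
  }
}

The per-back work then reduces to the physical operators (like the join+combine sketch earlier) and the I/O; the algebra and the rewriting stay back-independent. That is the "2 nights a month" part.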