H2O is unknown only until it becomes known. All of us have watched
every open source phenomenon, including the successful ones, go through
that phase. Linux, Apache, Hadoop, and until recently even Spark were
all targets of fear and uncertainty. I'm a fan of Spark and of Matei's
relentless pursuit over the years; quantitative primitives simply
weren't their focus.

Dmitriy's point on the programming model is a good one:
- Our programming model is map-reduce over a distributed, chunked k/v
store. As plain-jane as it gets.
- We don't feel competitive with Spark. An algorithm designer should be
able to define algorithms that run on multiple architectures. H2O can
easily embrace Spark at the Scala/MLI layer or at the RDD data
ingest/store layer. Some of our users already use Shark for
pre-processing and H2O for the machine learning.

The reality is that there is no architectural silver bullet for any
good-sized body of real-world use cases. Interoperability and
heterogeneity, in the data center and among developers, are a given; we
should be open to embracing that.

- The point about better documentation of the architecture is well
taken, and it is being addressed. The algorithms themselves are well
documented and work as advertised in production environments.
  (The product carries its documentation with it.)

Let me segue to presenting a simple linear regression program on H2O
(one we use in some of our meetups & community efforts):
https://github.com/0xdata/h2o/blob/master/src/main/java/hex/LR2.java

*Commentary for the code:*
*1. Break the problem down into discrete phases.*


// Pass 1: compute sums & sums-of-squares
CalcSumsTask lr1 = new CalcSumsTask().doAll(vec_x, vec_y);

// Pass 2: compute squared errors
final double meanX = lr1._sumX/nrows;
final double meanY = lr1._sumY/nrows;
CalcSquareErrorsTasks lr2 = new CalcSquareErrorsTasks(meanX, meanY).doAll(vec_x, vec_y);

// Pass 3: compute the regression
beta1 = lr2._XYbar / lr2._XXbar;
beta0 = meanY - beta1 * meanX;
CalcRegressionTask lr3 = new CalcRegressionTask(beta0, beta1, meanY).doAll(vec_x, vec_y);
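
(Mapping the code back to the math: the three passes implement the
textbook closed form for simple linear regression,

  beta1 = sum((x_i - meanX)*(y_i - meanY)) / sum((x_i - meanX)^2)  =  lr2._XYbar / lr2._XXbar
  beta0 = meanY - beta1 * meanX
  r^2   = SSR / SSTO  =  lr3._ssr / lr2._YYbar

with the centered sums accumulated in Pass 2 and the residual/regression
sums of squares in Pass 3.)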


*2. Use the map/reduce programming model for the Tasks.*
* - Think of chunks as the units of batching over the data.*

  public static class CalcSumsTask extends MRTask2<CalcSumsTask> {
    long _n;                     // Rows used
    double _sumX,_sumY,_sumX2;   // Sum of X's, Y's, X^2's
    @Override public void map( Chunk xs, Chunk ys ) {
      for( int i=0; i<xs._len; i++ ) {
        double X = xs.at0(i);
        double Y = ys.at0(i);
        if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
          _sumX += X;
          _sumY += Y;
          _sumX2+= X*X;
          _n++;
        }
      }
    }
    @Override public void reduce( CalcSumsTask lr1 ) {
      _sumX += lr1._sumX ;
      _sumY += lr1._sumY ;
      _sumX2+= lr1._sumX2;
      _n += lr1._n;
    }
  }


*3. High-level goals:*
* - Make the code read close to a math DSL - easier to recruit math
folks to debug or spot errors.*
* - Autogenerate JIT-friendly optimized code where needed.*
* - Minimize passes over data (see the sketch after this list).*
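
As an illustration of the last point, Pass 1 and Pass 2 above could be
fused into a single scan, since the centered sums can be recovered from
raw moments. A minimal sketch under the same MRTask2/Chunk API as above;
CalcMomentsTask is a hypothetical name, not part of LR2, and recovering
centered sums from raw moments is numerically less stable than the
explicit centering the two-pass version does:

  // Hypothetical single-pass variant: accumulate raw moments, then
  // recover the centered sums algebraically.
  public static class CalcMomentsTask extends MRTask2<CalcMomentsTask> {
    long _n;
    double _sumX, _sumY, _sumX2, _sumY2, _sumXY;
    @Override public void map( Chunk xs, Chunk ys ) {
      for( int i=0; i<xs._len; i++ ) {
        double X = xs.at0(i), Y = ys.at0(i);
        if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
          _sumX += X;  _sumY += Y;
          _sumX2 += X*X;  _sumY2 += Y*Y;  _sumXY += X*Y;
          _n++;
        }
      }
    }
    @Override public void reduce( CalcMomentsTask o ) {
      _sumX += o._sumX;  _sumY += o._sumY;
      _sumX2 += o._sumX2;  _sumY2 += o._sumY2;  _sumXY += o._sumXY;
      _n += o._n;
    }
  }
  // One pass instead of two:
  //   CalcMomentsTask mom = new CalcMomentsTask().doAll(vec_x, vec_y);
  //   meanX = mom._sumX/mom._n;            meanY = mom._sumY/mom._n;
  //   XXbar = mom._sumX2 - mom._n*meanX*meanX;
  //   YYbar = mom._sumY2 - mom._n*meanY*meanY;
  //   XYbar = mom._sumXY - mom._n*meanX*meanY;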

*4. Other best practices:*
* - Separate input and output data formats from the algorithm.*
* - Use primitives for better memory management.*
* - Generate JSON and HTML APIs for easy testing & usability.*
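
To make the first point concrete: because the tasks see only Vecs and
Chunks, they can be reused entirely outside the HTML/JSON request layer.
A minimal sketch, assuming a Frame fr has already been parsed into the
K/V store (the variable names and column positions here are
illustrative, not part of LR2):

  // Drive the same tasks without the Request/API layer:
  Vec x = fr.vecs()[0];   // predictor column
  Vec y = fr.vecs()[1];   // response column
  CalcSumsTask sums = new CalcSumsTask().doAll(x, y);
  double meanX = sums._sumX / sums._n;
  double meanY = sums._sumY / sums._n;
  CalcSquareErrorsTasks sq = new CalcSquareErrorsTasks(meanX, meanY).doAll(x, y);
  double beta1 = sq._XYbar / sq._XXbar;
  double beta0 = meanY - beta1 * meanX;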


[Reference]
https://github.com/0xdata/h2o/blob/master/src/main/java/hex/LR2.java

package hex;

import water.*;
import water.api.DocGen;
import water.fvec.*;
import water.util.RString;

public class LR2 extends Request2 {
  static final int API_WEAVER = 1; // This file has auto-gen'd doc & json fields
  static public DocGen.FieldDoc[] DOC_FIELDS; // Initialized from Auto-Gen code.

  // This Request supports the HTML 'GET' command, and this is the help text
  // for GET.
  static final String DOC_GET = "Linear Regression between 2 columns";

  @API(help="Data Frame", required=true, filter=Default.class)
  Frame source;

  @API(help="Column X", required=true, filter=LR2VecSelect.class)
  Vec vec_x;

  @API(help="Column Y", required=true, filter=LR2VecSelect.class)
  Vec vec_y;
  class LR2VecSelect extends VecSelect { LR2VecSelect() { super("source"); } }

  @API(help="Pass 1 msec") long pass1time;
  @API(help="Pass 2 msec") long pass2time;
  @API(help="Pass 3 msec") long pass3time;
  @API(help="nrows") long nrows;
  @API(help="beta0") double beta0;
  @API(help="beta1") double beta1;
  @API(help="r-squared") double r2;
  @API(help="SSTO") double ssto;
  @API(help="SSE") double sse;
  @API(help="SSR") double ssr;
  @API(help="beta0 Std Error") double beta0stderr;
  @API(help="beta1 Std Error") double beta1stderr;

  @Override public Response serve() {
    // Pass 1: compute sums & sums-of-squares
    long start = System.currentTimeMillis();
    CalcSumsTask lr1 = new CalcSumsTask().doAll(vec_x, vec_y);
    long pass1 = System.currentTimeMillis();
    pass1time = pass1 - start;
    nrows = lr1._n;

    // Pass 2: Compute squared errors
    final double meanX = lr1._sumX/nrows;
    final double meanY = lr1._sumY/nrows;
    CalcSquareErrorsTasks lr2 = new CalcSquareErrorsTasks(meanX, meanY).doAll(vec_x, vec_y);
    long pass2 = System.currentTimeMillis();
    pass2time = pass2 - pass1;
    ssto = lr2._YYbar;

    // Compute the regression
    beta1 = lr2._XYbar / lr2._XXbar;
    beta0 = meanY - beta1 * meanX;
    CalcRegressionTask lr3 = new CalcRegressionTask(beta0, beta1, meanY).doAll(vec_x, vec_y);
    long pass3 = System.currentTimeMillis();
    pass3time = pass3 - pass2;

    long df = nrows - 2;
    r2 = lr3._ssr / lr2._YYbar;
    double svar = lr3._rss / df;
    double svar1 = svar / lr2._XXbar;
    double svar0 = svar/nrows + meanX*meanX*svar1;
    beta0stderr = Math.sqrt(svar0);
    beta1stderr = Math.sqrt(svar1);
    sse = lr3._rss;
    ssr = lr3._ssr;

    return Response.done(this);
  }

  public static class CalcSumsTask extends MRTask2<CalcSumsTask> {
    long _n; // Rows used
    double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
    @Override public void map( Chunk xs, Chunk ys ) {
      for( int i=0; i<xs._len; i++ ) {
        double X = xs.at0(i);
        double Y = ys.at0(i);
        if( !Double.isNaN(X) && !Double.isNaN(Y)) {
          _sumX += X;
          _sumY += Y;
          _sumX2+= X*X;
          _n++;
        }
      }
    }
    @Override public void reduce( CalcSumsTask lr1 ) {
      _sumX += lr1._sumX ;
      _sumY += lr1._sumY ;
      _sumX2+= lr1._sumX2;
      _n += lr1._n;
    }
  }


  public static class CalcSquareErrorsTasks extends MRTask2<CalcSquareErrorsTasks> {
    final double _meanX, _meanY;
    double _XXbar, _YYbar, _XYbar;
    CalcSquareErrorsTasks( double meanX, double meanY ) { _meanX = meanX; _meanY = meanY; }
    @Override public void map( Chunk xs, Chunk ys ) {
      for( int i=0; i<xs._len; i++ ) {
        double Xa = xs.at0(i);
        double Ya = ys.at0(i);
        if(!Double.isNaN(Xa) && !Double.isNaN(Ya)) {
          Xa -= _meanX;
          Ya -= _meanY;
          _XXbar += Xa*Xa;
          _YYbar += Ya*Ya;
          _XYbar += Xa*Ya;
        }
      }
    }
    @Override public void reduce( CalcSquareErrorsTasks lr2 ) {
      _XXbar += lr2._XXbar;
      _YYbar += lr2._YYbar;
      _XYbar += lr2._XYbar;
    }
  }


  public static class CalcRegressionTask extends MRTask2<CalcRegressionTask> {
    final double _meanY;
    final double _beta0, _beta1;
    double _rss, _ssr;
    CalcRegressionTask(double beta0, double beta1, double meanY) { _beta0=beta0; _beta1=beta1; _meanY=meanY; }
    @Override public void map( Chunk xs, Chunk ys ) {
      for( int i=0; i<xs._len; i++ ) {
        double X = xs.at0(i); double Y = ys.at0(i);
        if( !Double.isNaN(X) && !Double.isNaN(Y) ) {
          double fit = _beta1*X + _beta0;
          double rs = fit-Y;
          _rss += rs*rs;
          double sr = fit-_meanY;
          _ssr += sr*sr;
        }
      }
    }

    @Override public void reduce( CalcRegressionTask lr3 ) {
      _rss += lr3._rss;
      _ssr += lr3._ssr;
    }
  }

  /** Return the query link to this page */
  public static String link(Key k, String content) {
    RString rs = new RString("<a
href='LR2.query?data_key=%$key'>%content</a>");
    rs.replace("key", k.toString());
    rs.replace("content", content);
    return rs.toString();
  }
}


thanks, Sri

On Fri, Mar 14, 2014 at 9:39 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Love the architectural discussion but sometimes the real answers can be
> hidden by minutiae.
>
> Dmitriy, is there enough running on Spark to compare to a DRM
> implementation on H2O? 0xdata, go ahead and implement DRM on H2O. If "the
> proof is in the pudding," why not compare?
>
> We really ARE betting Mahout on H2O, Ted. I don't buy your denial. If
> Mahout moves to another, faster and better execution engine, it will do so
> only once in the immediate future. The only real alternative to your
> proposal is a call to action for committers to move Mahout to Spark or
> another better-known engine. These will realistically never coexist.
>
>
> Some other concerns:
>
> If H2O is only 2x as fast as Mahout on Spark, I'd be dubious of adopting an
> unknown or unproven platform. The fact that it is custom-made for BD
> analytics is both good and bad. It means that expertise we develop for H2O
> may not be useful for other parallel computing problems. Also, it seems from
> the docs that the design point for 0xdata is not the same as Mahout's. 0xdata
> is trying to build a faster BD analytics platform (OLAP), not sparse-data
> machine learning in daily production. None of the things I use in Mahout
> are in 0xdata, I suspect because of this mismatch. It doesn't mean it won't
> work, but in lieu of the apples-to-apples comparison mentioned above it does
> worry me.
>
> On Mar 14, 2014, at 7:21 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > I think that the proposal under discussion involves adding a dependency
> > on a Maven-released h2o artifact plus a contribution of Mahout
> > translation layers.  These layers would give a sub-class of Matrix (and
> > Vector) that allows direct control over life span across multiple jobs
> > but would otherwise behave like its in-memory counterparts.
>
> Well, I suppose that means they have to live in some processes which are
> not processes I already have. And they have to be managed. So this is not
> just an in-core subsystem. Sounds like a new back to me.
>
> >>
> >> In Hadoop, every iteration must be scheduled as a separate job, rereads
> >> invariant data and materializes its result to hdfs. Therefore, iterative
> >> programs on Hadoop are an order of magnitude slower than on systems that
> >> have dedicated support for iterations.
> >>
> >> Does h2o help here or would we need to incorporate another system for
> >> such tasks?
> >>
> >
> > H2o helps here in a couple of different ways.
> >
> > The first and foremost is that primitive operations are easy.
> > Additionally, data elements can survive a single program's execution.
> > This means that programs can be executed one after another to get
> > composite effects.  This is astonishingly fast ... more along the speeds
> > one would expect from a single-processor program.
>
> I think the problem here is that the authors keep comparing these
> techniques to the slowest model available, which is Hadoop.
>
> But this is exactly the execution model of Spark. You get stuff repeatedly
> executed on in-memory partitions and get approximately the speed of
> iterative execution.  I won't describe it as astonishing, though, because
> indeed it is as fast as you can get things done in memory, no faster, no
> slower. That's, for example, the reason why my linalg optimizer does not
> hesitate to compute exact matrix geometry lazily if it's not known, for
> optimization purposes, because the answer will be back within 40 to 200
> ms, assuming adequate RAM allocation. I have been using these paradigms for
> more than a year now. This is all good stuff. I would not use the word
> astonishing, but sensible, yes. My main concern is whether the programming
> model is to be sacrificed just to do sensible things here.
>
> >
>
> >> (2) Efficient join implementations
> >>
> >> If we look at a lot of Mahout's algorithm implementations with a
> >> database hat on, then we see lots of hand-coded joins in our codebase,
> >> because Hadoop does not bring join primitives. This has lots of
> >> drawbacks, e.g. it complicates the codebase and leads to hardcoded join
> >> strategies that bake certain assumptions into the code (e.g. ALS uses a
> >> broadcast join, which assumes that one side fits into memory on each
> >> machine; RecommenderJob uses a repartition join, which is scalable but
> >> very slow for small inputs, ...).
> >>
>
> +1
>
> > I think that h2o provides this but I do not know in detail how.  I do
> > know that many of the algorithms already coded make use of matrix
> > multiplication, which is essentially a join operation.
>
> Essentially a join? The Spark module's optimizer picks from at least 3
> implementations: zip+combine, block-wise Cartesian, and finally, yes,
> join+combine. It depends on orientation and the earlier operators in the
> pipeline. That's exactly my point about the flexibility of the programming
> model from the optimizer's point of view.
>
> >
> >> Obviously, I'd love to get rid of hand-coded joins and implement ML
> >> algorithms (which is hard enough on its own). Other systems already help
> >> with this. Spark, for example, offers broadcast- and repartition-join
> >> primitives; Stratosphere has a join primitive and an optimizer that
> >> automatically decides which join strategy to use, as well as a highly
> >> optimized hybrid hash-join implementation that can gracefully go
> >> out-of-core under memory pressure.
> >>
> >
> > When you get into the realm of things on this level of sophistication, I
> > think that you have found the boundary where alternative foundations like
> > Spark and Stratosphere are better than h2o.  The novelty with h2o is the
> > hypothesis that a very large fraction of interesting ML algorithms can be
> > implemented without this power.  So far, this seems correct.
>
> Again, this is largely along the lines of "let's make a library of a few
> hand-optimized things", which is noble but -- I would argue -- not
> ambitious enough. Most distributed ML projects do just that. We should
> perhaps think about what could be a differentiating factor for us.
>
> Not that we should not care about performance. It should be, of course,
> *sensible*. (Our MR code base of course does not give us that; as you said,
> jumping off the MR wagon is not even a question.)
>
> If you can forgive me for drawing parallels here, it's the difference
> between something like Weka and R: collection vs. platform _and_ the
> collection induced by the platform. The platform of course also feeds
> directly into the speed of the collection's growth.
>
> When I use R, my code doesn't consist only of algorithm calls. That is,
> yes, there is off-the-shelf use now and then, but that is far from the
> only thing it is doing; 95% of it is simple feature massaging. I place no
> value in R for providing GLM for me. Gosh, that particular offering is
> available virtually anywhere these days.
>
> But I do place value in it for doing custom feature prep and, for example,
> for being able to get 100 grad students to try their own k-means
> implementations in seconds.
>
> Why?
>
> There has been a lot of talk here about building community, contributions,
> etc. A platform is what builds these, most directly and amazingly. I would
> go out on a limb here and say that Spark and MLlib are experiencing
> explosive growth of contributions not because they can do things with
> in-memory datasets (which is important, but like I said, has long since
> been viewed as no more than just sensible), but because of the clarity of
> the programming model. I think we have seen very solid evidence that the
> clarity and richness of the programming model is what attracts communities.
>
> If we grade roughly (very roughly!) what we have today, I can easily argue
> that acceptance levels follow the programming model very closely. E.g., if
> I try to sort projects with distributed programming models by (my
> subjectively perceived) popularity, from bottom to top:
>
> ********
>
> Hadoop MapReduce -- OK, I don't even know how to organize the critique
> here, too long a list; almost nobody (but Mahout) does things this way
> today. Certainly, neither of my last 2 employers did.
>
> Hive -- SQL-like with severely constrained general programming language
> capabilities, not conducive to batches. Pretty much limited to ad-hoc
> exploration.
>
> Pig -- a bit better, can write batches, but extra functionality mixins
> (UDFs) are still a royal pain.
>
> Cascading -- even easier: rich primitives, easy batches, some manual
> optimization of physical plan elements. One of the big cons is the
> limitation of a rigid dataset tuple structure.
>
> FlumeJava (Crunch in the Apache world) -- even better, but Java closures
> are just plain ugly, with zero "scriptability". Its community has been
> hurt a little by the fact that it was a bit late to the show compared to
> others (e.g. Cascading), but it leveled off quickly.
>
> Scala bindings for Cascading (Scalding) and FlumeJava -- better, hell, well
> better on the closure and FP front! But still, not being native to Scala
> from the get-go creates some miniature problems there.
>
> Spark -- I think it is fair to say it is the current community "king"
> above all those -- all the aforementioned platform model pains are
> eliminated, although on the performance side I think there are still some
> pockets for improvement on the cost-based optimization side of things.
>
> Stratosphere might be more interesting in this department, but I am not
> sure at this point whether that will necessarily translate into
> performance benefits for ML.
>
> ********
>
> The first few of these use the same computing model underneath and have
> essentially the same performance. Yet there's clear variation in community
> and acceptance.
>
> In the ML world, we are seeing approximately the same thing. The clearer
> the programming model and the easier the integration into the process, the
> wider the acceptance. I can probably argue pretty successfully that the
> most performant ML "thing" as it stands today is GraphLab. And it is
> pretty comprehensive in problem coverage (I think it covers, e.g.,
> recommender concerns better than h2o and Mahout together). But I can also
> argue pretty successfully that it is rejected a lot of the time for being
> just a collection (which, in addition, is hard to call from the JVM, i.e.
> integration again). It is actually so bad that people in my company would
> rather go back to 20 snow-wired R servers than even entertain an
> architecture including a GraphLab component. (Yes, the variance of this
> sample is as high as it gets; I'm just saying what I hear.)
>
> So as a general guideline to solve the current ills, it would stand to
> reason to adopt platform priority, with the algorithm collection as a
> function of that platform, rather than the collection as a function of a
> few dedicated efforts. Yes -- it has to be *sensibly* performant -- but
> this mostly does not have to be a concern of the code in this project
> directly. Rather, it has to be a concern of the backs (i.e. dependencies)
> and of our in-core support.
>
> Our pathological fear of being a performance scapegoat totally obscures
> the fact that performance is mostly a function of the back, and that we
> were riding the wrong back for a long time. As long as we don't cling to a
> particular back, it shouldn't be a problem. What would one rather accept:
> being initially 5x slower than GraphLab (but on par with MLlib) and
> beating these on community support, or being on par but anemic in
> community? If the 0x platform feels performance has been important enough
> to sacrifice the programming model for, why do they feel the need to join
> an Apache project? After all, they have been an open project for a long
> time already and have built their own community, big or small. Spark has
> just now become a top-level Apache project, joined the Apache incubator a
> mere 2 months ago, and did not have any trouble attracting a community
> outside Apache at all. Stratosphere is not even in Apache. Similarly, did
> it help Mahout to be in Apache to get anywhere close to these in community
> measures? So this totally refutes the argument that one has to be an
> Apache project to get one's exclusive qualities highlighted. Perhaps in
> the end it is more about the importance of those qualities to the
> community and the quality of contributions.
>
> A lot of this platform and programming-model priority is probably easier
> said than done, but some of the linalg and data frame things are
> ridiculously easy in terms of effort. If I could do a linalg optimizer
> with bindings for Spark on 2 nights a month, the same can be done for
> multiple backs and data frames in a jiffy. Well, the back should of course
> have a clear programming model as a prerequisite. Which brings us back to
> the issue of the richness of distributed primitives.
>
>


-- 
CEO & co-founder, 0xdata Inc <http://www.0xdata.com/>
+1-408.316.8192
