I can answer some of these. If we need more, others can chime in.
On Fri, Mar 14, 2014 at 12:13 AM, Sebastian Schelter <s...@apache.org> wrote:

> Hi,
>
> to me one problem is that I couldn't find documentation that gives a
> comprehensive picture of the programming and execution model of h2o.

Documentation is definitely lacking.

> I'd like to get answers to the following questions:
>
> Which operators does it offer, how are they combined to create programs,
> and how are those programs executed on the cluster?

I think that Cliff's email described the operators in terms of programming
constructs. They are not operators in the dataflow sense, but more of a
fast fork/join programming primitive (at the lowest level).

> Atm, it seems to me that h2o can do the following things on partitioned,
> columnar stored data:
>
> * execute computations independently on all partitions (-> a map
>   operation)

Yes.

> * compute a global aggregate from the results of the map (-> a functional
>   reduce operation (not the Hadoop one))

Yes.

> * compute aggregates of grouped data, as long as the aggregates fit into
>   main memory (-> Hadoop's reduce operation)

Yes. But also, because partitions can be accessed from anywhere at moderate
cost, you can get a fairly efficient equivalent of shuffles.

> One caveat though is that there are more aspects that a system suitable
> for machine learning should offer, and I'm still trying to understand how
> much we'd get from h2o in those terms.
>
> (1) Efficient execution of iterative programs.
>
> In Hadoop, every iteration must be scheduled as a separate job, rereads
> invariant data and materializes its result to hdfs. Therefore, iterative
> programs on Hadoop are an order of magnitude slower than on systems that
> have dedicated support for iterations.
>
> Does h2o help here or would we need to incorporate another system for
> such tasks?

H2o helps here in a couple of different ways. The first and foremost is
that primitive operations are easy to chain within a single program.
Moreover, these operations can be chained using conventional
function/method structuring and conventional looping. This makes it easier
to code. Additionally, data elements can survive a single program's
execution. This means that programs can be executed one after another to
get composite effects. This is astonishingly fast ... more along the speeds
one would expect from a single-processor program. Both of these help with
iteration on data. The first helps with iteration as in a Lanczos
algorithm. The second helps with iteration where data is read, then reduced
to a nominal form, then a model is derived, then new data is scored, and so
on.
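To make the "chain passes with ordinary loops" point concrete, here is a
rough single-node sketch of the style. It uses plain java.util.concurrent
fork/join over an in-memory double[] as a stand-in for H2O's distributed
vectors, so read it as the shape of the pattern (each pass is a map/reduce,
and a normal for-loop chains the passes, power-iteration style), not as
actual H2O API:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    public class IterationSketch {
      // One "pass": map each element to its square, reduce by summing.
      static class SumSq extends RecursiveTask<Double> {
        final double[] v; final int lo, hi;
        SumSq(double[] v, int lo, int hi) { this.v = v; this.lo = lo; this.hi = hi; }
        @Override protected Double compute() {
          if (hi - lo < 1024) {                    // small range: do the "map" directly
            double s = 0;
            for (int i = lo; i < hi; i++) s += v[i] * v[i];
            return s;
          }
          int mid = (lo + hi) >>> 1;               // otherwise fork left/right halves
          SumSq left = new SumSq(v, lo, mid), right = new SumSq(v, mid, hi);
          left.fork();
          return right.compute() + left.join();    // "reduce": combine the halves
        }
      }

      public static void main(String[] args) {
        double[] x = new double[1 << 20];
        java.util.Arrays.fill(x, 1.0);
        ForkJoinPool pool = new ForkJoinPool();
        for (int iter = 0; iter < 10; iter++) {    // conventional looping chains the passes
          double norm = Math.sqrt(pool.invoke(new SumSq(x, 0, x.length)));
          for (int i = 0; i < x.length; i++) x[i] /= norm;  // next pass reuses in-memory data
        }
        System.out.println(x[0]);
      }
    }

In h2o the per-pass body looks much like the MRTask example in Cliff's
email quoted below, except that the data it loops over is the cluster-wide
distributed array rather than a local one.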
> (2) Efficient join implementations
>
> If we look at a lot of Mahout's algorithm implementations with a database
> hat on, then we see lots of handcoded joins in our codebase, because
> Hadoop does not bring join primitives. This has lots of drawbacks, e.g. it
> complicates the codebase and leads to hardcoded join strategies that bake
> certain assumptions into the code (e.g. ALS uses a broadcast-join which
> assumes that one side fits into memory on each machine, RecommenderJob
> uses a repartition-join which is scalable but very slow for small
> inputs,...).

I think that h2o provides this but do not know in detail how. I do know
that many of the algorithms already coded make use of matrix
multiplication, which is essentially a join operation.

> Obviously, I'd love to get rid of handcoded joins and implement ML
> algorithms (which is hard enough on its own). Other systems help with
> this already. Spark, for example, offers broadcast and repartition-join
> primitives, Stratosphere has a join primitive and an optimizer that
> automatically decides which join strategy to use, as well as a highly
> optimized hybrid hashjoin implementation that can gracefully go
> out-of-core under memory pressure.

When you get into the realm of things on this level of sophistication, I
think that you have found the boundary where alternative foundations like
Spark and Stratosphere are better than h2o. The novelty with h2o is the
hypothesis that a very large fraction of interesting ML algorithms can be
implemented without this power. So far, this seems correct.

> I have a few more things in mind, but I guess these are the most
> important ones right now.
>
> Thx,
> Sebastian
>
>
> On 03/14/2014 07:41 AM, Sri wrote:
>
>> Dmitriy,
>>
>> H2O is about bringing better algorithms to big data and now to Mahout.
>> Users will be able to use, access and extend the sophisticated,
>> high-precision and richly featured algorithms. We are a bit at a loss
>> (puzzled) at the comparisons with general-purpose computing platforms -
>> our core vision and philosophy is around scaling advanced algorithms
>> (and we should use the best-of-breed computation platform for the
>> problem at hand.)
>>
>> Sri
>>
>> On Mar 13, 2014, at 18:08, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>>> Thank you, Cliff.
>>>
>>> Those things are pretty much clear. Most of the questions were more
>>> along the lines of which of those wonderful things you intend to port
>>> to Mahout, and how you see these stitching in with the existing Mahout
>>> architecture.
>>>
>>> At least one of your users here reported it does not make sense to run
>>> Mahout on all this, and at least two of us have trouble seeing how such
>>> disassembly and reassembly might take place. What are your thoughts on
>>> this? How clearly do you see the reintegration roadmap?
>>>
>>> Do you intend to keep the h2o platform around as a standalone project?
>>>
>>> Do you intend to contribute top-level algorithms as well? What are your
>>> thoughts on interoperability of top-level algorithms with other
>>> memory-based backends?
>>>
>>> Thank you.
>>>
>>> On Mar 13, 2014 3:50 PM, "Cliff Click" <ccli...@gmail.com> wrote:
>>>
>>>> There have been a lot of questions on the H2O architecture; I hope to
>>>> answer the top-level ones here.
>>>>
>>>> H2O is a fast & flexible engine. We talk about the MapReduce execution
>>>> flavor, because it's easy to explain, because it covers a lot of
>>>> ground, and because we've implemented a bunch of dense linear-algebra
>>>> style algorithms with it - but that's not the only thing we can do
>>>> with H2O, nor is it the only coding "style".
>>>>
>>>> H2O is based on a number of layers, and is coded at different layers
>>>> to best approach different tasks and objectives.
>>>>
>>>>  * *In-memory K/V store layer*: H2O sports an in-memory
>>>>    (not-persistent) K/V store, with *exact* (not lazy) consistency
>>>>    semantics and transactions. Both reads and writes are fully
>>>>    (locally) cachable. Typical cache-hit latencies for both are around
>>>>    150ns (that's *nanoseconds*) from a NonBlockingHashMap. Let me
>>>>    repeat that: reads and writes go through a non-blocking hash table
>>>>    - we do NOT suffer (CANNOT suffer) from a hot-blocks problem.
>>>>    Cache-misses obviously require a network hop, and the execution
>>>>    times are totally driven by the size of data moved divided by
>>>>    available bandwidth...
>>>>    and of course the results are cached. The K/V store is currently
>>>>    used to hold control state, all results, and of course the Big Data
>>>>    itself. You could certainly build a dandy graph-based algorithm
>>>>    directly over the K/V store; that's been on our long-term roadmap
>>>>    for awhile.
>>>>
>>>>  * *A columnar-compressed distributed Big Data store layer*: Big Data
>>>>    is heavily (and losslessly) compressed - typically 2x to 4x better
>>>>    than GZIP on disk (YMMV) - and can be accessed like a Java array: a
>>>>    giant, greater-than-4-billion-element distributed Java array. H2O
>>>>    guarantees that if the data is accessed linearly then the access
>>>>    time will match what you can get out of C or Fortran - i.e., be
>>>>    memory-bandwidth bound, not CPU bound. You can access the array
>>>>    (for both reads and writes) in any order, of course, but you get
>>>>    strong speed guarantees for accessing in-order. You can do pretty
>>>>    much anything to an H2O array that you can do with a Java array,
>>>>    although due to size/scale you'll probably want to access the array
>>>>    in a blatantly parallel style.
>>>>    o *A note on compression*: The data is decompressed Just-In-Time,
>>>>      strictly in CPU registers, in the hot inner loops - and THIS IS
>>>>      FASTER than decompressing beforehand, because most algorithms are
>>>>      memory-bandwidth bound. Moving a 32-byte cacheline of compressed
>>>>      data into CPU registers gets more data per cache-miss than moving
>>>>      4 8-byte doubles. Decompression typically takes 2-4 instructions
>>>>      of shift/scale/add per element, and is well covered by the
>>>>      cache-miss costs.
>>>>    o *A note on Big Data and GC*: H2O keeps all our data *in heap*,
>>>>      but in large arrays of Java primitives. Our experience shows that
>>>>      we run well, without GC issues, even *on very large heaps with
>>>>      the default collector*. We routinely test with e.g. heaps from 2G
>>>>      to 200G - and never see FullGC costs exceed a few seconds every
>>>>      now and then (depends on the rate of Big Data writing going on).
>>>>      The normal Java object allocation used to drive the system
>>>>      internally has a negligible GC load. We keep our data in-heap
>>>>      because it's as fast as possible (memory-bandwidth limited), easy
>>>>      to code (pure Java), and has no interesting GC costs. Our GC
>>>>      tuning policy is: "only use the -Xmx flag, set to the largest you
>>>>      can allow given the machine resources". Take all the other GC
>>>>      defaults; they will work fine.
>>>>    o *A note on Bigger Data (and GC)*: We do a user-mode swap-to-disk
>>>>      when the Java heap gets too full, i.e., you're using more Big
>>>>      Data than physical DRAM. We won't die with a GC death-spiral, but
>>>>      we will degrade to out-of-core speeds. We'll go as fast as the
>>>>      disk will allow.
>>>>    o *A note on data ingest*: We read data fully parallelized from S3,
>>>>      HDFS, NFS, URIs, browser uploads, etc. We can typically drive
>>>>      HDFS disk spindles to an interesting fraction of what you can get
>>>>      from e.g. HDFS file-copy. We parse & compress (in parallel) a
>>>>      very generous notion of a CSV file (for instance, Hive files are
>>>>      directly ingestable), and SVM light files. We are planning on an
>>>>      RDD ingester - interoperability with other frameworks is in
>>>>      everybody's interest.
>>>>    o *A note on sparse data*: H2O sports about 15 different
>>>>      compression schemes under the hood, including ones designed to
>>>>      compress sparse data.
>>>>      We happily import SVMLight without ever having the data "blow
>>>>      up", while still fully supporting the array-access API, including
>>>>      the speed guarantees.
>>>>    o *A note on missing data*: Most datasets have *missing* elements,
>>>>      and most math algorithms deal with missing data specially. H2O
>>>>      fully supports a notion of "NA" for all data, including setting,
>>>>      testing, selecting in (or out), etc., and this notion is woven
>>>>      through the data presentation layer.
>>>>    o *A note on streaming data*: H2O vectors can have data inserted &
>>>>      removed (anywhere, in any order) continuously. In particular,
>>>>      it's easy to add new data at the end and remove it from the start
>>>>      - i.e., build a large rolling dataset holding all the elements
>>>>      that fit given a memory budget and a data flow-rate. This has
>>>>      been on our roadmap for awhile, and needs only a little more work
>>>>      to be fully functional.
>>>>
>>>>  * *Light-weight Map/Reduce layer*: Map/Reduce is a nice way to write
>>>>    blatantly parallel code (although not the only way), and we support
>>>>    a particularly fast and efficient flavor. A Map maps Type A to Type
>>>>    B, and a Reduce combines two Type B's into one Type B. Both Types A
>>>>    & B can be a combination of small-data (described as a Plain Old
>>>>    Java Object, a POJO) and big-data (described as another giant H2O
>>>>    distributed array). Here's an example map from a Type A (a pair of
>>>>    columns) to a Type B (a POJO of class MyMR holding various sums):
>>>>
>>>>      class MyMR extends MRTask {
>>>>        double sum0, sum1, sq_sum0;                  // Most things are allowed here
>>>>        @Override public void map( double d0, double d1 ) {
>>>>          sum0 += d0; sum1 += d1; sq_sum0 += d0*d0;  // Again, most any Java code here
>>>>        }
>>>>        @Override public void reduce( MyMR my ) {    // Combine two MyMRs together
>>>>          sum0 += my.sum0; sum1 += my.sum1; sq_sum0 += my.sq_sum0;
>>>>        }
>>>>      }
>>>>      new MyMR().doAll( v0, v1 );                    // Invoke in-parallel, distributed
>>>>
>>>>    This code will be distributed 'round the cluster, and run at
>>>>    memory-bandwidth speeds (on compressed data!) with no further ado.
>>>>    There's a lot of mileage possible here that I'm only touching
>>>>    lightly on. Filtering, subsetting, writing results into temp arrays
>>>>    that are used on later passes, uniques on billions of rows, and
>>>>    ddply-style group-by operations all work in this Map/Reduce
>>>>    framework - and all work by writing plain old Java.
>>>>    o *Scala, and a note on API cleanliness*: We fully acknowledge
>>>>      Java's weaknesses here - this is the Java6-flavor coding style;
>>>>      Java7 style is nicer - but still not as nice as some other
>>>>      languages. We fully embrace & support alternative syntaxes over
>>>>      our engine. In particular, we have an engineer working on
>>>>      in-process Scala (amongst other) interfaces. We are shifting our
>>>>      focus now from the excellent backend to the API interface side of
>>>>      things. This is a work-in-progress for us, and we are looking
>>>>      forward to much improvement over the next year.
>>>>
>>>>  * *Pre-Baked Algorithms Layer*: We have the following algorithms
>>>>    pre-baked, fully optimized and full-featured: Generalized Linear
>>>>    Modeling, including Logistic Regression plus Gaussian, Gamma,
>>>>    Poisson, and Tweedie distributions. Neural Nets. Random Forest
>>>>    (that scales *out* to all the data in the cluster). Gradient
>>>>    Boosted Machine (again, in-parallel & fully distributed). PCA.
>>>>    KMeans (& variants).
>>>>    Quantiles (any quantile, computed *exactly* in milliseconds). All
>>>>    these algorithms support Confusion Matrices (with adjustable
>>>>    thresholds), AUC & ROC metrics, and incremental test-data-set
>>>>    results on partially trained models during the build process.
>>>>    Within each algorithm, we support a full range of options that
>>>>    you'd find in the similar R or SAS package.
>>>>    o *A note on some Mahout algorithms*: We're clearly well suited to
>>>>      e.g. SSVD and Co-occurrence, and have talked with Ted Dunning at
>>>>      length on how they would be implemented in H2O.
>>>>
>>>>  * *REST / JSON / R / Python / Excel / REPL*: The system is externally
>>>>    drivable via URLs/REST API calls, with JSON responses. We use
>>>>    REST/JSON from Python to drive our whole testing harness. We have a
>>>>    very nice R package with H2O integrated behind R - you can issue R
>>>>    commands to an H2O-backed R "data.frame" - and have all the Big
>>>>    Math work on the Big Data in a cluster - including 90% of the
>>>>    typical "data munging" workflow. This same REST/JSON interface also
>>>>    works with e.g. Excel (yes, we have a demo) or shell scripts. We
>>>>    have an R-like language REPL. We have a pretty web GUI over the
>>>>    REST/JSON layer that is suitable for lightweight modeling tasks.
>>>>
>>>> Cliff
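P.S. One small addendum on the "decompress just-in-time in registers" note
in Cliff's email above, since that is the part people usually raise an
eyebrow at: the arithmetic only works because the per-element decode is a
couple of instructions. A toy scale/bias decode looks roughly like the
following (purely illustrative - I am not claiming this is H2O's actual
encoding, and the constants are made up):

    public class DecodeSketch {
      public static void main(String[] args) {
        // A column whose values fit in one byte after a scale/bias encoding
        // is stored as byte[], so a 32-byte cacheline carries 32 elements
        // instead of 4 uncompressed doubles.
        byte[] chunk = { 0, 5, 10, 15 };        // stored (compressed) column chunk
        double scale = 0.25, bias = 100.0;      // per-chunk decode constants
        double sum = 0;
        for (int i = 0; i < chunk.length; i++) {
          double d = (chunk[i] & 0xFF) * scale + bias;  // decode in registers: scale + add
          sum += d;                                     // the hot loop's real work
        }
        System.out.println(sum);                // prints 407.5
      }
    }

If the per-element decode really is this cheap, the memory-bandwidth-bound
argument follows directly: you pay for the cache misses, not for the
shift/scale/add.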