There have been a lot of questions about the H2O architecture; I hope to answer the top-level ones here.

H2O is a fast & flexible engine. We talk about the MapReduce execution flavor because it's easy to explain, because it covers a lot of ground, and because we've implemented a bunch of dense linear-algebra style algorithms with it - but that's not the only thing we can do with H2O, nor is it the only coding "style".


H2O is based on a number of layers, and is coded at different layers to best suit different tasks and objectives.

 * *In-memory K/V store layer*: H2O sports an in-memory
   (non-persistent) K/V store, with **exact** (not lazy) consistency
   semantics and transactions.  Both reads and writes are fully
   (locally) cacheable.  Typical cache-hit latencies for both are
   around 150ns (that's **nanoseconds**) from a NonBlockingHashMap.
   Let me repeat that: reads and writes go through a non-blocking
   hash table - we do NOT suffer (CANNOT suffer) from a hot-blocks
   problem.  Cache-misses obviously require a network hop, and the
   execution times are totally driven by the size of data moved
   divided by available bandwidth... and of course the results are
   cached.  The K/V store is currently used to hold control state,
   all results, and of course the Big Data itself.  You could
   certainly build a dandy graph-based algorithm directly over the
   K/V store; that's been on our long-term roadmap for a while.
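
   To make the cache-hit path concrete, here is a minimal sketch in
   plain Java.  It uses the high-scale-lib NonBlockingHashMap (the
   same class named above); the String/byte[] key and value types
   and the fetchRemoteAndCache helper are hypothetical placeholders,
   not H2O's real classes.

   import java.util.concurrent.ConcurrentMap;
   import org.cliffc.high_scale_lib.NonBlockingHashMap;

   // Toy stand-in for the local side of a distributed K/V store.
   // Cache hits resolve entirely in this lock-free hash table; a
   // real implementation would fall through to the network on miss.
   public class LocalKVSketch {
     private final ConcurrentMap<String,byte[]> _cache =
       new NonBlockingHashMap<>();

     public byte[] get( String key ) {
       byte[] val = _cache.get(key);     // lock-free read: ~150ns on a hit
       if( val == null )
         val = fetchRemoteAndCache(key); // miss: one network hop, then cached
       return val;
     }

     public void put( String key, byte[] val ) {
       _cache.put(key,val);              // lock-free write: no hot-blocks problem
       // A real store would also propagate the write for consistency.
     }

     private byte[] fetchRemoteAndCache( String key ) {
       byte[] val = new byte[0];         // placeholder for the remote fetch
       _cache.putIfAbsent(key,val);
       return _cache.get(key);
     }
   }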

 * *A columnar-compressed distributed Big Data store layer*: Big Data
   is heavily (and losslessly) compressed - typically 2x to 4x better
   than GZIP on disk (YMMV) - and can be accessed like a Java array:
   a giant, greater-than-4-billion-element distributed Java array.
   H2O guarantees that if the data is accessed linearly then the
   access time will match what you can get out of C or Fortran -
   i.e., be memory-bandwidth bound, not CPU bound.  You can access
   the array (for both reads and writes) in any order, of course, but
   you get strong speed guarantees for accessing in-order.  You can
   do pretty much anything to an H2O array that you can do with a
   Java array, although due to size/scale you'll probably want to
   access the array in a blatantly parallel style.  (A sketch of this
   access style follows the notes below.)
     o */A note on compression/*: The data is decompressed Just-In-Time
       strictly in CPU registers in the hot inner loops - and THIS IS
       FASTER than decompressing beforehand, because most algorithms are
       memory-bandwidth bound.  Moving a 32-byte cache line of compressed
       data into CPU registers gets more data per cache-miss than
       moving 4 8-byte doubles.  Decompression typically takes 2-4
       instructions of shift/scale/add per element, and is well covered
       by the cache-miss costs.
     o */A note on Big Data and GC/*: H2O keeps all our data **in
       heap**, but in large arrays of Java primitives.  Our experience
       shows that we run well, without GC issues, even *on very large
       heaps with the default collector*.  We routinely test with e.g.
       heaps from 2G to 200G - and never see FullGC costs exceed a few
       seconds every now and then (depending on the rate of Big Data
       writing going on).  The normal Java object allocation used to
       drive the system internally has a negligible GC load.  We keep
       our data in-heap because it's as fast as possible (memory
       bandwidth limited), easy to code (pure Java), and has no
       interesting GC costs.  Our GC tuning policy is: "only use the
       -Xmx flag, set to the largest you can allow given the machine
       resources".  Take all the other GC defaults; they will work fine.
     o */A note on Bigger Data (and GC)/*: We do a user-mode
       swap-to-disk when the Java heap gets too full, i.e., you're
       using more Big Data than physical DRAM.  We won't die with a GC
       death-spiral, but we will degrade to out-of-core speeds. We'll
       go as fast as the disk will allow.
     o */A note on data ingest/*: We read data fully parallelized
       from S3, HDFS, NFS, URIs, browser uploads, etc.  We can
       typically drive HDFS disk spindles to an interesting fraction of
       what you can get from e.g. HDFS file-copy.  We parse & compress
       (in parallel) a very generous notion of a CSV file (for
       instance, Hive files are directly ingestable), and SVMLight
       files.  We are planning on an RDD ingester - interoperability
       with other frameworks is in everybody's interest.
     o */A note on sparse data/*: H2O sports about 15 different
       compression schemes under the hood, including ones designed to
       compress sparse data.  We happily import SVMLight without ever
       having the data "blow up", while still fully supporting the
       array-access API, including speed guarantees.
     o */A note on missing data/*: Most datasets have *missing*
       elements, and most math algorithms deal with missing data
       specially.  H2O fully supports a notion of "NA" for all data,
       including setting, testing, selecting in (or out), etc., and
       this notion is woven through the data presentation layer.
     o */A note on streaming data/*: H2O vectors can have data inserted
       & removed (anywhere, in any order) continuously.  In particular,
       it's easy to add new data at the end and remove it from the
       start - i.e., build a large rolling dataset holding all the
       elements that fit given a memory budget and a data flow-rate.
       This has been on our roadmap for a while, and needs only a
       little more work to be fully functional.
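
   As promised above, here is a minimal sketch of the array-access
   style, in plain Java.  The tiny Vec interface below (length / at /
   isNA) is declared here only so the sketch stands alone; the real
   H2O class is richer and its exact signatures may differ.

   // Hypothetical minimal view of H2O's distributed array API.
   interface Vec {
     long length();             // number of elements (can exceed 4 billion)
     boolean isNA( long row );  // NA is first-class: test and skip
     double at( long row );     // decode one element just-in-time, in-register
   }

   class LinearScan {
     // In-order access like this runs at memory-bandwidth speeds:
     // each element is decoded by a few shift/scale/add instructions,
     // never materialized as an uncompressed copy.
     static double mean( Vec vec ) {
       double sum = 0;  long rows = 0;
       for( long i = 0; i < vec.length(); i++ ) {
         if( vec.isNA(i) ) continue;    // skip missing elements
         sum += vec.at(i);
         rows++;
       }
       return sum / rows;
     }
   }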

 * */Light-weight Map/Reduce layer/*: Map/Reduce is a nice way to write
   blatantly parallel code (although not the only way), and we support
   a particularly fast and efficient flavor.  A Map maps Type A to Type
   B, and a Reduce combines two Type B's into one Type B.  Both Types A
   & B can be a combination of small-data (described as a Plain Old
   Java Object, a POJO) and big-data (described as another Giant H2O
   distributed array). Here's an example map from a Type A (a pair of
   columns) to a Type B (a POJO of class MyMR holding various sums):

   class MyMR extends MRTask<MyMR> {
     double sum0, sum1, sq_sum0;                  // most things are allowed here
     @Override public void map( double d0, double d1 ) {
       sum0 += d0;  sum1 += d1;  sq_sum0 += d0*d0;  // again, most any Java code here
     }
     @Override public void reduce( MyMR my ) {    // combine two MyMRs together
       sum0 += my.sum0;  sum1 += my.sum1;  sq_sum0 += my.sq_sum0;
     }
   }
   new MyMR().doAll( v0, v1 );                    // invoke in-parallel, distributed

   This code will be distributed 'round the cluster, and run at
   memory-bandwidth speeds (on compressed data!) with no further ado.
   There's a lot of mileage possible here that I'm only touching on
   lightly.  Filtering, subsetting, writing results into temp arrays
   used on later passes, uniques on billions of rows, and ddply-style
   group-by operations all work in this Map/Reduce framework - and
   all work by writing plain old Java.  (A small filtering sketch
   follows the note below.)
     o */Scala, and a note on API cleanliness/*: We fully acknowledge
       Java's weaknesses here - this is the Java6-flavor coding style;
       Java7 style is nicer - but still not as nice as some other
       languages.  We fully embrace & support alternative syntaxes
       over our engine.  In particular, we have an engineer working on
       an in-process Scala interface (amongst others).  We are
       shifting our focus now, from the excellent backend to the API
       side of things.  This is a work-in-progress for us, and we are
       looking forward to much improvement over the next year.
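
   To give a feel for the filtering claims above, here is one more
   sketch in the same Map/Reduce style.  As with the example above,
   the map(double,double) and reduce signatures follow this post's
   example rather than a verified API, and CountGreater is a
   hypothetical name.

   // Count rows where the first column exceeds the second.  Only the
   // two long fields travel over the wire during the reduce.
   class CountGreater extends MRTask<CountGreater> {
     long hits, total;
     @Override public void map( double d0, double d1 ) {
       if( d0 > d1 ) hits++;
       total++;
     }
     @Override public void reduce( CountGreater other ) {
       hits  += other.hits;
       total += other.total;
     }
   }
   // Usage, given two parsed H2O vectors v0 and v1:
   //   CountGreater cg = new CountGreater().doAll(v0,v1);
   //   double fraction = (double)cg.hits / cg.total;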

 * *Pre-Baked Algorithms Layer*: We have the following algorithms
   pre-baked, fully optimized and full-featured: Generalized Linear
   Modeling, including Logistic Regression plus Gaussian, Gamma,
   Poisson, and Tweedie distributions; Neural Nets; Random Forest
   (which scales *out* to all the data in the cluster); Gradient
   Boosted Machine (again, in-parallel & fully distributed); PCA;
   KMeans (& variants); and Quantiles (any quantile, computed
   *exactly* in milliseconds).  All these algorithms support
   Confusion Matrices (with adjustable thresholds), AUC & ROC
   metrics, and incremental test data-set results on partially
   trained models during the build process.  Within each algorithm,
   we support a full range of options that you'd find in a similar R
   or SAS package.
     o */A note on some Mahout algorithms/*: We're clearly well suited
       to e.g. SSVD and Co-occurrence, and have talked with Ted Dunning
       at length about how they would be implemented in H2O.

 * *REST / JSON / R / python / Excel / REPL*: The system is externally
   drivable via URLs/REST API calls, with JSON responses.  We use
   REST/JSON from Python to drive our entire testing harness.  We
   have a very nice R package with H2O integrated behind R - you can
   issue R commands to an H2O-backed R "data.frame" - and have all
   the Big Math work on the Big Data in a cluster - including 90% of
   the typical "data munging" workflow.  This same REST/JSON
   interface also works with e.g. Excel (yes, we have a demo) or
   shell scripts.  We have an R-like language REPL.  We have a pretty
   web-GUI over the REST/JSON layer that is suitable for lightweight
   modeling tasks.  (A bare-bones REST sketch follows.)
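
   As a taste of driving the REST/JSON layer, here is a minimal Java
   client.  Port 54321 is H2O's default; the /Cloud.json route is a
   from-memory placeholder - check the REST API docs of your H2O
   build for the real endpoint names.

   import java.io.BufferedReader;
   import java.io.InputStreamReader;
   import java.net.HttpURLConnection;
   import java.net.URL;

   // GET a status page from a running H2O node and print the JSON.
   public class RestSketch {
     public static void main( String[] args ) throws Exception {
       URL url = new URL("http://localhost:54321/Cloud.json"); // placeholder route
       HttpURLConnection conn = (HttpURLConnection)url.openConnection();
       conn.setRequestMethod("GET");
       BufferedReader in = new BufferedReader(
         new InputStreamReader(conn.getInputStream()));
       String line;
       while( (line = in.readLine()) != null )
         System.out.println(line);   // raw JSON describing the cluster
       in.close();
     }
   }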


Cliff

