There have been a lot of questions about the H2O architecture; I hope to
answer the top-level ones here.
H2O is a fast & flexible engine. We talk about the MapReduce execution
flavor because it's easy to explain, because it covers a lot of ground,
and because we've implemented a bunch of dense linear-algebra style
algorithms with it - but that's not the only thing we can do with H2O,
nor is it the only coding "style".
H2O is built as a number of layers, and we code at different layers
to best approach different tasks and objectives.
* *In-memory K/V store layer*: H2O sports an in-memory
  (non-persistent) K/V store, with **exact** (not lazy)
  consistency semantics and transactions. Both reads and writes are
  fully (locally) cacheable. Typical cache-hit latencies for both are
  around 150ns (that's **nanoseconds**) from a NonBlockingHashMap.
Let me repeat that: reads and writes go through a non-blocking hash
table - we do NOT suffer (CANNOT suffer) from a hot-blocks problem.
Cache-misses obviously require a network hop, and the execution
times are totally driven by the size of data moved divided by
available bandwidth... and of course the results are cached. The
K/V store is currently used to hold control state, all results, and of
course the Big Data itself. You could certainly build a dandy
graph-based algorithm directly over the K/V store; that's been on
our long-term roadmap for a while.
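For the curious, the access pattern is roughly the following - a
minimal sketch only, assuming the water.Key / DKV / Value classes
(exact signatures vary by H2O version):

    // Minimal sketch of distributed K/V access. Assumes H2O's water.Key /
    // DKV / Value classes; exact signatures vary by version.
    import water.DKV;
    import water.Iced;
    import water.Key;
    import water.Value;

    public class KVSketch {
      static void roundTrip( Iced pojo ) {
        Key k = Key.make("my-thing");  // a cluster-wide name
        DKV.put(k, pojo);              // write: local NonBlockingHashMap first,
                                       //        then made consistent cluster-wide
        Value v = DKV.get(k);          // read: ~150ns on a local cache hit,
                                       //       a network hop on a miss
        Iced back = v.get();           // inflate the stored POJO
      }
    }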
* *A columnar-compressed distributed Big Data store layer*: Big Data
  is heavily (and losslessly) compressed - typically 2x to 4x better
  than GZIP on disk (YMMV) - and can be accessed like a Java array:
  a giant, greater-than-4-billion-element distributed Java array. H2O
guarantees that if the data is accessed linearly then the access
time will match what you can get out of C or Fortran - i.e., be
memory bandwidth bound, not CPU bound. You can access the array
(for both reads and writes) in any order, of course, but you get
strong speed guarantees for accessing in-order. You can do pretty
much anything to an H2O array that you can do with a Java array,
although due to size/scale you'll probably want to access the array
in a blatantly parallel style. (A short element-access sketch follows
the notes below.)
o */A note on compression/*: The data is decompressed Just-In-Time
strictly in CPU registers in the hot inner loops - and THIS IS
FASTER than decompressing beforehand because most algorithms are
memory bandwidth bound. Moving a 32-byte cache line of compressed
data into CPU registers gets more data per cache-miss than
moving four 8-byte doubles. Decompression typically takes 2-4
instructions of shift/scale/add per element, and is well covered
by the cache-miss costs.
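To make "2-4 instructions of shift/scale/add" concrete, here is a
purely illustrative decode of a byte-compressed element - a sketch of
the idea only, not H2O's actual encoding (which varies per chunk):

    // Illustrative only: decoding a byte-compressed numeric column with a
    // hypothetical per-chunk scale and bias - a couple of ops per element.
    public class DecodeSketch {
      public static void main( String[] args ) {
        byte[] compressed = { 0, 10, 25, 100 };  // 1 byte per element, not an 8-byte double
        double scale = 0.5, bias = 100.0;        // hypothetical per-chunk constants
        double sum = 0;
        for( int i = 0; i < compressed.length; i++ ) {
          double d = (compressed[i] & 0xFF) * scale + bias;  // decode in registers
          sum += d;                                          // ...hot-loop work...
        }
        System.out.println("sum = " + sum);
      }
    }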
o */A note on Big Data and GC/*: H2O keeps all our data **in
heap**, but in large arrays of Java primitives. Our experience
shows that we run well, without GC issues, even *on very large
heaps with the default collector*. We routinely test with e.g.
heaps from 2G to 200G - and never see FullGC costs exceed a few
seconds every now and then (depends on the rate of Big Data
writing going on). The normal Java object allocation used to
drive the system internally has a negligible GC load. We keep
our data in-heap because it's as fast as possible (memory
bandwidth limited), easy to code (pure Java), and has no
interesting GC costs. Our GC tuning policy is: "only use the
-Xmx flag, set to the largest you can allow given the machine
resources". Take all the other GC defaults; they will work fine.
o */A note on Bigger Data (and GC)/*: We do a user-mode
swap-to-disk when the Java heap gets too full, i.e., you're
using more Big Data than physical DRAM. We won't die with a GC
death-spiral, but we will degrade to out-of-core speeds. We'll
go as fast as the disk will allow.
o */A note on data ingest/*: We read data fully parallelized
from S3, HDFS, NFS, URIs, browser uploads, etc. We can
typically drive HDFS disk spindles to an interesting fraction of
what you can get from e.g. an HDFS file-copy. We parse & compress
(in parallel) a very generous notion of a CSV file (for
instance, Hive files are directly ingestable), as well as SVMLight
files. We are planning on an RDD ingester - interoperability with
other frameworks is in everybody's interest.
o */A note on sparse data/*: H2O sports about 15 different
compression schemes under the hood, including ones designed to
compress sparse data. We happily import SVMLight data without ever
having it "blow up", while still fully supporting the
array-access API, including the speed guarantees.
o */A note on missing data/*: Most datasets have *missing*
elements, and most math algorithms deal with missing data
specially. H2O fully supports a notion of "NA" for all data,
including setting, testing, selecting in (or out), etc, and this
notion is woven through the data presentation layer.
o */A note on streaming data/*: H2O vectors can have data inserted
& removed (anywhere, in any order) continuously. In particular,
it's easy to add new data at the end and remove it from the
start - i.e., build a large rolling dataset holding all the
elements that fit given a memory budget and a data flow-rate.
This has been on our roadmap for a while, and needs only a little
more work to be fully functional.
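To make the array-style access (and the NA handling) described above
concrete, here is a minimal sketch - it assumes a Vec API with
length(), at() and isNA() accessors; exact class and method names may
differ between H2O versions:

    // Minimal sketch of element-wise access to a distributed H2O vector.
    // Assumes a Vec with length(), at(row) and isNA(row); exact names
    // may differ between versions.
    import water.fvec.Vec;

    public class VecSketch {
      static double sumIgnoringNAs( Vec v ) {
        double sum = 0;
        for( long row = 0; row < v.length(); row++ ) {  // linear access: memory-bandwidth bound
          if( !v.isNA(row) )                            // NA test woven into the API
            sum += v.at(row);                           // decompressed on the fly, in registers
        }
        return sum;  // (in practice you'd parallelize this with the Map/Reduce layer below)
      }
    }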
* */Light-weight Map/Reduce layer/*: Map/Reduce is a nice way to write
blatantly parallel code (although not the only way), and we support
a particularly fast and efficient flavor. A Map maps Type A to Type
B, and a Reduce combines two Type B's into one Type B. Both Types A
& B can be a combination of small-data (described as a Plain Old
Java Object, a POJO) and big-data (described as another Giant H2O
distributed array). Here's an example map from a Type A (a pair of
columns) to a Type B (a POJO of class MyMR holding various sums):
*class MyMR extends MRTask<MyMR> {
  double sum0, sum1, sq_sum0;                   // Most things are allowed here
  @Override public void map( double d0, double d1 ) {
    sum0 += d0; sum1 += d1; sq_sum0 += d0*d0;   // Again most any Java code here
  }
  @Override public void reduce( MyMR my ) {     // Combine two MyMRs together
    sum0 += my.sum0; sum1 += my.sum1; sq_sum0 += my.sq_sum0;
  }
}
new MyMR().doAll( v0, v1 );  // Invoke in-parallel & distributed over Vecs v0 and v1*
This code will be distributed 'round the cluster, and run at
memory-bandwidth speeds (on compressed data!) with no further ado.
There's a lot of mileage possible here that I'm only touching
on lightly. Filtering, subsetting, writing results into temp arrays
that are used on later passes, uniques on billions of rows, and
ddply-style group-by operations all work in this Map/Reduce
framework - and all by writing plain old Java.
o */Scala, and a note on API cleanliness/*: We fully acknowledge
Java's weaknesses here - this is the Java 6 flavor of coding style;
the Java 7 style is nicer, but still not as nice as some other
languages. We fully embrace & support alternative syntaxes
over our engine. In particular, we have an engineer working on
an in-process Scala interface (amongst others). We are shifting
our focus now, from the excellent backend to the API interface
side of things. This is a work-in-progress for us, and we are
looking forward to much improvement over the next year.
* *Pre-Baked Algorithms Layer*: We have the following algorithms
pre-baked, fully optimized and full-featured: Generalized Linear
Modeling, including Logistic Regression plus Gaussian, Gamma,
Poisson, and Tweedie distributions. Neural Nets. Random Forest
(that scales *out* to all the data in the cluster). Gradient
Boosted Machine (again, in-parallel & fully distributed). PCA.
KMeans (& variants). Quantiles (any quantile, computed *exactly* in
milliseconds). All these algorithms support Confusion Matrices
(with adjustable thresholds), AUC & ROC metrics, and incremental
test-dataset results on partially trained models during the build
process. Within each algorithm, we support a full range of options
that you'd find in the comparable R or SAS package.
o */A note on some Mahout algorithms/*: We're clearly well suited
to e.g. SSVD and Co-occurrence and have talked with Ted Dunning
at length on how they would be implemented in H2O.
* *REST / JSON / R / python / Excel / REPL*: The system is externally
drivable via URLs/REST API calls, with JSON responses. We use
REST/JSON from Python to drive our entire testing harness. We have a
very nice R package with H2O integrated behind R - you can issue R
commands to an H2O-backed R "data.frame" - and have all the Big Math
work on the Big Data in a cluster - including 90% of the typical
"data munging" workflow. This same REST/JSON interface also works
with e.g. Excel (yes we have a demo) or shell scripts. We have an
R-like language REPL. We have a pretty web GUI over the REST/JSON
layer that is suitable for lightweight modeling tasks.
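As a tiny illustration of driving that layer programmatically - a
sketch only: it assumes a node listening on H2O's default port 54321,
and the cloud-status endpoint named here is illustrative (see the
REST API docs for the real URL set):

    // Minimal sketch: drive H2O over REST and read the JSON response.
    // Assumes a node on the default port 54321; the endpoint name below
    // is illustrative - consult the REST API docs for the actual routes.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class RestSketch {
      public static void main( String[] args ) throws Exception {
        URL url = new URL("http://localhost:54321/Cloud.json");  // example endpoint
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(url.openStream()))) {
          String line;
          while( (line = in.readLine()) != null )
            System.out.println(line);  // JSON describing the cluster
        }
      }
    }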
Cliff