"detailed description of H2O's programming and execution model."
No *formal* documentation for this exists; there's been no time to write
such a thing.
There are easy-to-find slide decks & video talks. Here are two:
- http://www.infoq.com/presentations/api-memory-analytics
- http://www.infoq.com/interviews/click-0xdata
Summary:
- A high-performance in-memory K/V store (cache hits are ~150ns;
misses depend on network transfer times). Supports exact, full JMM
semantics & transactions. Used to hold the Big Data & to control
computations.
- Big Data support via Frames/Vecs/Chunks - see the above slides for a
graphical overview; compression "is an implementation feature" but is
not visible in the execution model except as speed or size constraints.
- A well-tuned data-ingestion system
- Map/Reduce coding style; uses Java 1.7's Fork/Join on a single node,
but distributed across nodes. Maps are fine-grained F/J tasks and can
produce both a Big output (distributed parallel writing to Frames/Vecs)
and a Small output (anything in a POJO). Reductions are also
fine-grained, and happen anytime 2 maps are done... so there is no
separate "reduction" phase. Not the Hadoop M/R - no sort or shuffle
steps; everything is in DRAM.
- REST/JSON access to most algos & coding. A web-browser/HTML UI sits
over that.
- Internal DSL - a work in progress. Right now it converts a subset of
the R language to ASTs, then executes the ASTs. Covers a fairly large
subset of the bulk/array operators in R, and expressions built thereof.
Includes 1st-class functions and e.g. GroupBy (ddply in R lingo).
Expressions like "|apply(someFrame,2,function(x){
ifelse(is.na(x),mean(x),x)})|" will replace NA's in "someFrame" with the
mean of the column. It's R syntax (or very close to R), not Scala.
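The K/V access pattern described above can be sketched in plain Java. This is a hypothetical stand-in, not H2O's actual API: reads hit a node-local cache first (the fast, ~local-memory case) and fall back to the key's home node on a miss, caching the result so the network transfer is paid once. The "home node" here is faked by a second in-process map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not the real H2O API) of the K/V access pattern:
// reads check a node-local cache first; on a miss the value is fetched
// from its home node (faked here by a second map) and cached locally.
public class KVSketch {
    private final Map<String, byte[]> localCache = new ConcurrentHashMap<>();
    private final Map<String, byte[]> homeNode  = new ConcurrentHashMap<>(); // stand-in for the cluster

    public void put(String key, byte[] value) {
        homeNode.put(key, value);    // writes land at the key's home node
        localCache.put(key, value);  // and update the local cache
    }

    public byte[] get(String key) {
        byte[] v = localCache.get(key);          // cache hit: local-memory speed
        if (v == null) {
            v = homeNode.get(key);               // cache miss: "network" fetch
            if (v != null) localCache.put(key, v);
        }
        return v;
    }
}
```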
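The Frame/Vec/Chunk layout can likewise be sketched: one logical column (a Vec) stored as an array of fixed-size chunks, with element access mapping a global row index to a (chunk, offset) pair. In the real system the chunks are distributed across nodes and compressed; neither is modeled in this toy class, and the names and chunk size are illustrative only.

```java
import java.util.Arrays;

// Hypothetical sketch of the Vec/Chunk layout: a column split into
// fixed-size chunks; at(i) maps a global row index to (chunk, offset).
// Real chunks are distributed and compressed -- not modeled here.
public class VecSketch {
    static final int CHUNK_LEN = 4;   // tiny, for illustration
    private final double[][] chunks;
    private final long len;

    public VecSketch(double[] data) {
        this.len = data.length;
        int nChunks = (data.length + CHUNK_LEN - 1) / CHUNK_LEN;
        chunks = new double[nChunks][];
        for (int c = 0; c < nChunks; c++) {
            int start = c * CHUNK_LEN;
            int end = Math.min(start + CHUNK_LEN, data.length);
            chunks[c] = Arrays.copyOfRange(data, start, end);
        }
    }

    public double at(long i) {        // global row -> (chunk, offset)
        return chunks[(int) (i / CHUNK_LEN)][(int) (i % CHUNK_LEN)];
    }

    public long length() { return len; }
}
```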
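The map/reduce-over-Fork/Join bullet can be illustrated with standard java.util.concurrent, independent of H2O: each leaf task is a fine-grained "map" over one slice, and the "reduce" combines two results as soon as both halves finish, with no separate reduction phase and no sort/shuffle. The class name and cutoff are made up for the sketch.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch of the map/reduce style described above, using
// Java's Fork/Join: leaves are fine-grained "maps" over slices, and a
// "reduce" (here: +) runs as soon as two sibling maps are done.
public class MapReduceSketch extends RecursiveTask<Double> {
    private static final int CUTOFF = 8;  // slice size for a leaf "map"
    private final double[] data;
    private final int lo, hi;

    public MapReduceSketch(double[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Double compute() {
        if (hi - lo <= CUTOFF) {          // leaf: the "map" over one slice
            double sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;        // split; halves run in parallel
        MapReduceSketch left  = new MapReduceSketch(data, lo, mid);
        MapReduceSketch right = new MapReduceSketch(data, mid, hi);
        left.fork();                      // run left half asynchronously
        double r = right.compute();       // compute right half in this thread
        double l = left.join();           // wait for the forked half
        return l + r;                     // the "reduce": combine two maps
    }

    public static double sum(double[] data) {
        return new ForkJoinPool().invoke(new MapReduceSketch(data, 0, data.length));
    }
}
```

In H2O the same pattern runs per-chunk and is additionally distributed across nodes; this sketch only shows the single-node Fork/Join half of the story.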
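For concreteness, the effect of the apply/ifelse expression above - replace each NA with the mean of its column - can be written out in plain Java (not the DSL itself; NA is modeled as NaN, and the per-column mean is taken over the non-NA values, i.e. mean(x, na.rm=TRUE) semantics):

```java
// Plain-Java rendering of the R expression
//   apply(someFrame, 2, function(x) ifelse(is.na(x), mean(x), x))
// with the per-column mean computed over non-NA (non-NaN) values.
public class ImputeSketch {
    public static double[][] imputeColumnMeans(double[][] frame) {
        int rows = frame.length, cols = frame[0].length;
        double[][] out = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double sum = 0; int n = 0;
            for (int r = 0; r < rows; r++)          // mean of non-NA entries
                if (!Double.isNaN(frame[r][c])) { sum += frame[r][c]; n++; }
            double mean = (n == 0) ? Double.NaN : sum / n;
            for (int r = 0; r < rows; r++)          // the ifelse(is.na(x), ...)
                out[r][c] = Double.isNaN(frame[r][c]) ? mean : frame[r][c];
        }
        return out;
    }
}
```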
Cliff
On 5/1/2014 10:13 AM, Dmitriy Lyubimov wrote:
I'd be happy to see a concept of how to bring the operations of the DSL
onto H2O, as well as a detailed description of H2O's programming and
execution model.
+1.
--sebastian