I can answer some of these. If we need more, others can chime in.
On Fri, Mar 14, 2014 at 12:13 AM, Sebastian Schelter <s...@apache.org> wrote:

> Hi,
>
> to me one problem is that I couldn't find documentation that gives a
> comprehensive picture of the programming and execution model of h2o.

Documentation is definitely lacking.

> I'd like to get answers to the following questions:
>
> Which operators does it offer, how are they combined to create programs,
> and how are those programs executed on the cluster?

I think that Cliff's email described the operators in terms of programming
constructs. They are not operators in the dataflow sense, but more of a
fast fork/join programming primitive (at the lowest level).

> Atm, it seems to me that h2o can do the following things on partitioned,
> columnar stored data:
>
> * execute computations independently on all partitions (-> a map
>   operation)

Yes.

> * compute a global aggregate from the results of the map (-> a functional
>   reduce operation (not the Hadoop one))

Yes.

> * compute aggregates of grouped data, as long as the aggregates fit into
>   main memory (-> Hadoop's reduce operation)

Yes. But also, because partitions can be accessed from anywhere at moderate
cost, you can get a fairly efficient equivalent of shuffles.

> One caveat though is that there are more aspects that a system suitable
> for machine learning should offer, and I'm still trying to understand how
> much we'd get from h2o in those terms.
>
> (1) Efficient execution of iterative programs.
>
> In Hadoop, every iteration must be scheduled as a separate job, rereads
> invariant data and materializes its result to hdfs. Therefore, iterative
> programs on Hadoop are an order of magnitude slower than on systems that
> have dedicated support for iterations.
>
> Does h2o help here or would we need to incorporate another system for
> such tasks?

H2o helps here in a couple of different ways. The first and foremost is
that primitive operations are easy to chain within a single program.
Moreover, these operations can be chained using conventional
function/method structuring and conventional looping. This makes it easier
to code. Additionally, data elements can survive a single program's
execution. This means that programs can be executed one after another to
get composite effects. This is astonishingly fast ... more along the speeds
one would expect from a single-processor program. Both of these help with
iteration on data. The first helps with iteration as in a Lanczos
algorithm. The second helps with iteration where data is read, then reduced
to a nominal form, then a model is derived, then new data is scored, and so
on.
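To make the "chain passes with ordinary loops" point concrete, here is a
rough single-node sketch of the style. It uses plain java.util.concurrent
fork/join over an in-memory double[] as a stand-in for H2O's distributed
vectors, so read it as the shape of the pattern (each pass is a map/reduce,
and a normal for-loop chains the passes, power-iteration style), not as
actual H2O API:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    public class IterationSketch {
      // One "pass": map each element to its square, reduce by summing.
      static class SumSq extends RecursiveTask<Double> {
        final double[] v; final int lo, hi;
        SumSq(double[] v, int lo, int hi) { this.v = v; this.lo = lo; this.hi = hi; }
        @Override protected Double compute() {
          if (hi - lo < 1024) {                    // small range: do the "map" directly
            double s = 0;
            for (int i = lo; i < hi; i++) s += v[i] * v[i];
            return s;
          }
          int mid = (lo + hi) >>> 1;               // otherwise fork left/right halves
          SumSq left = new SumSq(v, lo, mid), right = new SumSq(v, mid, hi);
          left.fork();
          return right.compute() + left.join();    // "reduce": combine the halves
        }
      }

      public static void main(String[] args) {
        double[] x = new double[1 << 20];
        java.util.Arrays.fill(x, 1.0);
        ForkJoinPool pool = new ForkJoinPool();
        for (int iter = 0; iter < 10; iter++) {    // conventional looping chains the passes
          double norm = Math.sqrt(pool.invoke(new SumSq(x, 0, x.length)));
          for (int i = 0; i < x.length; i++) x[i] /= norm;  // next pass reuses in-memory data
        }
        System.out.println(x[0]);
      }
    }

In h2o the per-pass body looks much like the MRTask example in Cliff's
email quoted below, except that the data it loops over is the cluster-wide
distributed array rather than a local one.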
> (2) Efficient join implementations
>
> If we look at a lot of Mahout's algorithm implementations with a database
> hat on, then we see lots of handcoded joins in our codebase, because
> Hadoop does not bring join primitives. This has lots of drawbacks, e.g. it
> complicates the codebase and leads to hardcoded join strategies that bake
> certain assumptions into the code (e.g. ALS uses a broadcast-join which
> assumes that one side fits into memory on each machine, RecommenderJob
> uses a repartition-join which is scalable but very slow for small
> inputs,...).

I think that h2o provides this but do not know in detail how. I do know
that many of the algorithms already coded make use of matrix
multiplication, which is essentially a join operation.

> Obviously, I'd love to get rid of handcoded joins and implement ML
> algorithms (which is hard enough on its own). Other systems help with
> this already. Spark, for example, offers broadcast and repartition-join
> primitives, Stratosphere has a join primitive and an optimizer that
> automatically decides which join strategy to use, as well as a highly
> optimized hybrid hashjoin implementation that can gracefully go
> out-of-core under memory pressure.

When you get into the realm of things on this level of sophistication, I
think that you have found the boundary where alternative foundations like
Spark and Stratosphere are better than h2o. The novelty with h2o is the
hypothesis that a very large fraction of interesting ML algorithms can be
implemented without this power. So far, this seems correct.

> I have a few more things in mind, but I guess these are the most
> important ones right now.
>
> Thx,
> Sebastian
>
>
> On 03/14/2014 07:41 AM, Sri wrote:
>
>> Dmitriy,
>>
>> H2O is about bringing better algorithms to big data and now to Mahout.
>> Users will be able to use, access and extend the sophisticated,
>> high-precision and richly featured algorithms. We are a bit at a loss
>> (puzzled) at the comparisons with general-purpose computing platforms -
>> our core vision and philosophy is around scaling advanced algorithms
>> (and we should use the best-of-breed computation platform for the
>> problem at hand.)
>>
>> Sri
>>
>> On Mar 13, 2014, at 18:08, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>>> Thank you, Cliff.
>>>
>>> Those things are pretty much clear. Most of the questions were more
>>> along the lines of which of those wonderful things you intend to port
>>> to Mahout, and how you see these stitching in with the existing Mahout
>>> architecture.
>>>
>>> At least one of your users here reported it does not make sense to run
>>> Mahout on all this, and at least two of us have trouble seeing how such
>>> disassembly and reassembly might take place. What are your thoughts on
>>> this? How clearly do you see the reintegration roadmap?
>>>
>>> Do you intend to keep the h2o platform around as a standalone project?
>>>
>>> Do you intend to contribute top-level algorithms as well? What are your
>>> thoughts on interoperability of top-level algorithms with other
>>> memory-based backends?
>>>
>>> Thank you.
>>>
>>> On Mar 13, 2014 3:50 PM, "Cliff Click" <ccli...@gmail.com> wrote:
>>>
>>>> There have been a lot of questions on the H2O architecture; I hope to
>>>> answer the top-level ones here.
>>>>
>>>> H2O is a fast & flexible engine. We talk about the MapReduce execution
>>>> flavor, because it's easy to explain, because it covers a lot of
>>>> ground, and because we've implemented a bunch of dense linear-algebra
>>>> style algorithms with it - but that's not the only thing we can do
>>>> with H2O, nor is it the only coding "style".
>>>>
>>>> H2O is based on a number of layers, and is coded at different layers
>>>> to best approach different tasks and objectives.
>>>>
>>>>  * *In-memory K/V store layer*: H2O sports an in-memory
>>>>    (not-persistent) K/V store, with *exact* (not lazy) consistency
>>>>    semantics and transactions. Both reads and writes are fully
>>>>    (locally) cachable. Typical cache-hit latencies for both are around
>>>>    150ns (that's *nanoseconds*) from a NonBlockingHashMap. Let me
>>>>    repeat that: reads and writes go through a non-blocking hash table
>>>>    - we do NOT suffer (CANNOT suffer) from a hot-blocks problem.
>>>>    Cache-misses obviously require a network hop, and the execution
>>>>    times are totally driven by the size of data moved divided by
>>>>    available bandwidth...
>>>>    and of course the results are cached. The K/V store is currently
>>>>    used to hold control state, all results, and of course the Big Data
>>>>    itself. You could certainly build a dandy graph-based algorithm
>>>>    directly over the K/V store; that's been on our long-term roadmap
>>>>    for awhile.
>>>>
>>>>  * *A columnar-compressed distributed Big Data store layer*: Big Data
>>>>    is heavily (and losslessly) compressed - typically 2x to 4x better
>>>>    than GZIP on disk (YMMV) - and can be accessed like a Java array: a
>>>>    giant, greater-than-4-billion-element distributed Java array. H2O
>>>>    guarantees that if the data is accessed linearly then the access
>>>>    time will match what you can get out of C or Fortran - i.e., be
>>>>    memory-bandwidth bound, not CPU bound. You can access the array
>>>>    (for both reads and writes) in any order, of course, but you get
>>>>    strong speed guarantees for accessing in-order. You can do pretty
>>>>    much anything to an H2O array that you can do with a Java array,
>>>>    although due to size/scale you'll probably want to access the array
>>>>    in a blatantly parallel style.
>>>>    o *A note on compression*: The data is decompressed Just-In-Time,
>>>>      strictly in CPU registers, in the hot inner loops - and THIS IS
>>>>      FASTER than decompressing beforehand, because most algorithms are
>>>>      memory-bandwidth bound. Moving a 32-byte cacheline of compressed
>>>>      data into CPU registers gets more data per cache-miss than moving
>>>>      4 8-byte doubles. Decompression typically takes 2-4 instructions
>>>>      of shift/scale/add per element, and is well covered by the
>>>>      cache-miss costs.
>>>>    o *A note on Big Data and GC*: H2O keeps all our data *in heap*,
>>>>      but in large arrays of Java primitives. Our experience shows that
>>>>      we run well, without GC issues, even *on very large heaps with
>>>>      the default collector*. We routinely test with e.g. heaps from 2G
>>>>      to 200G - and never see FullGC costs exceed a few seconds every
>>>>      now and then (depends on the rate of Big Data writing going on).
>>>>      The normal Java object allocation used to drive the system
>>>>      internally has a negligible GC load. We keep our data in-heap
>>>>      because it's as fast as possible (memory-bandwidth limited), easy
>>>>      to code (pure Java), and has no interesting GC costs. Our GC
>>>>      tuning policy is: "only use the -Xmx flag, set to the largest you
>>>>      can allow given the machine resources". Take all the other GC
>>>>      defaults; they will work fine.
>>>>    o *A note on Bigger Data (and GC)*: We do a user-mode swap-to-disk
>>>>      when the Java heap gets too full, i.e., you're using more Big
>>>>      Data than physical DRAM. We won't die with a GC death-spiral, but
>>>>      we will degrade to out-of-core speeds. We'll go as fast as the
>>>>      disk will allow.
>>>>    o *A note on data ingest*: We read data fully parallelized from S3,
>>>>      HDFS, NFS, URIs, browser uploads, etc. We can typically drive
>>>>      HDFS disk spindles to an interesting fraction of what you can get
>>>>      from e.g. HDFS file-copy. We parse & compress (in parallel) a
>>>>      very generous notion of a CSV file (for instance, Hive files are
>>>>      directly ingestable), and SVM light files. We are planning on an
>>>>      RDD ingester - interoperability with other frameworks is in
>>>>      everybody's interest.
>>>>    o *A note on sparse data*: H2O sports about 15 different
>>>>      compression schemes under the hood, including ones designed to
>>>>      compress sparse data.
>>>>      We happily import SVMLight without ever having the data "blow
>>>>      up", while still fully supporting the array-access API, including
>>>>      the speed guarantees.
>>>>    o *A note on missing data*: Most datasets have *missing* elements,
>>>>      and most math algorithms deal with missing data specially. H2O
>>>>      fully supports a notion of "NA" for all data, including setting,
>>>>      testing, selecting in (or out), etc., and this notion is woven
>>>>      through the data presentation layer.
>>>>    o *A note on streaming data*: H2O vectors can have data inserted &
>>>>      removed (anywhere, in any order) continuously. In particular,
>>>>      it's easy to add new data at the end and remove it from the start
>>>>      - i.e., build a large rolling dataset holding all the elements
>>>>      that fit given a memory budget and a data flow-rate. This has
>>>>      been on our roadmap for awhile, and needs only a little more work
>>>>      to be fully functional.
>>>>
>>>>  * *Light-weight Map/Reduce layer*: Map/Reduce is a nice way to write
>>>>    blatantly parallel code (although not the only way), and we support
>>>>    a particularly fast and efficient flavor. A Map maps Type A to Type
>>>>    B, and a Reduce combines two Type B's into one Type B. Both Types A
>>>>    & B can be a combination of small-data (described as a Plain Old
>>>>    Java Object, a POJO) and big-data (described as another giant H2O
>>>>    distributed array). Here's an example map from a Type A (a pair of
>>>>    columns) to a Type B (a POJO of class MyMR holding various sums):
>>>>
>>>>      class MyMR extends MRTask {
>>>>        double sum0, sum1, sq_sum0;                  // Most things are allowed here
>>>>        @Override public void map( double d0, double d1 ) {
>>>>          sum0 += d0; sum1 += d1; sq_sum0 += d0*d0;  // Again, most any Java code here
>>>>        }
>>>>        @Override public void reduce( MyMR my ) {    // Combine two MyMRs together
>>>>          sum0 += my.sum0; sum1 += my.sum1; sq_sum0 += my.sq_sum0;
>>>>        }
>>>>      }
>>>>      new MyMR().doAll( v0, v1 );                    // Invoke in-parallel, distributed
>>>>
>>>>    This code will be distributed 'round the cluster, and run at
>>>>    memory-bandwidth speeds (on compressed data!) with no further ado.
>>>>    There's a lot of mileage possible here that I'm only touching
>>>>    lightly on. Filtering, subsetting, writing results into temp arrays
>>>>    that are used on later passes, uniques on billions of rows, and
>>>>    ddply-style group-by operations all work in this Map/Reduce
>>>>    framework - and all work by writing plain old Java.
>>>>    o *Scala, and a note on API cleanliness*: We fully acknowledge
>>>>      Java's weaknesses here - this is the Java6-flavor coding style;
>>>>      Java7 style is nicer - but still not as nice as some other
>>>>      languages. We fully embrace & support alternative syntaxes over
>>>>      our engine. In particular, we have an engineer working on
>>>>      in-process Scala (amongst other) interfaces. We are shifting our
>>>>      focus now from the excellent backend to the API interface side of
>>>>      things. This is a work-in-progress for us, and we are looking
>>>>      forward to much improvement over the next year.
>>>>
>>>>  * *Pre-Baked Algorithms Layer*: We have the following algorithms
>>>>    pre-baked, fully optimized and full-featured: Generalized Linear
>>>>    Modeling, including Logistic Regression plus Gaussian, Gamma,
>>>>    Poisson, and Tweedie distributions. Neural Nets. Random Forest
>>>>    (that scales *out* to all the data in the cluster). Gradient
>>>>    Boosted Machine (again, in-parallel & fully distributed). PCA.
>>>>    KMeans (& variants).
>>>>    Quantiles (any quantile, computed *exactly* in milliseconds). All
>>>>    these algorithms support Confusion Matrices (with adjustable
>>>>    thresholds), AUC & ROC metrics, and incremental test-data-set
>>>>    results on partially trained models during the build process.
>>>>    Within each algorithm, we support a full range of options that
>>>>    you'd find in the similar R or SAS package.
>>>>    o *A note on some Mahout algorithms*: We're clearly well suited to
>>>>      e.g. SSVD and Co-occurrence, and have talked with Ted Dunning at
>>>>      length on how they would be implemented in H2O.
>>>>
>>>>  * *REST / JSON / R / Python / Excel / REPL*: The system is externally
>>>>    drivable via URLs/REST API calls, with JSON responses. We use
>>>>    REST/JSON from Python to drive our whole testing harness. We have a
>>>>    very nice R package with H2O integrated behind R - you can issue R
>>>>    commands to an H2O-backed R "data.frame" - and have all the Big
>>>>    Math work on the Big Data in a cluster - including 90% of the
>>>>    typical "data munging" workflow. This same REST/JSON interface also
>>>>    works with e.g. Excel (yes, we have a demo) or shell scripts. We
>>>>    have an R-like language REPL. We have a pretty web GUI over the
>>>>    REST/JSON layer that is suitable for lightweight modeling tasks.
>>>>
>>>> Cliff
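P.S. One small addendum on the "decompress just-in-time in registers" note
in Cliff's email above, since that is the part people usually raise an
eyebrow at: the arithmetic only works because the per-element decode is a
couple of instructions. A toy scale/bias decode looks roughly like the
following (purely illustrative - I am not claiming this is H2O's actual
encoding, and the constants are made up):

    public class DecodeSketch {
      public static void main(String[] args) {
        // A column whose values fit in one byte after a scale/bias encoding
        // is stored as byte[], so a 32-byte cacheline carries 32 elements
        // instead of 4 uncompressed doubles.
        byte[] chunk = { 0, 5, 10, 15 };        // stored (compressed) column chunk
        double scale = 0.25, bias = 100.0;      // per-chunk decode constants
        double sum = 0;
        for (int i = 0; i < chunk.length; i++) {
          double d = (chunk[i] & 0xFF) * scale + bias;  // decode in registers: scale + add
          sum += d;                                     // the hot loop's real work
        }
        System.out.println(sum);                // prints 407.5
      }
    }

If the per-element decode really is this cheap, the memory-bandwidth-bound
argument follows directly: you pay for the cache misses, not for the
shift/scale/add.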