Nice to meet you, Andrew. Thanks for your hand wave. We are enjoying the warmth of a passionate community. I only meant to emphasize the fact that Mahout is the most popular entry point for ML users. There are several sophisticated algorithms in Mahout, and users love them. I'm a fan of quite a few of them: ssvd, streaming k-means, co-occurrence, cf, sim, t-digest and others. Many a night was spent reading the Random Forest implementation before moving on to the R version. It's not an accident that some of the successful Mahout algorithms have not been on H2O's first list of implementations. They have a devoted user base, and we could speed them up by a lot in-memory.
I would love to hear about the usability improvements and make at least a
few of them possible (both through the speed and interactivity we bring
in). The workflows of data science are just as important. Ad-hoc analysis
is a big part of that experience, and we focused on some of that in H2O,
with R. A simple twitter-bootstrap auto-generated web UI comes from a core
JSON API in H2O; some of that can be morphed for the cause. Ease of use
and extensibility will win user mindshare. As a scalable platform for ML
and a powerful interactive environment, with polyglot interfaces into
Python, Scala, R and Java, Mahout has the potential to become the Linux of
the machine learning world.

This is worth making happen,
Sri

On Thu, Mar 13, 2014 at 9:05 PM, Andrew Musselman
<andrew.mussel...@gmail.com> wrote:

> Thanks Sri; nice to meet you and thanks for the conversation.
>
> When you say "hello world" I presume you're emphasizing that Mahout is a
> popular entry point for people seeking to join the field, rather than its
> being simple or easy to pick up.
>
> We've been talking about ways to make Mahout easier to adopt and adapt to
> as a tool, so if your team would be willing to pitch in on some of the
> usability issues we have, I think that would be welcome along with the
> mathematics work that's already being discussed.
>
> On Thu, Mar 13, 2014 at 8:10 PM, SriSatish Ambati <srisat...@0xdata.com> wrote:
>
> > Mahout is the hello world of Machine Learning. It's still the first place
> > many new users get exposed to algorithms on big data. Making that
> > experience beautiful, accessible and value-driven will make machine
> > learning ubiquitous, and Mahout a movement to rival the success and
> > utility of, say, Lucene and Hadoop. Our vision and motivation is to
> > re-ignite the community and double down on the identical founding visions
> > of Mahout and H2O. Under one umbrella, Mahout can power intelligent
> > applications for the enterprises and users.
> > Creating great software is hard; creating passionate communities is
> > harder. Our belief is that a product is not complete without its
> > community. This convergence will make Mahout the principal platform for
> > integrating multiple ways of mining insights from data.
> >
> > the whole is greater than the sum of the parts,
> > Sri
> >
> > On Thu, Mar 13, 2014 at 6:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > > PS and of course it all sounds like a well-rounded project that exceeds
> > > current Mahout capabilities (in the mapreduce world anyway). So not the
> > > least question is why you are seeking integration with Mahout. Clearly
> > > that would involve significant effort to do some things the Mahout way.
> > > So what's the motivation?
> > > On Mar 13, 2014 6:08 PM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:
> > >
> > > > Thank you, Cliff.
> > > >
> > > > Those things are pretty much clear. Most of the questions were more
> > > > along the lines of which of those wonderful things you intend to port
> > > > to Mahout, and how you see these stitching in with the existing Mahout
> > > > architecture.
> > > >
> > > > At least one of your users here reported it does not make sense to run
> > > > Mahout on all this, and at least two of us have trouble seeing how such
> > > > disassembly and reassembly might take place. What are your thoughts on
> > > > this? How clearly do you see the reintegration roadmap?
> > > >
> > > > Do you intend to keep the h2o platform around as a standalone project?
> > > >
> > > > Do you intend to contribute top-level algorithms as well? What are your
> > > > thoughts on interoperability of top-level algorithms with other
> > > > memory-based backends?
> > > >
> > > > Thank you.
> > > > On Mar 13, 2014 3:50 PM, "Cliff Click" <ccli...@gmail.com> wrote:
> > > >
> > > >> There have been a lot of questions on the H2O architecture; I hope to
> > > >> answer the top-level ones here.
> > > >>
> > > >> H2O is a fast & flexible engine. We talk about the MapReduce
> > > >> execution flavor because it's easy to explain, because it covers a
> > > >> lot of ground, and because we've implemented a bunch of dense
> > > >> linear-algebra-style algorithms with it - but that's not the only
> > > >> thing we can do with H2O, nor is it the only coding "style".
> > > >>
> > > >> H2O is based on a number of layers, and is coded at different layers
> > > >> to best approach different tasks and objectives.
> > > >>
> > > >> * *In-memory K/V store layer*: H2O sports an in-memory
> > > >>   (non-persistent) K/V store, with *exact* (not lazy) consistency
> > > >>   semantics and transactions. Both reads and writes are fully
> > > >>   (locally) cachable. Typical cache-hit latencies for both are around
> > > >>   150ns (that's *nanoseconds*) from a NonBlockingHashMap. Let me
> > > >>   repeat that: reads and writes go through a non-blocking hash table
> > > >>   - we do NOT suffer (CANNOT suffer) from a hot-blocks problem.
> > > >>   Cache-misses obviously require a network hop, and the execution
> > > >>   times are totally driven by the size of data moved divided by
> > > >>   available bandwidth... and of course the results are cached. The
> > > >>   K/V store is currently used to hold control state, all results, and
> > > >>   of course the Big Data itself. You could certainly build a dandy
> > > >>   graph-based algorithm directly over the K/V store; that's been on
> > > >>   our long-term roadmap for a while.
> > > >>
> > > >> * *A columnar-compressed distributed Big Data store layer*: Big Data
> > > >>   is heavily (and losslessly) compressed - typically 2x to 4x better
> > > >>   than GZIP on disk (YMMV) - and can be accessed like a Java array: a
> > > >>   giant, greater-than-4-billion-element distributed Java array.
> > > >>   H2O guarantees that if the data is accessed linearly then the
> > > >>   access time will match what you can get out of C or Fortran -
> > > >>   i.e., be memory-bandwidth bound, not CPU bound. You can access the
> > > >>   array (for both reads and writes) in any order, of course, but you
> > > >>   get strong speed guarantees for accessing in-order. You can do
> > > >>   pretty much anything to an H2O array that you can do with a Java
> > > >>   array, although due to size/scale you'll probably want to access
> > > >>   the array in a blatantly parallel style.
> > > >>   o *A note on compression*: The data is decompressed Just-In-Time,
> > > >>     strictly in CPU registers, in the hot inner loops - and THIS IS
> > > >>     FASTER than decompressing beforehand, because most algorithms are
> > > >>     memory-bandwidth bound. Moving a 32-byte cache line of compressed
> > > >>     data into CPU registers gets more data per cache-miss than moving
> > > >>     4 8-byte doubles. Decompression typically takes 2-4 instructions
> > > >>     of shift/scale/add per element, and is well covered by the
> > > >>     cache-miss costs.
> > > >>   o *A note on Big Data and GC*: H2O keeps all our data *in heap*,
> > > >>     but in large arrays of Java primitives. Our experience shows that
> > > >>     we run well, without GC issues, even *on very large heaps with
> > > >>     the default collector*. We routinely test with heaps from 2G to
> > > >>     200G - and never see FullGC costs exceed a few seconds every now
> > > >>     and then (depending on the rate of Big Data writing going on).
> > > >>     The normal Java object allocation used to drive the system
> > > >>     internally has a negligible GC load. We keep our data in-heap
> > > >>     because it's as fast as possible (memory-bandwidth limited), easy
> > > >>     to code (pure Java), and has no interesting GC costs.
> > > >>     Our GC tuning policy is: "only use the -Xmx flag, set to the
> > > >>     largest you can allow given the machine resources". Take all the
> > > >>     other GC defaults; they will work fine.
> > > >>   o *A note on Bigger Data (and GC)*: We do a user-mode swap-to-disk
> > > >>     when the Java heap gets too full, i.e., when you're using more
> > > >>     Big Data than physical DRAM. We won't die with a GC death-spiral,
> > > >>     but we will degrade to out-of-core speeds. We'll go as fast as
> > > >>     the disk will allow.
> > > >>   o *A note on data ingest*: We read data fully parallelized from S3,
> > > >>     HDFS, NFS, URIs, browser uploads, etc. We can typically drive
> > > >>     HDFS disk spindles to an interesting fraction of what you can get
> > > >>     from e.g. an HDFS file-copy. We parse & compress (in parallel) a
> > > >>     very generous notion of a CSV file (for instance, Hive files are
> > > >>     directly ingestable), and SVMLight files. We are planning on an
> > > >>     RDD ingester - interactivity with other frameworks is in
> > > >>     everybody's interest.
> > > >>   o *A note on sparse data*: H2O sports about 15 different
> > > >>     compression schemes under the hood, including ones designed to
> > > >>     compress sparse data. We happily import SVMLight without ever
> > > >>     having the data "blow up", while still fully supporting the
> > > >>     array-access API, including the speed guarantees.
> > > >>   o *A note on missing data*: Most datasets have *missing* elements,
> > > >>     and most math algorithms deal with missing data specially. H2O
> > > >>     fully supports a notion of "NA" for all data, including setting,
> > > >>     testing, selecting in (or out), etc., and this notion is woven
> > > >>     through the data presentation layer.
> > > >>   o *A note on streaming data*: H2O vectors can have data inserted &
> > > >>     removed (anywhere, in any order) continuously.
> > > >>     In particular, it's easy to add new data at the end and remove
> > > >>     it from the start - i.e., build a large rolling dataset holding
> > > >>     all the elements that fit given a memory budget and a data
> > > >>     flow-rate. This has been on our roadmap for a while, and needs
> > > >>     only a little more work to be fully functional.
> > > >>
> > > >> * *Light-weight Map/Reduce layer*: Map/Reduce is a nice way to write
> > > >>   blatantly parallel code (although not the only way), and we support
> > > >>   a particularly fast and efficient flavor. A Map maps Type A to Type
> > > >>   B, and a Reduce combines two Type B's into one Type B. Both Types A
> > > >>   & B can be a combination of small-data (described as a Plain Old
> > > >>   Java Object, a POJO) and big-data (described as another giant H2O
> > > >>   distributed array). Here's an example map from a Type A (a pair of
> > > >>   columns) to a Type B (a POJO of class MyMR holding various sums):
> > > >>
> > > >>     class MyMR extends MRTask {
> > > >>       double sum0, sum1, sq_sum0;                 // Most things are allowed here
> > > >>       @Override public void map(double d0, double d1) {
> > > >>         sum0 += d0; sum1 += d1; sq_sum0 += d0*d0; // Again, most any Java code here
> > > >>       }
> > > >>       @Override public void reduce(MyMR my) {     // Combine two MyMRs together
> > > >>         sum0 += my.sum0; sum1 += my.sum1; sq_sum0 += my.sq_sum0;
> > > >>       }
> > > >>     }
> > > >>     new MyMR().doAll(v0, v1);                     // Invoke in-parallel, distributed over Vecs v0 & v1
> > > >>
> > > >> This code will be distributed 'round the cluster, and run at
> > > >> memory-bandwidth speeds (on compressed data!) with no further ado.
> > > >> There's a lot of mileage possible here that I'm only touching
> > > >> lightly on.
> > > >> Filtering, subsetting, writing results into temp arrays that are
> > > >> used on later passes, uniques on billions of rows, ddply-style
> > > >> group-by operations - all work in this Map/Reduce framework, and all
> > > >> work by writing plain old Java.
> > > >>   o *Scala, and a note on API cleanliness*: We fully acknowledge
> > > >>     Java's weaknesses here - this is the Java6-flavor coding style;
> > > >>     Java7 style is nicer - but still not as nice as some other
> > > >>     languages. We fully embrace & support alternative syntaxes over
> > > >>     our engine. In particular, we have an engineer working on an
> > > >>     in-process Scala interface (amongst others). We are shifting our
> > > >>     focus now, from the excellent backend to the API side of things.
> > > >>     This is a work-in-progress for us, and we are looking forward to
> > > >>     much improvement over the next year.
> > > >>
> > > >> * *Pre-Baked Algorithms Layer*: We have the following algorithms
> > > >>   pre-baked, fully optimized and full-featured: Generalized Linear
> > > >>   Modeling, including Logistic Regression plus Gaussian, Gamma,
> > > >>   Poisson, and Tweedie distributions; Neural Nets; Random Forest (one
> > > >>   that scales *out* to all the data in the cluster); Gradient Boosted
> > > >>   Machine (again, in-parallel & fully distributed); PCA; KMeans (&
> > > >>   variants); Quantiles (any quantile, computed *exactly*, in
> > > >>   milliseconds). All these algorithms support Confusion Matrices
> > > >>   (with adjustable thresholds), AUC & ROC metrics, and incremental
> > > >>   test data-set results on partially trained models during the build
> > > >>   process. Within each algorithm, we support a full range of options
> > > >>   that you'd find in the similar R or SAS package.
> > > >>   o *A note on some Mahout algorithms*: We're clearly well suited to
> > > >>     e.g. SSVD and Co-occurrence, and have talked with Ted Dunning at
> > > >>     length about how they would be implemented in H2O.
> > > >> * *REST / JSON / R / Python / Excel / REPL*: The system is
> > > >>   externally drivable via URL/REST API calls, with JSON responses. We
> > > >>   use REST/JSON from Python to drive all of our testing harness. We
> > > >>   have a very nice R package with H2O integrated behind R - you can
> > > >>   issue R commands to an H2O-backed R "data.frame" - and have all the
> > > >>   Big Math work on the Big Data in a cluster - including 90% of the
> > > >>   typical "data munging" workflow. This same REST/JSON interface also
> > > >>   works with e.g. Excel (yes, we have a demo) or shell scripts. We
> > > >>   have an R-like language REPL. We have a pretty web GUI over the
> > > >>   REST/JSON layer that is suitable for lightweight modeling tasks.
> > > >>
> > > >> Cliff

> > --
> > ceo & co-founder, 0xdata Inc <http://www.0xdata.com/>
> > +1-408.316.8192
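The MRTask pattern Cliff describes above can be sketched as a self-contained, single-JVM analogue using plain Java parallel streams. This is a toy illustration only: the class and method names (MyMR, map, reduce, doAll) mirror the example in his email, but the real H2O API is distributed and differs in its signatures.

```java
import java.util.stream.IntStream;

// Toy single-process analogue of the Map/Reduce pattern from the thread:
// map over two "columns" accumulating partial sums, then merge partials
// pairwise. Plain Java parallel streams stand in for H2O's distributed
// execution; this is NOT the actual H2O MRTask API.
public class MyMRDemo {
  static final class MyMR {
    double sum0, sum1, sqSum0;
    void map(double d0, double d1) {            // accumulate one row
      sum0 += d0; sum1 += d1; sqSum0 += d0 * d0;
    }
    void reduce(MyMR my) {                      // combine two partials
      sum0 += my.sum0; sum1 += my.sum1; sqSum0 += my.sqSum0;
    }
  }

  static MyMR doAll(double[] v0, double[] v1) {
    // Mutable reduction: each worker thread gets its own MyMR accumulator,
    // and per-chunk results are merged with reduce() - analogous to how
    // per-node results would be merged across a cluster.
    return IntStream.range(0, v0.length).parallel()
        .collect(MyMR::new, (m, i) -> m.map(v0[i], v1[i]), MyMR::reduce);
  }

  public static void main(String[] args) {
    double[] v0 = {1, 2, 3, 4};
    double[] v1 = {10, 20, 30, 40};
    MyMR r = doAll(v0, v1);
    System.out.println(r.sum0 + " " + r.sum1 + " " + r.sqSum0); // 10.0 100.0 30.0
  }
}
```

The three-argument `collect` keeps the accumulator mutation thread-safe without locks, because no MyMR instance is ever shared between threads until the combiner merges it.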