Nice to meet you, Andrew. Thanks for your hand wave. We are enjoying the warmth of a passionate community. I only meant to emphasize the fact that Mahout is the most popular entry point for ML users. There are several sophisticated algorithms in Mahout, and users love them. I'm a fan of quite a few of them: ssvd, streaming k-means, co-occurrence, cf, sim, t-digest and others. Many a night was spent reading the Random Forest implementation before moving on to the R version. It's not an accident that some of the successful Mahout algorithms have not been on H2O's first list of implementations. They have a devoted user base, and we could speed them up by a lot in-memory.
I would love to hear about the usability improvements and make at least a
few of them possible (both through the speed and interactivity we bring
in). The workflows of data science are just as important. Ad-hoc analysis
is a big part of that experience, and we focused on some of that in H2O,
with R. A simple twitter-bootstrap auto-generated web UI comes from a core
JSON API in H2O; some of that can be morphed for the cause. Ease of use
and extensibility will win user mindshare. As a scalable platform for ML
and a powerful interactive environment, with polyglot interfaces into
Python, Scala, R and Java, Mahout has the potential to become the Linux of
the machine learning world.

This is worth making happen,
Sri

On Thu, Mar 13, 2014 at 9:05 PM, Andrew Musselman
<andrew.mussel...@gmail.com> wrote:

> Thanks Sri; nice to meet you and thanks for the conversation.
>
> When you say "hello world" I presume you're emphasizing that Mahout is a
> popular entry point for people seeking to join the field, rather than its
> being simple or easy to pick up.
>
> We've been talking about ways to make Mahout easier to adopt and adapt to
> as a tool, so if your team would be willing to pitch in on some of the
> usability issues we have, I think that would be welcome along with the
> mathematics work that's already being discussed.
>
> On Thu, Mar 13, 2014 at 8:10 PM, SriSatish Ambati <srisat...@0xdata.com> wrote:
>
> > Mahout is the hello world of Machine Learning. It's still the first place
> > many new users get exposed to algorithms on big data. Making that
> > experience beautiful, accessible and value-driven will make machine
> > learning ubiquitous, and Mahout a movement to rival the success and
> > utility of, say, Lucene and Hadoop. Our vision and motivation is to
> > re-ignite the community and double down on the identical founding visions
> > of Mahout and H2O. Under one umbrella, Mahout can power intelligent
> > applications for the enterprises and users.
> > Creating great software is hard; creating passionate communities is
> > harder. Our belief is that a product is not complete without its
> > community. This convergence will make Mahout the principal platform for
> > integrating multiple ways of mining insights from data.
> >
> > the whole is greater than the sum of the parts,
> > Sri
> >
> > On Thu, Mar 13, 2014 at 6:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > > PS and of course it all sounds like a well-rounded project that exceeds
> > > current Mahout capabilities (in the mapreduce world anyway). So not the
> > > least question is why you are seeking integration with Mahout. Clearly
> > > that would involve significant effort to do some things the Mahout way.
> > > So what's the motivation?
> > > On Mar 13, 2014 6:08 PM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:
> > >
> > > > Thank you, Cliff.
> > > >
> > > > Those things are pretty much clear. Most of the questions were more
> > > > along the lines of which of those wonderful things you intend to port
> > > > to Mahout, and how you see these stitching in with the existing Mahout
> > > > architecture.
> > > >
> > > > At least one of your users here reported it does not make sense to run
> > > > Mahout on all this, and at least two of us have trouble seeing how such
> > > > disassembly and reassembly might take place. What are your thoughts on
> > > > this? How clearly do you see the reintegration roadmap?
> > > >
> > > > Do you intend to keep the h2o platform around as a standalone project?
> > > >
> > > > Do you intend to contribute top-level algorithms as well? What are your
> > > > thoughts on interoperability of top-level algorithms with other
> > > > memory-based backends?
> > > >
> > > > Thank you.
> > > > On Mar 13, 2014 3:50 PM, "Cliff Click" <ccli...@gmail.com> wrote:
> > > >
> > > >> There have been a lot of questions on the H2O architecture; I hope to
> > > >> answer the top-level ones here.
> > > >>
> > > >> H2O is a fast & flexible engine. We talk about the MapReduce
> > > >> execution flavor because it's easy to explain, because it covers a
> > > >> lot of ground, and because we've implemented a bunch of dense
> > > >> linear-algebra-style algorithms with it - but that's not the only
> > > >> thing we can do with H2O, nor is it the only coding "style".
> > > >>
> > > >> H2O is based on a number of layers, and is coded at different layers
> > > >> to best approach different tasks and objectives.
> > > >>
> > > >> * *In-memory K/V store layer*: H2O sports an in-memory
> > > >>   (non-persistent) K/V store, with *exact* (not lazy) consistency
> > > >>   semantics and transactions. Both reads and writes are fully
> > > >>   (locally) cachable. Typical cache-hit latencies for both are around
> > > >>   150ns (that's *nanoseconds*) from a NonBlockingHashMap. Let me
> > > >>   repeat that: reads and writes go through a non-blocking hash table
> > > >>   - we do NOT suffer (CANNOT suffer) from a hot-blocks problem.
> > > >>   Cache-misses obviously require a network hop, and the execution
> > > >>   times are totally driven by the size of data moved divided by
> > > >>   available bandwidth... and of course the results are cached. The
> > > >>   K/V store is currently used to hold control state, all results, and
> > > >>   of course the Big Data itself. You could certainly build a dandy
> > > >>   graph-based algorithm directly over the K/V store; that's been on
> > > >>   our long-term roadmap for a while.
> > > >>
> > > >> * *A columnar-compressed distributed Big Data store layer*: Big Data
> > > >>   is heavily (and losslessly) compressed - typically 2x to 4x better
> > > >>   than GZIP on disk (YMMV) - and can be accessed like a Java array: a
> > > >>   giant, greater-than-4-billion-element distributed Java array.
> > > >>   H2O guarantees that if the data is accessed linearly then the
> > > >>   access time will match what you can get out of C or Fortran -
> > > >>   i.e., be memory-bandwidth bound, not CPU bound. You can access the
> > > >>   array (for both reads and writes) in any order, of course, but you
> > > >>   get strong speed guarantees for accessing in-order. You can do
> > > >>   pretty much anything to an H2O array that you can do with a Java
> > > >>   array, although due to size/scale you'll probably want to access
> > > >>   the array in a blatantly parallel style.
> > > >>   o *A note on compression*: The data is decompressed Just-In-Time,
> > > >>     strictly in CPU registers, in the hot inner loops - and THIS IS
> > > >>     FASTER than decompressing beforehand, because most algorithms are
> > > >>     memory-bandwidth bound. Moving a 32-byte cache line of compressed
> > > >>     data into CPU registers gets more data per cache-miss than moving
> > > >>     4 8-byte doubles. Decompression typically takes 2-4 instructions
> > > >>     of shift/scale/add per element, and is well covered by the
> > > >>     cache-miss costs.
> > > >>   o *A note on Big Data and GC*: H2O keeps all our data *in heap*,
> > > >>     but in large arrays of Java primitives. Our experience shows that
> > > >>     we run well, without GC issues, even *on very large heaps with
> > > >>     the default collector*. We routinely test with heaps from 2G to
> > > >>     200G - and never see FullGC costs exceed a few seconds every now
> > > >>     and then (depending on the rate of Big Data writing going on).
> > > >>     The normal Java object allocation used to drive the system
> > > >>     internally has a negligible GC load. We keep our data in-heap
> > > >>     because it's as fast as possible (memory-bandwidth limited), easy
> > > >>     to code (pure Java), and has no interesting GC costs.
> > > >>     Our GC tuning policy is: "only use the -Xmx flag, set to the
> > > >>     largest you can allow given the machine resources". Take all the
> > > >>     other GC defaults; they will work fine.
> > > >>   o *A note on Bigger Data (and GC)*: We do a user-mode swap-to-disk
> > > >>     when the Java heap gets too full, i.e., when you're using more
> > > >>     Big Data than physical DRAM. We won't die with a GC death-spiral,
> > > >>     but we will degrade to out-of-core speeds. We'll go as fast as
> > > >>     the disk will allow.
> > > >>   o *A note on data ingest*: We read data fully parallelized from S3,
> > > >>     HDFS, NFS, URIs, browser uploads, etc. We can typically drive
> > > >>     HDFS disk spindles to an interesting fraction of what you can get
> > > >>     from e.g. an HDFS file-copy. We parse & compress (in parallel) a
> > > >>     very generous notion of a CSV file (for instance, Hive files are
> > > >>     directly ingestable), and SVMLight files. We are planning on an
> > > >>     RDD ingester - interactivity with other frameworks is in
> > > >>     everybody's interest.
> > > >>   o *A note on sparse data*: H2O sports about 15 different
> > > >>     compression schemes under the hood, including ones designed to
> > > >>     compress sparse data. We happily import SVMLight without ever
> > > >>     having the data "blow up", while still fully supporting the
> > > >>     array-access API, including the speed guarantees.
> > > >>   o *A note on missing data*: Most datasets have *missing* elements,
> > > >>     and most math algorithms deal with missing data specially. H2O
> > > >>     fully supports a notion of "NA" for all data, including setting,
> > > >>     testing, selecting in (or out), etc., and this notion is woven
> > > >>     through the data presentation layer.
> > > >>   o *A note on streaming data*: H2O vectors can have data inserted &
> > > >>     removed (anywhere, in any order) continuously.
> > > >>     In particular, it's easy to add new data at the end and remove
> > > >>     it from the start - i.e., build a large rolling dataset holding
> > > >>     all the elements that fit given a memory budget and a data
> > > >>     flow-rate. This has been on our roadmap for a while, and needs
> > > >>     only a little more work to be fully functional.
> > > >>
> > > >> * *Light-weight Map/Reduce layer*: Map/Reduce is a nice way to write
> > > >>   blatantly parallel code (although not the only way), and we support
> > > >>   a particularly fast and efficient flavor. A Map maps Type A to Type
> > > >>   B, and a Reduce combines two Type B's into one Type B. Both Types A
> > > >>   & B can be a combination of small-data (described as a Plain Old
> > > >>   Java Object, a POJO) and big-data (described as another giant H2O
> > > >>   distributed array). Here's an example map from a Type A (a pair of
> > > >>   columns) to a Type B (a POJO of class MyMR holding various sums):
> > > >>
> > > >>     class MyMR extends MRTask {
> > > >>       double sum0, sum1, sq_sum0;                 // Most things are allowed here
> > > >>       @Override public void map(double d0, double d1) {
> > > >>         sum0 += d0; sum1 += d1; sq_sum0 += d0*d0; // Again, most any Java code here
> > > >>       }
> > > >>       @Override public void reduce(MyMR my) {     // Combine two MyMRs together
> > > >>         sum0 += my.sum0; sum1 += my.sum1; sq_sum0 += my.sq_sum0;
> > > >>       }
> > > >>     }
> > > >>     new MyMR().doAll(v0, v1);                     // Invoke in-parallel, distributed over Vecs v0 & v1
> > > >>
> > > >> This code will be distributed 'round the cluster, and run at
> > > >> memory-bandwidth speeds (on compressed data!) with no further ado.
> > > >> There's a lot of mileage possible here that I'm only touching
> > > >> lightly on.
> > > >> Filtering, subsetting, writing results into temp arrays that are
> > > >> used on later passes, uniques on billions of rows, ddply-style
> > > >> group-by operations - all work in this Map/Reduce framework, and all
> > > >> work by writing plain old Java.
> > > >>   o *Scala, and a note on API cleanliness*: We fully acknowledge
> > > >>     Java's weaknesses here - this is the Java6-flavor coding style;
> > > >>     Java7 style is nicer - but still not as nice as some other
> > > >>     languages. We fully embrace & support alternative syntaxes over
> > > >>     our engine. In particular, we have an engineer working on an
> > > >>     in-process Scala interface (amongst others). We are shifting our
> > > >>     focus now, from the excellent backend to the API side of things.
> > > >>     This is a work-in-progress for us, and we are looking forward to
> > > >>     much improvement over the next year.
> > > >>
> > > >> * *Pre-Baked Algorithms Layer*: We have the following algorithms
> > > >>   pre-baked, fully optimized and full-featured: Generalized Linear
> > > >>   Modeling, including Logistic Regression plus Gaussian, Gamma,
> > > >>   Poisson, and Tweedie distributions; Neural Nets; Random Forest (one
> > > >>   that scales *out* to all the data in the cluster); Gradient Boosted
> > > >>   Machine (again, in-parallel & fully distributed); PCA; KMeans (&
> > > >>   variants); Quantiles (any quantile, computed *exactly*, in
> > > >>   milliseconds). All these algorithms support Confusion Matrices
> > > >>   (with adjustable thresholds), AUC & ROC metrics, and incremental
> > > >>   test data-set results on partially trained models during the build
> > > >>   process. Within each algorithm, we support a full range of options
> > > >>   that you'd find in the similar R or SAS package.
> > > >>   o *A note on some Mahout algorithms*: We're clearly well suited to
> > > >>     e.g. SSVD and Co-occurrence, and have talked with Ted Dunning at
> > > >>     length about how they would be implemented in H2O.
> > > >> * *REST / JSON / R / Python / Excel / REPL*: The system is
> > > >>   externally drivable via URL/REST API calls, with JSON responses. We
> > > >>   use REST/JSON from Python to drive all of our testing harness. We
> > > >>   have a very nice R package with H2O integrated behind R - you can
> > > >>   issue R commands to an H2O-backed R "data.frame" - and have all the
> > > >>   Big Math work on the Big Data in a cluster - including 90% of the
> > > >>   typical "data munging" workflow. This same REST/JSON interface also
> > > >>   works with e.g. Excel (yes, we have a demo) or shell scripts. We
> > > >>   have an R-like language REPL. We have a pretty web GUI over the
> > > >>   REST/JSON layer that is suitable for lightweight modeling tasks.
> > > >>
> > > >> Cliff

> > --
> > ceo & co-founder, 0xdata Inc <http://www.0xdata.com/>
> > +1-408.316.8192
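The MRTask pattern Cliff describes above can be sketched as a self-contained, single-JVM analogue using plain Java parallel streams. This is a toy illustration only: the class and method names (MyMR, map, reduce, doAll) mirror the example in his email, but the real H2O API is distributed and differs in its signatures.

```java
import java.util.stream.IntStream;

// Toy single-process analogue of the Map/Reduce pattern from the thread:
// map over two "columns" accumulating partial sums, then merge partials
// pairwise. Plain Java parallel streams stand in for H2O's distributed
// execution; this is NOT the actual H2O MRTask API.
public class MyMRDemo {
  static final class MyMR {
    double sum0, sum1, sqSum0;
    void map(double d0, double d1) {            // accumulate one row
      sum0 += d0; sum1 += d1; sqSum0 += d0 * d0;
    }
    void reduce(MyMR my) {                      // combine two partials
      sum0 += my.sum0; sum1 += my.sum1; sqSum0 += my.sqSum0;
    }
  }

  static MyMR doAll(double[] v0, double[] v1) {
    // Mutable reduction: each worker thread gets its own MyMR accumulator,
    // and per-chunk results are merged with reduce() - analogous to how
    // per-node results would be merged across a cluster.
    return IntStream.range(0, v0.length).parallel()
        .collect(MyMR::new, (m, i) -> m.map(v0[i], v1[i]), MyMR::reduce);
  }

  public static void main(String[] args) {
    double[] v0 = {1, 2, 3, 4};
    double[] v1 = {10, 20, 30, 40};
    MyMR r = doAll(v0, v1);
    System.out.println(r.sum0 + " " + r.sum1 + " " + r.sqSum0); // 10.0 100.0 30.0
  }
}
```

The three-argument `collect` keeps the accumulator mutation thread-safe without locks, because no MyMR instance is ever shared between threads until the combiner merges it.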