I think this could truly be the year of Mahout on the desktop! :)

As to usability, we're building tools to improve the usability -> help ->
fixes workflow, including parsers for the IRC history logs and for the user
and dev mailing lists, which feed into jobs to automate the FAQs for the
project.


On Thu, Mar 13, 2014 at 10:06 PM, SriSatish Ambati <srisat...@0xdata.com> wrote:

> Nice to meet you, Andrew. Thanks for your hand wave. We are enjoying the
> warmth of a passionate community.
> I only meant it to emphasize the fact that Mahout is the most popular entry
> point for ML users. There are several sophisticated algorithms in Mahout.
> Users love them. I'm a fan of quite a few of them: ssvd, streaming k-means,
> co-occurrence, cf, sim, t-digest and others. Many a night was spent reading
> the Random Forest implementation before moving on to the R version. It's
> not an accident that some of the successful Mahout algorithms have not been
> on H2O's first list of implementations: they have a devoted user base, and
> we could speed them up by a lot in-memory.
>
> I would love to hear of the usability improvements and help make at least a
> few of them possible (through both the speed & interactivity we bring in).
> The workflows of data science are just as important. Ad hoc analysis is a
> big part of that experience, and we focused on some of that in H2O with R.
> A simple Twitter Bootstrap auto-generated web API comes from a core JSON
> API in H2O, some of which can be morphed for the cause. Ease of use and
> extensibility will win user mindshare.
>
> As a scalable platform for ML and a powerful interactive environment,
> with polyglot interfaces into Python, Scala, R and Java,
> Mahout has the potential to become the Linux of the machine learning world.
>
> This is worth making happen, Sri
>
> On Thu, Mar 13, 2014 at 9:05 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
> > Thanks Sri; nice to meet you and thanks for the conversation.
> >
> > When you say "hello world" I presume you're emphasizing that Mahout is a
> > popular entry point for people seeking to join the field, rather than its
> > being simple or easy to pick up.
> >
> > We've been talking about ways to make Mahout easier to adopt and adapt to
> > as a tool, so if your team would be willing to pitch in on some of the
> > usability issues we have I think that would be welcome along with the
> > mathematics work that's already being discussed.
> >
> >
> > On Thu, Mar 13, 2014 at 8:10 PM, SriSatish Ambati <srisat...@0xdata.com>
> > wrote:
> >
> > > Mahout is the hello world of Machine Learning. It's still the first
> > > place many new users get exposed to algorithms on big data. Making that
> > > experience beautiful, accessible and value-driven will make machine
> > > learning ubiquitous, and Mahout a movement to rival the success &
> > > utility of, say, Lucene and Hadoop. Our vision and motivation is to
> > > re-ignite the community & double down on the identical founding visions
> > > of Mahout and H2O. Under one umbrella, Mahout can power intelligent
> > > applications for enterprises and users.
> > >
> > > Creating great software is hard; creating passionate communities is
> > > harder. Our belief is that a product is not complete without its
> > > community. This convergence will make Mahout the principal platform for
> > > integrating multiple ways of mining insights from data.
> > >
> > > the whole is greater than the sum of the parts,
> > > Sri
> > >
> > >
> > >
> > > On Thu, Mar 13, 2014 at 6:13 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > > wrote:
> > >
> > > > PS and of course it all sounds like a well-rounded project that
> > > > exceeds current Mahout capabilities (in the mapreduce world, anyway).
> > > > So not the least question is why you are seeking integration with
> > > > Mahout. Clearly that would involve significant effort to do some
> > > > things the Mahout way. So what's the motivation?
> > > > On Mar 13, 2014 6:08 PM, "Dmitriy Lyubimov" <dlie...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thank you, Cliff.
> > > > >
> > > > > Those things are pretty much clear. Most of the questions were more
> > > > > along the lines of which of those wonderful things you intend to
> > > > > port to Mahout, and how you see these stitching in with the existing
> > > > > Mahout architecture.
> > > > >
> > > > > At least one of your users here reported that it does not make sense
> > > > > to run Mahout on all this, and at least two of us have trouble
> > > > > seeing how such disassembly and reassembly might take place. What
> > > > > are your thoughts on this? How clearly do you see the reintegration
> > > > > roadmap?
> > > > >
> > > > > Do you intend to keep the h2o platform around as a standalone
> > > > > project?
> > > > >
> > > > > Do you intend to contribute top-level algorithms as well? What are
> > > > > your thoughts on the interoperability of top-level algorithms with
> > > > > other memory-based backends?
> > > > >
> > > > > Thank you.
> > > > > On Mar 13, 2014 3:50 PM, "Cliff Click" <ccli...@gmail.com> wrote:
> > > > >
> > > > >> There have been a lot of questions on the H2O architecture; I hope
> > > > >> to answer the top-level ones here.
> > > > >>
> > > > >>
> > > > >> H2O is a fast & flexible engine.  We talk about the MapReduce
> > > > >> execution flavor because it's easy to explain, because it covers a
> > > > >> lot of ground, and because we've implemented a bunch of dense
> > > > >> linear-algebra style algorithms with it - but that's not the only
> > > > >> thing we can do with H2O, nor is it the only coding "style".
> > > > >>
> > > > >>
> > > > >> H2O is based on a number of layers, and is coded at different
> > > > >> layers to best approach different tasks and objectives.
> > > > >>
> > > > >>  * *In-memory K/V store layer*: H2O sports an in-memory
> > > > >>    (non-persistent) K/V store, with **exact** (not lazy)
> > > > >>    consistency semantics and transactions.  Both reads and writes
> > > > >>    are fully (locally) cachable.  Typical cache-hit latencies for
> > > > >>    both are around 150ns (that's **nanoseconds**) from a
> > > > >>    NonBlockingHashMap.  Let me repeat that: reads and writes go
> > > > >>    through a non-blocking hash table - we do NOT suffer (CANNOT
> > > > >>    suffer) from a hot-blocks problem.  Cache-misses obviously
> > > > >>    require a network hop, and the execution times are totally
> > > > >>    driven by the size of data moved divided by available
> > > > >>    bandwidth... and of course the results are cached.  The K/V
> > > > >>    store is currently used to hold control state, all results, and
> > > > >>    of course the Big Data itself.  You could certainly build a
> > > > >>    dandy graph-based algorithm directly over the K/V store; that's
> > > > >>    been on our long-term roadmap for awhile.
> > > > >>
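The K/V layer described above can be sketched in a few lines of plain Java. Note the hedges: H2O uses Cliff Click's lock-free NonBlockingHashMap, while this self-contained sketch substitutes the JDK's ConcurrentHashMap, and the "remote fetch" on a cache miss is only a placeholder for the network hop he describes.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: H2O uses NonBlockingHashMap; ConcurrentHashMap stands in here
// so the example compiles with the JDK alone.
class KVSketch {
    private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();

    // Cache hit: a plain hash-table probe (the ~150ns case in the text).
    // Cache miss: would cost a network hop to the key's home node; the
    // remoteFetch below is a placeholder, and its result is cached.
    byte[] get(String key) {
        return cache.computeIfAbsent(key, k -> remoteFetch(k));
    }

    void put(String key, byte[] value) {
        cache.put(key, value);   // writes are locally cachable too
    }

    private byte[] remoteFetch(String key) {
        return new byte[0];      // stand-in for a network round-trip
    }
}
```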
> > > > >>  * *A columnar-compressed distributed Big Data store layer*: Big
> > > > >>    Data is heavily (and losslessly) compressed - typically 2x to
> > > > >>    4x better than GZIP on disk (YMMV) - and can be accessed like a
> > > > >>    Java array: a giant greater-than-4-billion-element distributed
> > > > >>    Java array.  H2O guarantees that if the data is accessed
> > > > >>    linearly then the access time will match what you can get out
> > > > >>    of C or Fortran - i.e., be memory-bandwidth bound, not CPU
> > > > >>    bound.  You can access the array (for both reads and writes) in
> > > > >>    any order, of course, but you get strong speed guarantees for
> > > > >>    accessing in-order.  You can do pretty much anything to an H2O
> > > > >>    array that you can do with a Java array, although due to
> > > > >>    size/scale you'll probably want to access the array in a
> > > > >>    blatantly parallel style.
> > > > >>      o */A note on compression/*: The data is decompressed
> > > > >>        Just-In-Time, strictly in CPU registers, in the hot inner
> > > > >>        loops - and THIS IS FASTER than decompressing beforehand,
> > > > >>        because most algorithms are memory-bandwidth bound.  Moving
> > > > >>        a 32-byte cacheline of compressed data into CPU registers
> > > > >>        gets more data per cache-miss than moving 4 8-byte doubles.
> > > > >>        Decompression typically takes 2-4 instructions of
> > > > >>        shift/scale/add per element, and is well covered by the
> > > > >>        cache-miss costs.
> > > > >>      o */A note on Big Data and GC/*: H2O keeps all our data **in
> > > > >>        heap**, but in large arrays of Java primitives.  Our
> > > > >>        experience shows that we run well, without GC issues, even
> > > > >>        *on very large heaps with the default collector*.  We
> > > > >>        routinely test with e.g. heaps from 2G to 200G - and never
> > > > >>        see FullGC costs exceed a few seconds every now and then
> > > > >>        (depends on the rate of Big Data writing going on).  The
> > > > >>        normal Java object allocation used to drive the system
> > > > >>        internally has a negligible GC load.  We keep our data
> > > > >>        in-heap because it's as fast as possible (memory-bandwidth
> > > > >>        limited), easy to code (pure Java), and has no interesting
> > > > >>        GC costs.  Our GC tuning policy is: "only use the -Xmx
> > > > >>        flag, set to the largest you can allow given the machine
> > > > >>        resources".  Take all the other GC defaults; they will work
> > > > >>        fine.
> > > > >>      o */A note on Bigger Data (and GC)/*: We do a user-mode
> > > > >>        swap-to-disk when the Java heap gets too full, i.e., you're
> > > > >>        using more Big Data than physical DRAM.  We won't die with
> > > > >>        a GC death-spiral, but we will degrade to out-of-core
> > > > >>        speeds.  We'll go as fast as the disk will allow.
> > > > >>      o */A note on data ingest/*: We read data fully parallelized
> > > > >>        from S3, HDFS, NFS, URIs, browser uploads, etc.  We can
> > > > >>        typically drive HDFS disk spindles to an interesting
> > > > >>        fraction of what you can get from e.g. HDFS file-copy.  We
> > > > >>        parse & compress (in parallel) a very generous notion of a
> > > > >>        CSV file (for instance, Hive files are directly ingestible),
> > > > >>        and SVMLight files.  We are planning on an RDD ingester -
> > > > >>        interoperability with other frameworks is in everybody's
> > > > >>        interest.
> > > > >>      o */A note on sparse data/*: H2O sports about 15 different
> > > > >>        compression schemes under the hood, including ones designed
> > > > >>        to compress sparse data.  We happily import SVMLight without
> > > > >>        the data ever "blowing up", while still fully supporting the
> > > > >>        array-access API, including speed guarantees.
> > > > >>      o */A note on missing data/*: Most datasets have *missing*
> > > > >>        elements, and most math algorithms deal with missing data
> > > > >>        specially.  H2O fully supports a notion of "NA" for all
> > > > >>        data, including setting, testing, selecting in (or out),
> > > > >>        etc., and this notion is woven through the data presentation
> > > > >>        layer.
> > > > >>      o */A note on streaming data/*: H2O vectors can have data
> > > > >>        inserted & removed (anywhere, in any order) continuously.
> > > > >>        In particular, it's easy to add new data at the end and
> > > > >>        remove it from the start - i.e., build a large rolling
> > > > >>        dataset holding all the elements that fit given a memory
> > > > >>        budget and a data flow-rate.  This has been on our roadmap
> > > > >>        for awhile, and needs only a little more work to be fully
> > > > >>        functional.
> > > > >>
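The shift/scale/add decompression idea is easy to see in miniature. This is an illustrative encoding only, not H2O's actual scheme: a numeric column packed to one byte per element plus a shared scale and offset, decoded on the fly inside the hot loop so the decompressed value never touches a temporary buffer.

```java
// Illustrative scale/offset byte-packing; H2O's real compression schemes
// (about 15 of them) are more sophisticated, but the decode shape is similar:
// a couple of scale/add operations per element, done in registers.
class ColumnSketch {
    final byte[] packed;         // compressed column: 1 byte per element
    final double scale, offset;

    ColumnSketch(double[] raw, double scale, double offset) {
        this.scale = scale;
        this.offset = offset;
        packed = new byte[raw.length];
        for (int i = 0; i < raw.length; i++)
            packed[i] = (byte) Math.round((raw[i] - offset) / scale);
    }

    // Decompress element i on the fly: one multiply, one add.
    double at(int i) { return packed[i] * scale + offset; }

    // A memory-bandwidth-bound pass over the compressed bytes: each
    // cacheline carries 32+ elements instead of 4 doubles.
    double sum() {
        double s = 0;
        for (int i = 0; i < packed.length; i++) s += at(i);
        return s;
    }
}
```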
> > > > >>  * */Light-weight Map/Reduce layer/*: Map/Reduce is a nice way to
> > > > >>    write blatantly parallel code (although not the only way), and
> > > > >>    we support a particularly fast and efficient flavor.  A Map
> > > > >>    maps Type A to Type B, and a Reduce combines two Type B's into
> > > > >>    one Type B.  Both Types A & B can be a combination of
> > > > >>    small-data (described as a Plain Old Java Object, a POJO) and
> > > > >>    big-data (described as another giant H2O distributed array).
> > > > >>    Here's an example map from a Type A (a pair of columns) to a
> > > > >>    Type B (a POJO of class MyMR holding various sums):
> > > > >>
> > > > >>    *class MyMR extends MRTask {
> > > > >>       double sum0, sum1, sq_sum0;        // Most things are allowed here
> > > > >>       @Override public void map( double d0, double d1 ) {
> > > > >>         sum0 += d0;  sum1 += d1;  sq_sum0 += d0*d0;  // Again, most any Java code here
> > > > >>       }
> > > > >>       @Override public void reduce( MyMR my ) {  // Combine two MyMRs together
> > > > >>         sum0 += my.sum0;  sum1 += my.sum1;  sq_sum0 += my.sq_sum0;
> > > > >>       }
> > > > >>     }
> > > > >>     new MyMR().doAll( v0, v1 );  // Invoke in-parallel, distributed*
> > > > >>
> > > > >>    This code will be distributed 'round the cluster, and run at
> > > > >>    memory-bandwidth speeds (on compressed data!) with no further
> > > > >>    ado.  There's a lot of mileage possible here that I'm only
> > > > >>    touching on lightly.  Filtering, subsetting, writing results
> > > > >>    into temp arrays that are used on later passes, uniques on
> > > > >>    billions of rows, ddply-style group-by operations - all work in
> > > > >>    this Map/Reduce framework, and all work by writing plain old
> > > > >>    Java.
> > > > >>      o */Scala, and a note on API cleanliness/*: We fully
> > > > >>        acknowledge Java's weaknesses here - this is the Java6
> > > > >>        flavor coding style; Java7 style is nicer - but still not
> > > > >>        as nice as some other languages.  We fully embrace &
> > > > >>        support alternative syntaxes over our engine.  In
> > > > >>        particular, we have an engineer working on an in-process
> > > > >>        Scala interface (amongst others).  We are shifting our
> > > > >>        focus now from the excellent backend to the API side of
> > > > >>        things.  This is a work-in-progress for us, and we are
> > > > >>        looking forward to much improvement over the next year.
> > > > >>
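The map/reduce shape in Cliff's example can be exercised without any H2O dependency: map folds one row into an accumulator POJO, reduce merges two accumulators. This standalone sketch splits the rows in half to mimic two cluster nodes; the class and method names mirror his example but the `doAll` here is a local stand-in, not H2O's distributed one.

```java
// Plain-Java sketch of the MRTask pattern: map(row) folds into the
// accumulator, reduce(other) merges two partial accumulators. H2O
// distributes exactly this shape across the cluster; we run it locally.
class MyMRSketch {
    double sum0, sum1, sqSum0;

    void map(double d0, double d1) {           // fold in one row
        sum0 += d0;  sum1 += d1;  sqSum0 += d0 * d0;
    }

    void reduce(MyMRSketch other) {            // merge two partial results
        sum0 += other.sum0;  sum1 += other.sum1;  sqSum0 += other.sqSum0;
    }

    // Local stand-in for H2O's distributed doAll: split the rows across
    // two accumulators ("nodes"), map each half, then reduce.
    static MyMRSketch doAll(double[] v0, double[] v1) {
        MyMRSketch left = new MyMRSketch(), right = new MyMRSketch();
        int mid = v0.length / 2;
        for (int i = 0; i < mid; i++)         left.map(v0[i], v1[i]);
        for (int i = mid; i < v0.length; i++) right.map(v0[i], v1[i]);
        left.reduce(right);
        return left;
    }
}
```

Because reduce is associative and commutative here, the split point (and the number of "nodes") does not change the result, which is what makes the pattern safe to distribute.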
> > > > >>  * *Pre-Baked Algorithms Layer*: We have the following algorithms
> > > > >>    pre-baked, fully optimized and full-featured: Generalized
> > > > >>    Linear Modeling, including Logistic Regression plus Gaussian,
> > > > >>    Gamma, Poisson, and Tweedie distributions.  Neural Nets.
> > > > >>    Random Forest (that scales *out* to all the data in the
> > > > >>    cluster).  Gradient Boosted Machine (again, in-parallel &
> > > > >>    fully distributed).  PCA.  KMeans (& variants).  Quantiles
> > > > >>    (any quantile, computed *exactly* in milliseconds).  All these
> > > > >>    algorithms support Confusion Matrices (with adjustable
> > > > >>    thresholds), AUC & ROC metrics, and incremental test data-set
> > > > >>    results on partially trained models during the build process.
> > > > >>    Within each algorithm, we support a full range of options that
> > > > >>    you'd find in a similar R or SAS package.
> > > > >>      o */A note on some Mahout algorithms/*: We're clearly well
> > > > >>        suited to e.g. SSVD and Co-occurrence, and have talked with
> > > > >>        Ted Dunning at length on how they would be implemented in
> > > > >>        H2O.
> > > > >>
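The "exact quantiles" item above reduces to an order statistic. A toy single-node sketch, using the nearest-rank convention (one common choice, picked here only for illustration; H2O computes this distributed over compressed columns):

```java
import java.util.Arrays;

// Toy exact quantile: sort, then index by nearest rank. Single-node and
// O(n log n) -- illustrative only, not H2O's distributed implementation.
class QuantileSketch {
    static double quantile(double[] xs, double q) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        // Nearest-rank rule: smallest element with at least a fraction q
        // of the data at or below it.
        int rank = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }
}
```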
> > > > >>  * *REST / JSON / R / Python / Excel / REPL*: The system is
> > > > >>    externally drivable via URLs/REST API calls, with JSON
> > > > >>    responses.  We use REST/JSON from Python to drive our entire
> > > > >>    testing harness.  We have a very nice R package with H2O
> > > > >>    integrated behind R - you can issue R commands to an
> > > > >>    H2O-backed R "data.frame" - and have all the Big Math work on
> > > > >>    the Big Data in a cluster - including 90% of the typical "data
> > > > >>    munging" workflow.  This same REST/JSON interface also works
> > > > >>    with e.g. Excel (yes, we have a demo) or shell scripts.  We
> > > > >>    have an R-like language REPL.  We have a pretty web-GUI over
> > > > >>    the REST/JSON layer that is suitable for lightweight modeling
> > > > >>    tasks.
> > > > >>
> > > > >>
> > > > >> Cliff
> > > > >>
> > > > >>
> > > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > ceo & co-founder, 0xdata Inc <http://www.0xdata.com/>
> > > +1-408.316.8192
> > >
> >
>
