Re: Fwd: machine learning API, common models

Simone Robutti Mon, 30 May 2016 00:44:08 -0700

+1

2016-05-27 17:18 GMT+02:00 Kam Kasravi <[email protected]>:


> Hi Beam ML community
>
> Based on comments from a number of you and some discussion we've had here
> we thought we would suggest the following direction:
>
>    - Begin with primitive operations common and critical to almost all ML
>    algorithms. These primitive operators would include:
>       - linear algebra operations - borrowing from established libraries
>       like Samsara.
>       - iterative processing - also central to ML, where replay of datasets
>       is easy to specify, as well as thresholds or halting criteria. This
>       coordinates well with FlinkML's current approach and base APIs.
>       - possibly new broadcast mechanisms not normally available within BSP
>       frameworks such as Beam.
>    - Normalize dataset and parameters that differ across current major ML
>    libraries that offer the same types of models.
>    - Favor a native ML implementation rather than a thin wrapper in order
>    to provide consistency across runners. This will also allow Beam ML to
>    maximize quality and consistency across runners.
>    - Support for languages also supported in the Beam runners (Java,
>    Python, Scala).
>    - Implement several common ML algorithms using the low-level primitives
>    on one or more available runners to validate both the low-level APIs
>    and possible improvements on the high-level API.
>
> Scikit-learn pipelines and existing portable libraries like xgboost4j will
> be valuable as models for the high-level APIs - for example how xgboost4j
> currently integrates with Spark and Flink.
>
> We welcome further comments and further refinements in approach.
>
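[Editorial illustration, not part of the original messages.] The
"iterative processing" primitive described in the list above (replaying a
dataset until a threshold or halting criterion is met) could look roughly
like the following. This is a minimal sketch in plain Python and uses no
Beam, Flink or Mahout API; the names iterate_until, step, tolerance,
max_iterations and mean_step are all hypothetical and chosen only for
illustration.

```python
from typing import Callable, Iterable, List, Tuple


def iterate_until(
    dataset: Iterable[float],
    initial_state: float,
    step: Callable[[float, List[float]], Tuple[float, float]],
    tolerance: float = 1e-6,
    max_iterations: int = 100,
) -> Tuple[float, int]:
    """Replay `dataset` through `step` until a halting criterion is met.

    `step` returns the new state and a measure of how much it changed.
    Iteration stops when that change drops below `tolerance` (threshold)
    or after `max_iterations` passes (hard limit).
    """
    records = list(dataset)  # materialize once so every pass can replay it
    state = initial_state
    for iteration in range(1, max_iterations + 1):
        state, change = step(state, records)
        if change < tolerance:
            return state, iteration
    return state, max_iterations


def mean_step(estimate: float, records: List[float]) -> Tuple[float, float]:
    """One gradient-descent pass that moves `estimate` towards the mean."""
    gradient = sum(estimate - x for x in records) / len(records)
    new_estimate = estimate - 0.5 * gradient
    return new_estimate, abs(new_estimate - estimate)


# Usage: estimate the mean of a small dataset by repeated passes.
estimate, passes = iterate_until([1.0, 2.0, 3.0, 6.0], 0.0, mean_step)
print(f"estimate={estimate:.6f} after {passes} passes")
```

Keeping the halting logic in the driver rather than in `step` is only one
possible design; it lets the same loop serve different algorithms that
report their own convergence measure.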
> On Sun, May 22, 2016 at 7:43 PM, Henry Saputra <[email protected]>
> wrote:
>
> > @Frances:
> >
> > That would probably be the way to go IF we decide to have ML in Beam.
> >
> > @Simone:
> >
> > I would definitely love to see Beam introduce ML model APIs to abstract
> > and unify all "dataflow" runner frameworks, such as Flink ML and Spark
> > ML.
> >
> > However, as you mentioned before, the target audience would be
> > distributed-systems or ML engineers.
> > But I can see that we would then have to provide some out-of-the-box ML
> > algorithms (model training and fine-tuning) in addition to testing the
> > models and APIs.
> >
> > The expectation would be that these models are "production" ready, in
> > which case they will mostly be used by Data Scientists via some
> > configuration, since they won't, and most can't, use the Java language.
> >
> > I would instead love to see more integration with existing ML frameworks
> > like XGBoost [1], Mahout Samsara [2], or DL4J [3] for ML APIs and models
> > in Beam.
> >
> > Thoughts and comments are definitely welcome =)
> >
> > - Henry
> >
> > [1] https://github.com/dmlc/xgboost
> > [2] https://mahout.apache.org/users/environment/out-of-core-reference.html
> > [3] http://deeplearning4j.org
> > <http://deeplearning4j.org/image-data-pipeline.html#record>
> >
> >
> > On Sat, May 21, 2016 at 2:01 AM, Simone Robutti <
> > [email protected]> wrote:
> >
> > > I think these APIs won't be used by Data Scientists (R, Python) but by
> > > Machine Learning Engineers (Scala, Java or C++ in different
> > > environments), and as an ML Engineer it makes a lot of sense to me to
> > > have such an API if I'm using Beam. It would make a lot more sense to
> > > implement algorithms directly in Beam, but that will come in the future,
> > > I hope.
> > >
> > > 2016-05-21 0:35 GMT+02:00 Henry Saputra <[email protected]>:
> > >
> > > > I am a bit concerned about adding ML model APIs to Beam because of the
> > > > fluctuating nature of the ML landscape and also because, in reality,
> > > > most data scientists tend to use Python and R for most of their work,
> > > > with existing model definitions.
> > > >
> > > > Even though you could say something like Spark ML is popular, that is
> > > > merely because it is part of Apache Spark rather than because of the
> > > > quality of the ML module itself.
> > > >
> > > > The pipeline and most of the tooling are inspired by scikit-learn, and
> > > > hence it relies on familiarity with that library to attract developers.
> > > >
> > > > My question is whether fully end-to-end ML APIs are needed as part of
> > > > the core Beam APIs.
> > > >
> > > > - Henry
> > > >
> > > > On Thu, May 19, 2016 at 5:46 AM, Jianfeng Qian <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > I am quite interested in this proposal.
> > > > > It is great to consider a lot of machine learning projects.
> > > > > Currently, most algorithms in Spark MLlib are batch processing, while
> > > > > Oryx 2 and streamDM focus on real-time machine learning.
> > > > > And Flink works with the SAMOA team to integrate stream mining
> > > > > algorithms, too.
> > > > > So I wonder whether it is possible to design a flexible SDK which
> > > > > allows users to call different third-party packages or their own
> > > > > algorithms?
> > > > >
> > > > > Best,
> > > > > Jianfeng
> > > > >
> > > > > On 2016-05-17 22:01, Suneel Marthi wrote:
> > > > > > Thanks Simone for pointing this out.
> > > > > >
> > > > > > On the Apache Mahout project we have distributed linear algebra with
> > > > > > R-like semantics that can be executed on Spark/Flink/H2O.
> > > > > >
> > > > > > @Kam: the document you point out is old and outdated; the most
> > > > > > up-to-date reference to the Samsara API is the book "Apache Mahout:
> > > > > > Beyond MapReduce". (shameless marketing here on behalf of fellow
> > > > > > committers :) )
> > > > > >
> > > > > > We added Flink DataSet API support in the recent Mahout 0.12.0
> > > > > > release (April 11, 2016), and it was called out in my talk at
> > > > > > ApacheBigData in Vancouver last week.
> > > > > >
> > > > > > The Mahout community would definitely be interested in being involved
> > > > > > with this and sharing notes.
> > > > > >
> > > > > > IMHO, the focus should first be on building a good linalg foundation
> > > > > > before embarking on building algos and pipelines. Adding @dlyubimov
> > > > > > to this.
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------- Forwarded message ----------
> > > > > > From: Simone Robutti <[email protected]>
> > > > > > Date: Tue, May 17, 2016 at 9:48 AM
> > > > > > Subject: Fwd: machine learning API, common models
> > > > > > To: Suneel Marthi <[email protected]>
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------- Forwarded message ----------
> > > > > > From: Kavulya, Soila P <[email protected]>
> > > > > > Date: 2016-05-17 1:53 GMT+02:00
> > > > > > Subject: RE: machine learning API, common models
> > > > > > To: "[email protected]" <[email protected]>
> > > > > >
> > > > > >
> > > > > > Thanks Simone,
> > > > > >
> > > > > > You have raised a valid concern about how different frameworks will
> > > > > > have different implementations and parameter semantics for the same
> > > > > > algorithm. I agree that it is important to keep this in mind.
> > > > > > Hopefully, through this exercise, we will identify a good set of
> > > > > > common ML abstractions across different frameworks.
> > > > > >
> > > > > > Feel free to edit the document. We had limited the first pass of the
> > > > > > comparison matrix to the machine learning pipeline APIs, but we can
> > > > > > extend it to include other ML building blocks like linear algebra
> > > > > > operations, and APIs for optimizers like gradient descent.
> > > > > >
> > > > > > Soila
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Kam Kasravi [mailto:[email protected]]
> > > > > > Sent: Monday, May 16, 2016 8:22 AM
> > > > > > To: [email protected]
> > > > > > Subject: Re: machine learning API, common models
> > > > > >
> > > > > > Thanks Simone - yes I had read your concerns on dev and I think
> > > > > > they're well founded.
> > > > > > Thanks for the Samsara reference - I've been looking at the
> > > > > > Spark/Scala bindings
> > > > > > http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
> > > > > > .
> > > > > >
> > > > > > I think we should expand the document to include linear algebraic ops
> > > > > > or at least pay due diligence to them. If you're doing anything on
> > > > > > the Flink side in this regard, let us know, or feel free to suggest
> > > > > > edits/updates to the document.
> > > > > >
> > > > > > Thanks
> > > > > > Kam
> > > > > >
> > > > > > On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > >> Hello,
> > > > > >>
> > > > > > >> I'm Simone and I just began contributing to Flink ML (actually on
> > > > > > >> the distributed linalg part). I already expressed my concerns about
> > > > > > >> the idea of a high-level API relying on specific frameworks'
> > > > > > >> implementations: different implementations produce different
> > > > > > >> results and may vary in quality. Also the semantics of parameters
> > > > > > >> may change from one implementation to the other. This could hinder
> > > > > > >> portability and transparency. I believe these problems could be
> > > > > > >> handled by paying due attention to the details of every single
> > > > > > >> implementation, but I invite you not to underestimate them.
> > > > > >>
> > > > > > >> On the other hand the API in itself looks good to me. From my side,
> > > > > > >> I hope to fill some of the gaps in Flink you underlined in the
> > > > > > >> comparison matrix.
> > > > > > >> Talking about matrices, proper matrices this time, I believe it
> > > > > > >> would be useful to include in this API support for linear algebra
> > > > > > >> operations. Something similar is already present in Mahout's
> > > > > > >> Samsara and it looks really good, but clearly a similar
> > > > > > >> implementation on Beam would be way more interesting and powerful.
> > > > > >>
> > > > > >> My 2 cents,
> > > > > >>
> > > > > >> Simone
> > > > > >>
> > > > > >>
> > > > > >> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P <
> > > [email protected]
> > > > >:
> > > > > >>
> > > > > >>> Hi Tyler,
> > > > > >>>
> > > > > > >>> Thank you so much for your feedback. I agree that starting with the
> > > > > > >>> high-level API is a good direction. We are interested in Python
> > > > > > >>> because it is the language that our data scientists are most
> > > > > > >>> familiar with. I think starting with Java would be the best
> > > > > > >>> approach, because the Python API can be a thin wrapper for the Java
> > > > > > >>> API.
> > > > > >>>
> > > > > > >>> In Spark, the Scala, Java and Python APIs are identical. Flink does
> > > > > > >>> not have a Python API for ML pipelines at present.
> > > > > >>>
> > > > > >>> Could you point me to the updated runner API?
> > > > > >>>
> > > > > >>> Soila
> > > > > >>>
> > > > > >>> -----Original Message-----
> > > > > >>> From: Tyler Akidau [mailto:[email protected]]
> > > > > >>> Sent: Friday, May 13, 2016 6:34 PM
> > > > > >>> To: [email protected]
> > > > > >>> Subject: Re: machine learning API, common models
> > > > > >>>
> > > > > >>> Hi Kam & Soila,
> > > > > >>>
> > > > > > >>> Thanks a lot for writing this up. I ran the doc past some of the
> > > > > > >>> folks who've been doing ML work here at Google, and they were
> > > > > > >>> generally happy with the distillation of common methods in the doc.
> > > > > > >>> I'd be curious to hear what folks on the Flink- and Spark-runner
> > > > > > >>> sides think.
> > > > > >>>
> > > > > > >>> To me, this seems like a good direction for a high-level API.
> > > > > > >>> Presumably, once a high-level API is in place, we could begin
> > > > > > >>> looking at what it would take to add lower-level ML algorithm
> > > > > > >>> support (e.g. iterative) to the Beam Model. Is this essentially
> > > > > > >>> what you're thinking?
> > > > > >>>
> > > > > >>> Some more specific questions/comments:
> > > > > >>>
> > > > > > >>>     - Presumably you'd want to tackle this in Java first, since
> > > > > > >>>     that's the only language we currently support? Given that half
> > > > > > >>>     of your examples are in Python, I'm also assuming Python will
> > > > > > >>>     be interesting once it's available.
> > > > > > >>>
> > > > > > >>>     - Along those lines, what languages are represented in the
> > > > > > >>>     capability matrix? E.g. is Spark ML support as detailed there
> > > > > > >>>     identical across Java/Scala and Python?
> > > > > > >>>
> > > > > > >>>     - Have you thought about how this would tie in at the runner
> > > > > > >>>     level, particularly given the updated Runner API changes that
> > > > > > >>>     are coming? I'm assuming they'd be provided as composite
> > > > > > >>>     transforms that (for now) would have no default
> > > > > > >>>     implementation, given the lack of low-level primitives for ML
> > > > > > >>>     algorithms, but am curious what your thoughts are there.
> > > > > > >>>
> > > > > > >>>     - I still don't fully understand how incremental updates due to
> > > > > > >>>     model drift would tie in at the API level. There's a comment
> > > > > > >>>     thread in the doc still open tracking this, so no need to
> > > > > > >>>     comment here additionally. Just pointing it out as one of the
> > > > > > >>>     things that stands out as potentially having API-level impacts
> > > > > > >>>     to me that doesn't seem 100% fleshed out in the doc yet (though
> > > > > > >>>     that admittedly may just be my limited understanding at this
> > > > > > >>>     point :-).
> > > > > >>>
> > > > > >>> -Tyler
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > > >>> On Fri, May 13, 2016 at 10:48 AM Kam Kasravi <[email protected]>
> > > > > > >>> wrote:
> > > > > >>>> Hi Tyler - my bad. Comments should be enabled now.
> > > > > >>>>
> > > > > > >>>> On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau
> > > > > > >>>> <[email protected]>
> > > > > > >>>> wrote:
> > > > > >>>>
> > > > > > >>>>> Thanks a lot, Kam. Can you please enable comment access on the
> > > > > > >>>>> doc? I seem to have view access only.
> > > > > >>>>>
> > > > > >>>>> -Tyler
> > > > > >>>>>
> > > > > >>>>> On Fri, May 13, 2016 at 9:54 AM Kam Kasravi
> > > > > >>>>> <[email protected]>
> > > > > >>>> wrote:
> > > > > >>>>>> Hi
> > > > > >>>>>>
> > > > > > >>>>>> A number of readers have made comments on this topic recently.
> > > > > > >>>>>> We have created a document that does some analysis of common
> > > > > > >>>>>> ML models and related APIs. We hope this can drive an approach
> > > > > > >>>>>> that will result in an API, a compatibility matrix and
> > > > > > >>>>>> involvement from the same groups that are implementing
> > > > > > >>>>>> transformation runners (Spark, Flink, etc).
> > > > > > >>>>>> We welcome comments here or in the document itself.
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>
> > > https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1
> > > > > >>>> yjo4
> > > > > >>>> PBECHb-xA/edit?usp=sharing
> > > > >
> > > > >
> > > >
> > >
> >
>
