+1

2016-05-27 17:18 GMT+02:00 Kam Kasravi <[email protected]>:
> Hi Beam ML community
>
> Based on comments from a number of you and some discussion we've had here,
> we thought we would suggest the following direction:
>
> - Begin with primitive operations common and critical to almost all ML
>   algorithms. These primitive operators would include:
>   - linear algebra operations - borrowing from established libraries
>     like Samsara.
>   - iterative processing - also central to ML, where replay of datasets
>     is easy to specify, as well as thresholds or halting criteria. This
>     coordinates well with FlinkML's current approach and base APIs.
>   - possibly new broadcast mechanisms not normally available within BSP
>     frameworks such as Beam.
> - Normalize dataset and parameters that differ across current major ML
>   libraries that offer the same types of models.
> - Favor a native ML implementation rather than a thin wrapper in order
>   to provide consistency across runners. This will also allow Beam ML
>   to maximize quality and minimize consistency issues across runners.
> - Support for languages also supported in the Beam runners (Java,
>   Python, Scala).
> - Implement several common ML algorithms using the low-level primitives
>   on one or more available runners to validate both the low-level APIs
>   and possible improvements on the high-level API.
>
> Scikit-learn pipelines and existing portable libraries like xgboost4j will
> be valuable to model the high-level APIs - for example, how xgboost4j
> currently integrates with Spark and Flink.
>
> We welcome further comments and further refinements in approach.
>
> On Sun, May 22, 2016 at 7:43 PM, Henry Saputra <[email protected]>
> wrote:
>
> > @Frances:
> >
> > that would probably be the way to go IF we decide to have ML in Beam.
> >
> > @Simone:
> >
> > I would definitely love to see Beam introduce ML model APIs to abstract and
> > unify all "dataflow" runner frameworks, such as with Flink ML and Spark
> > ML.
> >
> > However, as you mentioned before, the target audience would be
> > distributed or ML engineers.
> > But I could see we would then have to make some out-of-the-box ML
> > algorithms (model training and fine-tuning) in addition to testing the
> > model and APIs.
> >
> > The expectation would be that these models are "production" ready, and in
> > most cases they will be used by Data Scientists via some configuration,
> > since they won't, and most can't, use the Java language.
> >
> > I would love to see instead more integration with existing ML frameworks
> > like XGBoost [1], Mahout Samsara [2], or DL4J [3] for ML APIs and models in
> > Beam.
> >
> > Thoughts and comments are definitely welcome =)
> >
> > - Henry
> >
> > [1] https://github.com/dmlc/xgboost
> > [2] https://mahout.apache.org/users/environment/out-of-core-reference.html
> > [3] http://deeplearning4j.org
> > <http://deeplearning4j.org/image-data-pipeline.html#record>
> >
> >
> > On Sat, May 21, 2016 at 2:01 AM, Simone Robutti <[email protected]> wrote:
> >
> > > I think these APIs won't be used by Data Scientists (R, Python) but by
> > > Machine Learning Engineers (Scala, Java or C++ in different environments),
> > > and as an ML Engineer it makes a lot of sense to me to have such an API if
> > > I'm using Beam. It would make a lot more sense to implement algorithms
> > > directly in Beam but that will come in the future, I hope.
> > >
> > > 2016-05-21 0:35 GMT+02:00 Henry Saputra <[email protected]>:
> > >
> > > > I am a bit concerned about adding ML model APIs to Beam because of the
> > > > fluctuating nature of the ML landscape and also because, in reality,
> > > > most data scientists tend to use Python and R for most of their work
> > > > with existing model definitions.
> > > >
> > > > Even though you could say something like Spark ML is popular, it is
> > > > merely because it involves Apache Spark rather than because of the
> > > > quality of the ML module itself.
> > > >
> > > > The pipeline and most of the tooling are inspired by scikit-learn, and
> > > > hence it relies on familiarity with that library to attract developers.
> > > >
> > > > My question is whether fully end-to-end ML APIs are needed as part of
> > > > the core Beam APIs.
> > > >
> > > > - Henry
> > > >
> > > > On Thu, May 19, 2016 at 5:46 AM, Jianfeng Qian <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > I am quite interested in this proposal.
> > > > > It is great to consider a lot of machine learning projects.
> > > > > Currently, most algorithms in Spark MLlib are batch processing, while
> > > > > oryx2 and streamDM focus on real-time machine learning.
> > > > > And Flink works with the SAMOA team to integrate stream mining
> > > > > algorithms, too.
> > > > > So I wonder whether it is possible to design a flexible SDK which
> > > > > allows users to call different third-party packages or their own
> > > > > algorithms?
> > > > >
> > > > > Best,
> > > > > Jianfeng
> > > > >
> > > > > On 2016-05-17 22:01, Suneel Marthi wrote:
> > > > > > Thanks Simone for pointing this out.
> > > > > >
> > > > > > On the Apache Mahout project we have distributed linear algebra with
> > > > > > R-like semantics that can be executed on Spark/Flink/H2O.
> > > > > >
> > > > > > @Kam: the document you point out is old and outdated; the most
> > > > > > up-to-date reference to the Samsara API is the book 'Apache Mahout:
> > > > > > Beyond MapReduce'. (shameless marketing here on behalf of fellow
> > > > > > committers :) )
> > > > > >
> > > > > > We added the Flink DataSet API in the recent Mahout 0.12.0 release
> > > > > > (April 11, 2016), and it was called out in my talk at ApacheBigData
> > > > > > in Vancouver last week.
> > > > > >
> > > > > > The Mahout community would definitely be interested in being involved
> > > > > > with this and sharing notes.
> > > > > >
> > > > > > IMHO, the focus should first be on building a good linalg foundation
> > > > > > before embarking on building algos and pipelines. Adding @dlyubimov to
> > > > > > this.
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------- Forwarded message ----------
> > > > > > From: Simone Robutti <[email protected]>
> > > > > > Date: Tue, May 17, 2016 at 9:48 AM
> > > > > > Subject: Fwd: machine learning API, common models
> > > > > > To: Suneel Marthi <[email protected]>
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---------- Forwarded message ----------
> > > > > > From: Kavulya, Soila P <[email protected]>
> > > > > > Date: 2016-05-17 1:53 GMT+02:00
> > > > > > Subject: RE: machine learning API, common models
> > > > > > To: "[email protected]" <[email protected]>
> > > > > >
> > > > > >
> > > > > > Thanks Simone,
> > > > > >
> > > > > > You have raised a valid concern about how different frameworks will
> > > > > > have different implementations and parameter semantics for the same
> > > > > > algorithm. I agree that it is important to keep this in mind.
> > > > > > Hopefully, through this exercise, we will identify a good set of
> > > > > > common ML abstractions across different frameworks.
> > > > > >
> > > > > > Feel free to edit the document. We had limited the first pass of the
> > > > > > comparison matrix to the machine learning pipeline APIs, but we can
> > > > > > extend it to include other ML building blocks like linear algebra
> > > > > > operations, and APIs for optimizers like gradient descent.
> > > > > >
> > > > > > Soila
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Kam Kasravi [mailto:[email protected]]
> > > > > > Sent: Monday, May 16, 2016 8:22 AM
> > > > > > To: [email protected]
> > > > > > Subject: Re: machine learning API, common models
> > > > > >
> > > > > > Thanks Simone - yes I had read your concerns on dev and I think
> > > > > > they're well founded.
> > > > > > Thanks for the Samsara reference - I've been looking at the
> > > > > > Spark/Scala bindings:
> > > > > > http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf
> > > > > >
> > > > > > I think we should expand the document to include linear algebraic ops,
> > > > > > or at least pay due diligence to them. If you're doing anything on the
> > > > > > Flink side in this regard, let us know or feel free to suggest
> > > > > > edits/updates to the document.
> > > > > >
> > > > > > Thanks
> > > > > > Kam
> > > > > >
> > > > > > On Mon, May 16, 2016 at 6:05 AM, Simone Robutti <[email protected]> wrote:
> > > > > >
> > > > > >> Hello,
> > > > > >>
> > > > > >> I'm Simone and I just began contributing to Flink ML (actually on the
> > > > > >> distributed linalg part). I already expressed my concerns about the
> > > > > >> idea of a high-level API relying on specific frameworks'
> > > > > >> implementations: different implementations produce different results
> > > > > >> and may vary in quality. Also the semantics of parameters may change
> > > > > >> from one implementation to the other. This could hinder portability
> > > > > >> and transparency. I believe these problems could be handled by paying
> > > > > >> due attention to the details of every single implementation, but I
> > > > > >> invite you not to underestimate these problems.
> > > > > >>
> > > > > >> On the other hand the API in itself looks good to me.
> > > > > >> From my side, I hope to fill some of the gaps in Flink you underlined
> > > > > >> in the comparison matrix.
> > > > > >> Talking about matrices, proper matrices this time, I believe it would
> > > > > >> be useful to include in this API support for linear algebra
> > > > > >> operations. Something similar is already present in Mahout's Samsara
> > > > > >> and it looks really good, but clearly a similar implementation on Beam
> > > > > >> would be way more interesting and powerful.
> > > > > >>
> > > > > >> My 2 cents,
> > > > > >>
> > > > > >> Simone
> > > > > >>
> > > > > >>
> > > > > >> 2016-05-14 4:53 GMT+02:00 Kavulya, Soila P <[email protected]>:
> > > > > >>
> > > > > >>> Hi Tyler,
> > > > > >>>
> > > > > >>> Thank you so much for your feedback. I agree that starting with the
> > > > > >>> high-level API is a good direction. We are interested in Python
> > > > > >>> because it is the language that our data scientists are most
> > > > > >>> familiar with. I think starting with Java would be the best
> > > > > >>> approach, because the Python API can be a thin wrapper for the Java
> > > > > >>> API.
> > > > > >>>
> > > > > >>> In Spark, the Scala, Java and Python APIs are identical. Flink does
> > > > > >>> not have a Python API for ML pipelines at present.
> > > > > >>>
> > > > > >>> Could you point me to the updated runner API?
> > > > > >>>
> > > > > >>> Soila
> > > > > >>>
> > > > > >>> -----Original Message-----
> > > > > >>> From: Tyler Akidau [mailto:[email protected]]
> > > > > >>> Sent: Friday, May 13, 2016 6:34 PM
> > > > > >>> To: [email protected]
> > > > > >>> Subject: Re: machine learning API, common models
> > > > > >>>
> > > > > >>> Hi Kam & Soila,
> > > > > >>>
> > > > > >>> Thanks a lot for writing this up.
> > > > > >>> I ran the doc past some of the
> > > > > >>> folks who've been doing ML work here at Google, and they were
> > > > > >>> generally happy with the distillation of common methods in the doc.
> > > > > >>> I'd be curious to hear what folks on the Flink- and Spark- runner
> > > > > >>> sides think.
> > > > > >>>
> > > > > >>> To me, this seems like a good direction for a high-level API.
> > > > > >>> Presumably, once a high-level API is in place, we could begin
> > > > > >>> looking at what it would take to add lower-level ML algorithm
> > > > > >>> support (e.g. iterative) to the Beam Model. Is this essentially
> > > > > >>> what you're thinking?
> > > > > >>>
> > > > > >>> Some more specific questions/comments:
> > > > > >>>
> > > > > >>> - Presumably you'd want to tackle this in Java first, since that's
> > > > > >>>   the only language we currently support? Given that half of your
> > > > > >>>   examples are in Python, I'm also assuming Python will be
> > > > > >>>   interesting once it's available.
> > > > > >>>
> > > > > >>> - Along those lines, what languages are represented in the
> > > > > >>>   capability matrix? E.g. is Spark ML support as detailed there
> > > > > >>>   identical across Java/Scala and Python?
> > > > > >>>
> > > > > >>> - Have you thought about how this would tie in at the runner level,
> > > > > >>>   particularly given the updated Runner API changes that are
> > > > > >>>   coming? I'm assuming they'd be provided as composite transforms
> > > > > >>>   that (for now) would have no default implementation, given the
> > > > > >>>   lack of low-level primitives for ML algorithms, but am curious
> > > > > >>>   what your thoughts are there.
> > > > > >>>
> > > > > >>> - I still don't fully understand how incremental updates due to
> > > > > >>>   model drift would tie in at the API level.
> > > > > >>>   There's a comment thread in the doc still open tracking this, so
> > > > > >>>   no need to comment here additionally. Just pointing it out as one
> > > > > >>>   of the things that stands out to me as potentially having
> > > > > >>>   API-level impacts and that doesn't seem 100% fleshed out in the
> > > > > >>>   doc yet (though that admittedly may just be my limited
> > > > > >>>   understanding at this point :-).
> > > > > >>>
> > > > > >>> -Tyler
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> On Fri, May 13, 2016 at 10:48 AM Kam Kasravi <[email protected]> wrote:
> > > > > >>>> Hi Tyler - my bad. Comments should be enabled now.
> > > > > >>>>
> > > > > >>>> On Fri, May 13, 2016 at 10:45 AM, Tyler Akidau <[email protected]>
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Thanks a lot, Kam. Can you please enable comment access on the
> > > > > >>>>> doc? I seem to have view access only.
> > > > > >>>>>
> > > > > >>>>> -Tyler
> > > > > >>>>>
> > > > > >>>>> On Fri, May 13, 2016 at 9:54 AM Kam Kasravi <[email protected]> wrote:
> > > > > >>>>>> Hi
> > > > > >>>>>>
> > > > > >>>>>> A number of readers have made comments on this topic recently.
> > > > > >>>>>> We have created a document that does some analysis of common
> > > > > >>>>>> ML models and related APIs. We hope this can drive an approach
> > > > > >>>>>> that will result in an API, a compatibility matrix and
> > > > > >>>>>> involvement from the same groups that are implementing
> > > > > >>>>>> transformation runners (Spark, Flink, etc.).
> > > > > >>>>>> We welcome comments here or in the document itself.
> > > > > >>>>>>
> > > > > >>>>>> https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA/edit?usp=sharing
