Me again :) As promised, I've added a subset of the definitions from the dplyr functionality to the integration API section; examples include compute, filter, chain, etc. My next step will be adding concrete examples underneath each of the newly created integration APIs. At a high level, here are the domain objects I am thinking will need to exist and be referenced in the DataFrame world (a rough Scala sketch follows at the end of this mail):

- DataFrame (self-explanatory)
- Query (a generalized abstraction around a query; could represent a SQL/NoSQL query or an HDFS query)
- RDD (an important domain object that could be returned by one or more of our APIs)
- Destination (a remote data sink; could be a table, a location in HDFS, etc.)
- Connection (a remote database connection used to perform transactional operations between the DataFrame and a remote database)

An additional thought: might we at some point want to operate on matrices and mathematically combine matrices with DataFrames? I would love to hear from committers as to whether this may be useful, and I can add APIs around this as well.

One thing I've also been pondering is whether, and how, to handle errors in any of these APIs. One thought I had was to introduce a generalized error object that can be reused across all of the APIs, perhaps something that contains a message and an error code; an alternative idea is to leverage something already existing in the Spark bindings, if possible.

I would love for folks to take a look through the APIs as I expand them and add more examples, and to leave comments on the JIRA ticket. Also, since the slicing/CRUD functionality around DataFrames is pretty commonly understood, I'm thinking I may take those examples out and put in more examples around the APIs for dplyr and MLTable.

Blog: http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490
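To make the domain objects and the error idea concrete, here is a rough Scala sketch of one way they might hang together. Every name in it (FrameError, SqlQuery, TableDestination, writeTo, and so on) is a placeholder I made up for discussion, not a proposed final API:

// Hypothetical generalized error object: an error code plus a message,
// reusable across all of the integration APIs.
final case class FrameError(code: Int, message: String)

// Placeholder row type; the real design would carry a proper schema.
final case class Row(values: Map[String, Any])

// Query: could represent a SQL/NoSQL query or an HDFS scan.
sealed trait Query
final case class SqlQuery(sql: String) extends Query
final case class HdfsQuery(path: String) extends Query

// Destination: a remote data sink such as a table or an HDFS location.
sealed trait Destination
final case class TableDestination(table: String) extends Destination
final case class HdfsDestination(path: String) extends Destination

// Connection: a remote database connection used for transactional
// operations between a DataFrame and a remote database.
trait Connection {
  def execute(q: Query): Either[FrameError, DataFrame]
  def close(): Unit
}

// DataFrame with a few dplyr-style operators; each returns Either so
// errors surface uniformly through FrameError. An RDD-returning
// variant (e.g. toRDD) would slot in here as well.
trait DataFrame {
  def filter(p: Row => Boolean): Either[FrameError, DataFrame]
  def compute(newCol: String)(f: Row => Any): Either[FrameError, DataFrame]
  def writeTo(dest: Destination, conn: Connection): Either[FrameError, Unit]
}

One nice property of this shape: with every operator returning Either, dplyr-style chains read naturally in a for-comprehension, and the first FrameError short-circuits the rest of the pipeline:

// Chained usage: the first FrameError aborts the remaining steps.
// (The .right projections are needed on Scala versions prior to 2.12,
// where Either is not yet right-biased.)
def pipeline(df: DataFrame): Either[FrameError, DataFrame] =
  for {
    adults  <- df.filter(_.values.get("age").exists {
                 case n: Int => n >= 18
                 case _      => false
               }).right
    flagged <- adults.compute("isSenior") { r =>
                 r.values.get("age") match {
                   case Some(n: Int) => n >= 65
                   case _            => false
                 }
               }.right
  } yield flagged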
Regards

> From: [email protected]
> To: [email protected]
> Subject: RE: Helping out on spark efforts
> Date: Sat, 3 May 2014 10:09:51 -0700
>
> I've taken a stab at adding a subset of the functionality used by the
> MLTable operators, on top of the R CRUD functionality I listed earlier,
> into the integration API section of the blog. Please review and let me
> know your thoughts; I will be tackling the dplyr functionality next and
> adding that in. The blog is linked below; again, please see the
> integration API section for details:
>
> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
>
> I look forward to hearing comments, either on the list or on the JIRA
> ticket itself:
> https://issues.apache.org/jira/browse/MAHOUT-1490
> Thanks in advance.
>
> > Date: Wed, 30 Apr 2014 17:13:52 +0200
> > From: [email protected]
> > To: [email protected]; [email protected]
> > Subject: Re: Helping out on spark efforts
> >
> > I think getting the design right for MAHOUT-1490 is tough. Dmitriy
> > suggested to update the design example to Scala code and try to work in
> > things that fit from dplyr in R and from MLTable. I'd love to see such a
> > design doc.
> >
> > --sebastian
> >
> > On 04/30/2014 05:02 PM, Ted Dunning wrote:
> > > +1 for foundations first.
> > >
> > > There are bunches of algorithms just behind that. K-means. SGD+Adagrad
> > > regression. Autoencoders. K-sparse encoding. Lots of stuff.
> > >
> > >
> > > On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter <[email protected]>
> > > wrote:
> > >
> > >> I think you should concentrate on MAHOUT-1490, that is a highly
> > >> important task that will be the foundation for a lot of stuff to be
> > >> built on top. Let's focus on getting this thing right and then move
> > >> on to other things.
> > >>
> > >> --sebastian
> > >>
> > >>
> > >> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> > >>
> > >>> Sebastian/Dmitriy, in looking through the current list of issues I
> > >>> didn't see other algorithms in Mahout that are talked about being
> > >>> ported to Spark. I was wondering if there's any interest/need in
> > >>> porting or writing things like LR/KMeans/SVM to use Spark; I'd like
> > >>> to help out in this area while working on 1490. Also, are we planning
> > >>> to port the distributed versions of Taste to use Spark as well at
> > >>> some point?
> > >>> Thanks in advance.
> > >>>
> > >>
> > >
