Me again :) As promised, I've added a subset of the definitions from the dplyr functionality to the integration API section; examples include compute, filter, chain, etc. My next step will be adding concrete examples underneath each of the newly created integration APIs. At a high level, here are the domain objects I am thinking will need to exist and be referenced in the DataFrame world (a rough Scala sketch follows at the end of this mail):

- DataFrame (self-explanatory)
- Query (a generalized abstraction around a query; could represent a SQL/NoSQL query or an HDFS query)
- RDD (an important domain object that could be returned by one or more of our APIs)
- Destination (a remote data sink; could be a table, a location in HDFS, etc.)
- Connection (a remote database connection used to perform transactional operations between the DataFrame and a remote database)

An additional thought: might we at some point want to operate on matrices and mathematically combine matrices with DataFrames? I would love to hear from committers as to whether this may be useful, and I can add APIs around this as well.

One thing I've also been pondering is whether, and how, to handle errors in any of these APIs. One thought I had was to introduce a generalized error object that can be reused across all of the APIs, perhaps something that contains a message and an error code; an alternative idea is to leverage something already existing in the Spark bindings, if possible.

I would love for folks to take a look through the APIs as I expand them and add more examples, and to leave comments on the JIRA ticket. Also, since the slicing/CRUD functionality around DataFrames is pretty commonly understood, I'm thinking I may take those examples out and put in more examples around the APIs for dplyr and MLTable.

Blog: http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490
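To make the domain objects and the error idea concrete, here is a rough Scala sketch of one way they might hang together. Every name in it (FrameError, SqlQuery, TableDestination, writeTo, and so on) is a placeholder I made up for discussion, not a proposed final API:

// Hypothetical generalized error object: an error code plus a message,
// reusable across all of the integration APIs.
final case class FrameError(code: Int, message: String)

// Placeholder row type; the real design would carry a proper schema.
final case class Row(values: Map[String, Any])

// Query: could represent a SQL/NoSQL query or an HDFS scan.
sealed trait Query
final case class SqlQuery(sql: String) extends Query
final case class HdfsQuery(path: String) extends Query

// Destination: a remote data sink such as a table or an HDFS location.
sealed trait Destination
final case class TableDestination(table: String) extends Destination
final case class HdfsDestination(path: String) extends Destination

// Connection: a remote database connection used for transactional
// operations between a DataFrame and a remote database.
trait Connection {
  def execute(q: Query): Either[FrameError, DataFrame]
  def close(): Unit
}

// DataFrame with a few dplyr-style operators; each returns Either so
// errors surface uniformly through FrameError. An RDD-returning
// variant (e.g. toRDD) would slot in here as well.
trait DataFrame {
  def filter(p: Row => Boolean): Either[FrameError, DataFrame]
  def compute(newCol: String)(f: Row => Any): Either[FrameError, DataFrame]
  def writeTo(dest: Destination, conn: Connection): Either[FrameError, Unit]
}

One nice property of this shape: with every operator returning Either, dplyr-style chains read naturally in a for-comprehension, and the first FrameError short-circuits the rest of the pipeline:

// Chained usage: the first FrameError aborts the remaining steps.
// (The .right projections are needed on Scala versions prior to 2.12,
// where Either is not yet right-biased.)
def pipeline(df: DataFrame): Either[FrameError, DataFrame] =
  for {
    adults  <- df.filter(_.values.get("age").exists {
                 case n: Int => n >= 18
                 case _      => false
               }).right
    flagged <- adults.compute("isSenior") { r =>
                 r.values.get("age") match {
                   case Some(n: Int) => n >= 65
                   case _            => false
                 }
               }.right
  } yield flagged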
Regards

> From: [email protected]
> To: [email protected]
> Subject: RE: Helping out on spark efforts
> Date: Sat, 3 May 2014 10:09:51 -0700
>
> I've taken a stab at adding a subset of the functionality used by the
> MLTable operators, on top of the R CRUD functionality I listed earlier,
> into the integration API section of the blog. Please review and let me
> know your thoughts; I will be tackling the dplyr functionality next and
> adding that in. The blog is linked below; again, please see the
> integration API section for details:
>
> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
>
> I look forward to hearing comments, either on the list or on the JIRA
> ticket itself:
> https://issues.apache.org/jira/browse/MAHOUT-1490
> Thanks in advance.
>
> > Date: Wed, 30 Apr 2014 17:13:52 +0200
> > From: [email protected]
> > To: [email protected]; [email protected]
> > Subject: Re: Helping out on spark efforts
> >
> > I think getting the design right for MAHOUT-1490 is tough. Dmitriy
> > suggested to update the design example to Scala code and try to work in
> > things that fit from dplyr in R and from MLTable. I'd love to see such a
> > design doc.
> >
> > --sebastian
> >
> > On 04/30/2014 05:02 PM, Ted Dunning wrote:
> > > +1 for foundations first.
> > >
> > > There are bunches of algorithms just behind that. K-means. SGD+Adagrad
> > > regression. Autoencoders. K-sparse encoding. Lots of stuff.
> > >
> > >
> > > On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter <[email protected]>
> > > wrote:
> > >
> > >> I think you should concentrate on MAHOUT-1490, that is a highly
> > >> important task that will be the foundation for a lot of stuff to be
> > >> built on top. Let's focus on getting this thing right and then move
> > >> on to other things.
> > >>
> > >> --sebastian
> > >>
> > >>
> > >> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> > >>
> > >>> Sebastian/Dmitriy, in looking through the current list of issues I
> > >>> didn't see other algorithms in Mahout that are talked about being
> > >>> ported to Spark. I was wondering if there's any interest/need in
> > >>> porting or writing things like LR/KMeans/SVM to use Spark; I'd like
> > >>> to help out in this area while working on 1490. Also, are we planning
> > >>> to port the distributed versions of Taste to use Spark as well at
> > >>> some point?
> > >>> Thanks in advance.
> > >>>
> > >>
> > >
