I'll add the example associated with MAHOUT-1518 to the integration API section. To be clear, per the initial feedback I tried to "narrow the scope" of this effort by adding more examples around the dplyr and mltables functionality that I felt would be relevant to the concept of a dataframe. Are there other things missing from the APIs I am suggesting? Would love to add them in one fell swoop :). I wouldn't necessarily say that introducing a connection to a remote datasource and manipulating its contents inside a dataframe is distracting; in fact dplyr does exactly that today, and I think it might be useful to take an RDD in the context of spark, apply a set of functions on top of it, and bring the resulting subset into a dataframe. Keep the feedback coming as you look through the API; a rough sketch of the 1518 example is below.
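To make that concrete, here is a rough first sketch of the 1518 input handling, written in Scala against plain Spark rather than the final DataFrame API. The file path, field order, and all names here are hypothetical, and the dictionaries are built driver-side just to keep the sketch short:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Mahout1518Sketch {

  // raw schema from Sebastian's mail: timestamp, userIdString, itemIdString, action
  case class Interaction(timestamp: String, userId: String, itemId: String, action: String)

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "mahout-1518-sketch")

    // parse the raw log into an RDD of interactions (path is hypothetical)
    val raw: RDD[Interaction] = sc.textFile("hdfs:///path/to/interactions.csv")
      .map(_.split(",").map(_.trim))
      .map(f => Interaction(f(0), f(1), f(2), f(3).replaceAll("\"", "")))

    // int dictionaries for the string keys (driver-side here to keep the
    // sketch short; the real thing should keep these distributed)
    val userDict = raw.map(_.userId).distinct().collect().zipWithIndex.toMap
    val itemDict = raw.map(_.itemId).distinct().collect().zipWithIndex.toMap
    val userDictB = sc.broadcast(userDict)
    val itemDictB = sc.broadcast(itemDict)

    // one matrix per action: a 1.0 cell per distinct (userInt, itemInt) pair
    def cellsFor(action: String): RDD[((Int, Int), Double)] =
      raw.filter(_.action == action)
         .map(i => ((userDictB.value(i.userId), itemDictB.value(i.itemId)), 1.0))
         .distinct()

    val viewCells = cellsFor("view")  // feeds the first DRM
    val likeCells = cellsFor("like")  // feeds the second DRM

    // inverted dictionary to map the int keys of the resulting item-item
    // matrices back to the original itemIdStrings
    val itemIdForInt = itemDict.map(_.swap)

    sc.stop()
  }
}

The cellsFor filter is also what I meant by applying functions to an RDD to pull a subset into a dataframe: the DataFrame-level filter/chain operators would just wrap steps like these, and itemIdForInt is the mapping that takes the int keys of the item-item matrices back to the original itemIdStrings.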
> Date: Sun, 4 May 2014 13:20:03 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Helping out on spark efforts
>
> I think we should concentrate on getting the core functionality right,
> and test that on a few examples. We should narrow the scope of this and
> avoid getting distracted by thinking about adding something that
> generalizes NoSQL queries or so...
>
> One thing that I would like to see is an example of how to handle the
> input for a cooccurrence-based recommender in MAHOUT-1518.
>
> Say the raw data looks like this:
>
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> ...
>
> What we want in the end is two DRMs with int keys having users as rows
> and items as columns. One DRM should contain all the views, the other
> all the likes (e.g. for every userIdString, itemIdString pair present,
> there is a 1 in the corresponding cell of the matrix).
>
> The result of the cooccurrence analysis is a set of int-keyed item-item
> matrices. We should be able to map the int keys back to the original
> itemIdStrings.
>
> Would love to see how that example looks in your proposed DataFrame.
>
> --sebastian
>
> On 05/04/2014 07:17 AM, Saikat Kanjilal wrote:
> > Me again :), added a subset of the definitions from the dplyr
> > functionality to the integration API section as promised; examples
> > include compute/filter/chain etc. My next steps will be adding
> > concrete examples underneath each of the newly created integration
> > APIs. At a high level, here are the domain objects I am thinking will
> > need to exist and be referenced in the DataFrame world:
> >
> > DataFrame (self-explanatory)
> > Query (a generalized abstraction around a query; could represent a
> > sql/nosql query or an hdfs query)
> > RDD (important domain object that could be returned by one or more of
> > our APIs)
> > Destination (a remote data source; could be a table / a location in
> > hdfs, etc.)
> > Connection (a remote database connection to use to perform
> > transactional operations between the dataFrame and a remote database)
> >
> > Had an additional thought: might we at some point want to operate on
> > matrices and mathematically perform operations with matrices and
> > dataFrames? Would love to hear from committers as to whether this may
> > be useful, and I can add in APIs around this as well.
> >
> > One thing that I've also been pondering is whether or how to handle
> > errors in any of these APIs. One thought I had was to introduce a
> > generalized error object that can be reused on all of the APIs, maybe
> > something that contains a message and an error code or something
> > similar; an alternative idea is to leverage something already existing
> > in the spark bindings if possible.
> >
> > Would love for folks to take a look through the APIs as I expand them
> > and add more examples, and leave comments on the JIRA ticket. Also I'm
> > thinking that since the stuff around performing slicing/CRUD
> > functionality on dataFrames is pretty commonly understood, I may take
> > those examples out and put more examples in around the APIs for dplyr
> > and mltables.
> > Blog:
> > http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> > JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490
> >
> > Regards
> >
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: RE: Helping out on spark efforts
> >> Date: Sat, 3 May 2014 10:09:51 -0700
> >>
> >> I've taken a stab at adding a subset of the functionality used by
> >> MLTable operators into the blog, on top of the R CRUD functionality I
> >> listed earlier in the integration API section. Please review and let
> >> me know your thoughts; I will be tackling the dplyr functionality
> >> next and adding that in. The blog is shown below, again please see
> >> the integration API section for details:
> >>
> >> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> >>
> >> Look forward to hearing comments either on the list or on the jira
> >> ticket itself:
> >> https://issues.apache.org/jira/browse/MAHOUT-1490
> >> Thanks in advance.
> >>
> >>> Date: Wed, 30 Apr 2014 17:13:52 +0200
> >>> From: [email protected]
> >>> To: [email protected]; [email protected]
> >>> Subject: Re: Helping out on spark efforts
> >>>
> >>> I think getting the design right for MAHOUT-1490 is tough. Dmitriy
> >>> suggested to update the design example to Scala code and try to work
> >>> in things that fit from dplyr from R and MLTable. I'd love to see
> >>> such a design doc.
> >>>
> >>> --sebastian
> >>>
> >>> On 04/30/2014 05:02 PM, Ted Dunning wrote:
> >>>> +1 for foundations first.
> >>>>
> >>>> There are bunches of algorithms just behind that. K-means.
> >>>> SGD+Adagrad regression. Autoencoders. K-sparse encoding. Lots of
> >>>> stuff.
> >>>>
> >>>> On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> I think you should concentrate on MAHOUT-1490, that is a highly
> >>>>> important task that will be the foundation for a lot of stuff to
> >>>>> be built on top. Let's focus on getting this thing right and then
> >>>>> move on to other things.
> >>>>>
> >>>>> --sebastian
> >>>>>
> >>>>> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> >>>>>
> >>>>>> Sebastian/Dmitriy, in looking through the current list of issues
> >>>>>> I didn't see other algorithms in mahout that are talked about
> >>>>>> being ported to spark. I was wondering if there's any
> >>>>>> interest/need in porting or writing things like LR/KMeans/SVM to
> >>>>>> use spark; I'd like to help out in this area while working on
> >>>>>> 1490. Also, are we planning to port the distributed versions of
> >>>>>> Taste to use spark as well at some point?
> >>>>>> Thanks in advance.
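P.S. Since it's buried in the quoted thread above: here is roughly the shape I'm picturing for the domain objects and the generalized error object from my earlier mail. Every name and signature below is hypothetical and very much up for debate:

object DataFrameApiSketch {

  // a row is just a named map of cells for the purposes of this sketch
  type Row = Map[String, Any]

  // the generalized, reusable error object: a code plus a message
  case class DataFrameError(code: Int, message: String)

  // a generalized query abstraction: could render to sql, a nosql
  // query, or an hdfs query
  trait Query { def render: String }

  // a remote data source: a table, a location in hdfs, etc.
  trait Destination { def uri: String }

  // a remote connection used for transactional operations between a
  // dataframe and a remote database; errors come back as Left values
  trait Connection {
    def execute(q: Query): Either[DataFrameError, DataFrame]
    def close(): Unit
  }

  trait DataFrame {
    def filter(p: Row => Boolean): DataFrame
    def select(cols: String*): DataFrame
    def writeTo(dest: Destination, conn: Connection): Either[DataFrameError, Unit]
  }
}

Returning Either[DataFrameError, T] is just one way to thread the error object through all of the APIs without exceptions; happy to swap in whatever the spark bindings already use if something exists there.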
