> The aim of this issue is to get the initial design right for a slim, but
> powerful dataframe. I talked about narrowing the scope w.r.t. these
> features that you proposed:
>
> * "transactional operations between the dataFrame and a remote database"
> * introduction of "a generalized abstraction around a query == could
>   represent a sql/nosql query or an hdfs query"

*** Understood, I will try to narrow down the scope even further and remove
these, assuming no one else is interested. I certainly understand the
sentiment of getting something rolling with as narrow a focus as possible to
make this successful. I'll keep working on the proposal based on your
feedback and send periodic updates.
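To make that narrowed-down core concrete, here is a rough Scala sketch of the
kind of surface that would remain: the dplyr-style verbs (compute/filter/chain)
plus a bridge to RDDs, with the Query/Destination/Connection abstractions
dropped. All names below are illustrative assumptions for discussion, not
existing Mahout or Spark API.

// Illustrative sketch only -- none of these types exist in Mahout or Spark;
// they show one possible shape for the narrowed-down MAHOUT-1490 dataframe.
import org.apache.spark.rdd.RDD

// A row is just a bag of named column values, kept deliberately simple.
case class Row(values: Map[String, Any])

trait DataFrame {
  // dplyr-style verbs (the compute/filter/chain subset from the proposal)
  def filter(p: Row => Boolean): DataFrame
  def select(columns: String*): DataFrame
  def groupBy(column: String): DataFrame

  // simple slicing, standing in for the CRUD examples in the blog
  def take(n: Int): Seq[Row]

  // bridge to the distributed side, so DRM construction can happen elsewhere
  def rdd: RDD[Row]
}

// A pipeline would then chain, dplyr-style:
//   interactions.filter(r => r.values("action") == "like")
//               .select("userIdString", "itemIdString")

Everything transactional or query-language specific stays out of this core,
per the scope decision above.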
> Date: Sun, 4 May 2014 17:26:39 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Helping out on spark efforts
>
> Saikat,
>
> The aim of this issue is to get the initial design right for a slim, but
> powerful dataframe. I talked about narrowing the scope w.r.t. these
> features that you proposed:
>
> * "transactional operations between the dataFrame and a remote database"
> * introduction of "a generalized abstraction around a query == could
>   represent a sql/nosql query or an hdfs query"
>
> At this point, I would veto any patch that tries to address these things.
>
> --sebastian
>
>
> On 05/04/2014 04:31 PM, Saikat Kanjilal wrote:
> > I'll add the example associated with MAHOUT-1518 in the integration API
> > section. To be clear, per the initial feedback I tried to "narrow the
> > scope" of this effort by adding more examples around the dplyr and
> > MLTable functionality that I felt would be relevant to the concept of a
> > dataframe. Are there other things missing from the APIs I am suggesting?
> > Would love to add them in one fell swoop :))). I wouldn't necessarily say
> > that introducing a connection to a remote datasource and manipulating its
> > contents inside a dataframe is distracting; in fact dplyr is doing that
> > now, and I think it might be useful to take an RDD in the context of
> > Spark, apply a set of functions on top of it, and bring the resulting
> > subset into a dataframe.
> > Keep the feedback coming as you guys look through the API.
> >
> >> Date: Sun, 4 May 2014 13:20:03 +0200
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: Re: Helping out on spark efforts
> >>
> >> I think we should concentrate on getting the core functionality right,
> >> and test that on a few examples. We should narrow the scope of this and
> >> avoid getting distracted by thinking about adding something that
> >> generalizes NoSQL queries or so...
> >>
> >> One thing that I would like to see is an example of how to handle the
> >> input for a cooccurrence-based recommender in MAHOUT-1518.
> >>
> >> Say the raw data looks like this:
> >>
> >> timestamp1, userIdString1, itemIdString1, "view"
> >> timestamp2, userIdString2, itemIdString1, "like"
> >> ...
> >>
> >> What we want in the end is two DRMs with int keys having users as rows
> >> and items as columns. One DRM should contain all the views, the other
> >> all the likes (e.g. for every userIdString, itemIdString pair present,
> >> there is a 1 in the corresponding cell of the matrix).
> >>
> >> The result of the cooccurrence analysis is a set of int-keyed item-item
> >> matrices. We should be able to map the int keys back to the original
> >> itemIdStrings.
> >>
> >> Would love to see how that example looks in your proposed DataFrame.
> >>
> >> --sebastian
> >>
> >>
> >> On 05/04/2014 07:17 AM, Saikat Kanjilal wrote:
> >>> Me again :), added a subset of the definitions from the dplyr
> >>> functionality to the integration API section as promised; examples
> >>> include compute/filter/chain etc.
> >>> My next steps will be adding concrete examples underneath each of the
> >>> newly created integration APIs. At a high level, here are the domain
> >>> objects I am thinking will need to exist and be referenced in the
> >>> DataFrame world:
> >>>
> >>> DataFrame (self explanatory)
> >>> Query (a generalized abstraction around a query == could represent a
> >>>   sql/nosql query or an hdfs query)
> >>> RDD (important domain object that could be returned by one or more of
> >>>   our APIs)
> >>> Destination (a remote data source, could be a table, a location in
> >>>   hdfs, etc.)
> >>> Connection (a remote database connection to use to perform
> >>>   transactional operations between the dataFrame and a remote database)
> >>>
> >>> Had an additional thought: might we at some point want to operate on
> >>> matrices and mathematically perform operations with matrices and
> >>> dataFrames? Would love to hear from committers as to whether this may
> >>> be useful, and I can add in APIs around this as well.
> >>>
> >>> One thing that I've also been pondering is whether and how to handle
> >>> errors in any of these APIs. One thought I had was to introduce a
> >>> generalized error object that can be reused across all of the APIs,
> >>> maybe something that contains a message and an error code or something
> >>> similar; an alternative idea is to leverage something already existing
> >>> in the Spark bindings if possible.
> >>>
> >>> Would love for folks to take a look through the APIs as I expand them
> >>> and add more examples, and to leave comments on the JIRA ticket. Also,
> >>> since the slicing/CRUD functionality on dataFrames is pretty commonly
> >>> understood, I'm thinking I may take those examples out and put in more
> >>> examples around the APIs for dplyr and MLTable.
> >>>
> >>> Blog:
> >>> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> >>> JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490
> >>>
> >>> Regards
> >>>
> >>>
> >>>> From: [email protected]
> >>>> To: [email protected]
> >>>> Subject: RE: Helping out on spark efforts
> >>>> Date: Sat, 3 May 2014 10:09:51 -0700
> >>>>
> >>>> I've taken a stab at adding a subset of the functionality used by the
> >>>> MLTable operators to the integration API section of the blog, on top
> >>>> of the R CRUD functionality I listed earlier. Please review and let me
> >>>> know your thoughts; I will be tackling the dplyr functionality next
> >>>> and adding that in. The blog is linked below; again, please see the
> >>>> integration API section for details:
> >>>>
> >>>> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> >>>>
> >>>> Look forward to hearing comments either on the list or on the JIRA
> >>>> ticket itself:
> >>>> https://issues.apache.org/jira/browse/MAHOUT-1490
> >>>> Thanks in advance.
> >>>>
> >>>>> Date: Wed, 30 Apr 2014 17:13:52 +0200
> >>>>> From: [email protected]
> >>>>> To: [email protected]; [email protected]
> >>>>> Subject: Re: Helping out on spark efforts
> >>>>>
> >>>>> I think getting the design right for MAHOUT-1490 is tough. Dmitriy
> >>>>> suggested updating the design example to Scala code and trying to
> >>>>> work in things that fit from dplyr in R and from MLTable. I'd love to
> >>>>> see such a design doc.
> >>>>>
> >>>>> --sebastian
> >>>>>
> >>>>> On 04/30/2014 05:02 PM, Ted Dunning wrote:
> >>>>>> +1 for foundations first.
> >>>>>>
> >>>>>> There are bunches of algorithms just behind that. K-means.
> >>>>>> SGD+Adagrad regression. Autoencoders. K-sparse encoding. Lots of
> >>>>>> stuff.
> >>>>>>
> >>>>>> On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I think you should concentrate on MAHOUT-1490; that is a highly
> >>>>>>> important task that will be the foundation for a lot of stuff to be
> >>>>>>> built on top. Let's focus on getting this thing right and then move
> >>>>>>> on to other things.
> >>>>>>>
> >>>>>>> --sebastian
> >>>>>>>
> >>>>>>>
> >>>>>>> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> >>>>>>>
> >>>>>>>> Sebastian/Dmitriy,
> >>>>>>>> In looking through the current list of issues I didn't see other
> >>>>>>>> algorithms in Mahout that are talked about being ported to Spark.
> >>>>>>>> I was wondering if there's any interest/need in porting or writing
> >>>>>>>> things like LR/KMeans/SVM to use Spark; I'd like to help out in
> >>>>>>>> this area while working on 1490. Also, are we planning to port the
> >>>>>>>> distributed versions of Taste to use Spark as well at some point?
> >>>>>>>> Thanks in advance.
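For reference on Sebastian's MAHOUT-1518 question above, the plumbing the
dataframe layer would have to provide boils down to: split the raw interaction
log by action, assign dense int ids to the user and item strings, and keep the
item dictionary so the int-keyed item-item results can be mapped back to
itemIdStrings. A rough sketch of that in plain Spark Scala follows; the names
are assumptions for illustration, only standard RDD operations are used, and
the final step of wrapping the int-keyed cells into actual DRMs is
deliberately left out.

// Sketch of the MAHOUT-1518 input handling, assuming raw text lines of the
// form "timestamp,userIdString,itemIdString,action". Only standard Spark RDD
// operations are used; building the actual DRMs from the int-keyed cells is
// left to Mahout's matrix machinery and is not shown here.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Interaction(user: String, item: String, action: String)

object RecommenderInput {

  def buildRecommenderInputs(sc: SparkContext, path: String) = {
    val raw: RDD[Interaction] = sc.textFile(path).map { line =>
      val Array(_, user, item, action) = line.split(",").map(_.trim)
      Interaction(user, item, action)
    }

    // dense int ids for users and items; the item dictionary is kept so the
    // int-keyed item-item matrices can be mapped back to itemIdStrings
    val userIds: RDD[(String, Int)] =
      raw.map(_.user).distinct().zipWithIndex().mapValues(_.toInt)
    val itemIds: RDD[(String, Int)] =
      raw.map(_.item).distinct().zipWithIndex().mapValues(_.toInt)

    // for one action, the (rowId, columnId) cells that should hold a 1
    def cells(action: String): RDD[(Int, Int)] =
      raw.filter(_.action == action)
        .map(i => (i.user, i.item))
        .join(userIds)                                  // (user, (item, uid))
        .map { case (_, (item, uid)) => (item, uid) }
        .join(itemIds)                                  // (item, (uid, iid))
        .map { case (_, (uid, iid)) => (uid, iid) }

    val viewCells = cells("view")            // cells of the "views" DRM
    val likeCells = cells("like")            // cells of the "likes" DRM
    val itemDictionary = itemIds.map(_.swap) // intId -> original itemIdString

    (viewCells, likeCells, itemDictionary)
  }
}

How the proposed DataFrame would surface these steps (e.g. a groupBy on the
action column followed by a pivot into a DRM) is exactly the design question
the thread leaves open.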

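On the error-handling question in the thread (a reusable error object carrying
a message and a code), one lightweight shape, sketched here purely as an
assumption rather than anything that exists in the Spark bindings, is a small
sealed hierarchy that the APIs return through Either instead of throwing:

// Illustrative only: one possible "generalized error object" reused across
// the dataframe APIs, carrying an error code and a message.
sealed trait DataFrameError {
  def code: Int
  def message: String
}
case class ParseError(code: Int, message: String)  extends DataFrameError
case class SchemaError(code: Int, message: String) extends DataFrameError

// APIs would then return Either[DataFrameError, A] rather than throwing, e.g.
//   def fromCsv(path: String): Either[DataFrameError, DataFrame]
// so callers can pattern match and keep the code and message intact:
object DataFrameErrors {
  def describe(e: DataFrameError): String = e match {
    case ParseError(c, m)  => s"parse error ($c): $m"
    case SchemaError(c, m) => s"schema error ($c): $m"
  }
}

Whether this or an existing mechanism in the Spark bindings is the right
choice is the open question raised in the thread.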