> The aim of this issue is to get the initial design right for a slim, but
> powerful dataframe. I talked about narrowing the scope w.r.t. these
> features that you proposed:
>
> * "transactional operations between the dataFrame and a remote database"
> * introduction of "a generalized abstraction around a query == could
>   represent a sql/nosql query or an hdfs query"

*** Understood, I will try to narrow down the scope even further and remove
these, assuming no one else is interested. I certainly understand the
sentiment of getting something rolling with as narrow a focus as possible to
make this successful. I'll keep working on the proposal based on your
feedback and send periodic updates.
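To make that narrowed-down core concrete, here is a rough Scala sketch of the
kind of surface that would remain: the dplyr-style verbs (compute/filter/chain)
plus a bridge to RDDs, with the Query/Destination/Connection abstractions
dropped. All names below are illustrative assumptions for discussion, not
existing Mahout or Spark API.

// Illustrative sketch only -- none of these types exist in Mahout or Spark;
// they show one possible shape for the narrowed-down MAHOUT-1490 dataframe.
import org.apache.spark.rdd.RDD

// A row is just a bag of named column values, kept deliberately simple.
case class Row(values: Map[String, Any])

trait DataFrame {
  // dplyr-style verbs (the compute/filter/chain subset from the proposal)
  def filter(p: Row => Boolean): DataFrame
  def select(columns: String*): DataFrame
  def groupBy(column: String): DataFrame

  // simple slicing, standing in for the CRUD examples in the blog
  def take(n: Int): Seq[Row]

  // bridge to the distributed side, so DRM construction can happen elsewhere
  def rdd: RDD[Row]
}

// A pipeline would then chain, dplyr-style:
//   interactions.filter(r => r.values("action") == "like")
//               .select("userIdString", "itemIdString")

Everything transactional or query-language specific stays out of this core,
per the scope decision above.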
> Date: Sun, 4 May 2014 17:26:39 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Helping out on spark efforts
>
> Saikat,
>
> The aim of this issue is to get the initial design right for a slim, but
> powerful dataframe. I talked about narrowing the scope w.r.t. these
> features that you proposed:
>
> * "transactional operations between the dataFrame and a remote database"
> * introduction of "a generalized abstraction around a query == could
>   represent a sql/nosql query or an hdfs query"
>
> At this point, I would veto any patch that tries to address these things.
>
> --sebastian
>
>
> On 05/04/2014 04:31 PM, Saikat Kanjilal wrote:
> > I'll add the example associated with MAHOUT-1518 in the integration API
> > section. To be clear, per the initial feedback I tried to "narrow the
> > scope" of this effort by adding more examples around the dplyr and
> > MLTable functionality that I felt would be relevant to the concept of a
> > dataframe. Are there other things missing from the APIs I am suggesting?
> > Would love to add them in one fell swoop :))). I wouldn't necessarily say
> > that introducing a connection to a remote datasource and manipulating its
> > contents inside a dataframe is distracting; in fact dplyr is doing that
> > now, and I think it might be useful to take an RDD in the context of
> > Spark, apply a set of functions on top of it, and bring the resulting
> > subset into a dataframe.
> > Keep the feedback coming as you guys look through the API.
> >
> >> Date: Sun, 4 May 2014 13:20:03 +0200
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: Re: Helping out on spark efforts
> >>
> >> I think we should concentrate on getting the core functionality right,
> >> and test that on a few examples. We should narrow the scope of this and
> >> avoid getting distracted by thinking about adding something that
> >> generalizes NoSQL queries or so...
> >>
> >> One thing that I would like to see is an example of how to handle the
> >> input for a cooccurrence-based recommender in MAHOUT-1518.
> >>
> >> Say the raw data looks like this:
> >>
> >> timestamp1, userIdString1, itemIdString1, "view"
> >> timestamp2, userIdString2, itemIdString1, "like"
> >> ...
> >>
> >> What we want in the end is two DRMs with int keys having users as rows
> >> and items as columns. One DRM should contain all the views, the other
> >> all the likes (e.g. for every userIdString, itemIdString pair present,
> >> there is a 1 in the corresponding cell of the matrix).
> >>
> >> The result of the cooccurrence analysis is a set of int-keyed item-item
> >> matrices. We should be able to map the int keys back to the original
> >> itemIdStrings.
> >>
> >> Would love to see how that example looks in your proposed DataFrame.
> >>
> >> --sebastian
> >>
> >>
> >> On 05/04/2014 07:17 AM, Saikat Kanjilal wrote:
> >>> Me again :), added a subset of the definitions from the dplyr
> >>> functionality to the integration API section as promised; examples
> >>> include compute/filter/chain etc.
> >>> My next steps will be adding concrete examples underneath each of the
> >>> newly created integration APIs. At a high level, here are the domain
> >>> objects I am thinking will need to exist and be referenced in the
> >>> DataFrame world:
> >>>
> >>> DataFrame (self explanatory)
> >>> Query (a generalized abstraction around a query == could represent a
> >>>   sql/nosql query or an hdfs query)
> >>> RDD (important domain object that could be returned by one or more of
> >>>   our APIs)
> >>> Destination (a remote data source, could be a table, a location in
> >>>   hdfs, etc.)
> >>> Connection (a remote database connection to use to perform
> >>>   transactional operations between the dataFrame and a remote database)
> >>>
> >>> Had an additional thought: might we at some point want to operate on
> >>> matrices and mathematically perform operations with matrices and
> >>> dataFrames? Would love to hear from committers as to whether this may
> >>> be useful, and I can add in APIs around this as well.
> >>>
> >>> One thing that I've also been pondering is whether and how to handle
> >>> errors in any of these APIs. One thought I had was to introduce a
> >>> generalized error object that can be reused across all of the APIs,
> >>> maybe something that contains a message and an error code or something
> >>> similar; an alternative idea is to leverage something already existing
> >>> in the Spark bindings if possible.
> >>>
> >>> Would love for folks to take a look through the APIs as I expand them
> >>> and add more examples, and to leave comments on the JIRA ticket. Also,
> >>> since the slicing/CRUD functionality on dataFrames is pretty commonly
> >>> understood, I'm thinking I may take those examples out and put in more
> >>> examples around the APIs for dplyr and MLTable.
> >>>
> >>> Blog:
> >>> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> >>> JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490
> >>>
> >>> Regards
> >>>
> >>>
> >>>> From: [email protected]
> >>>> To: [email protected]
> >>>> Subject: RE: Helping out on spark efforts
> >>>> Date: Sat, 3 May 2014 10:09:51 -0700
> >>>>
> >>>> I've taken a stab at adding a subset of the functionality used by the
> >>>> MLTable operators to the integration API section of the blog, on top
> >>>> of the R CRUD functionality I listed earlier. Please review and let me
> >>>> know your thoughts; I will be tackling the dplyr functionality next
> >>>> and adding that in. The blog is linked below; again, please see the
> >>>> integration API section for details:
> >>>>
> >>>> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> >>>>
> >>>> Look forward to hearing comments either on the list or on the JIRA
> >>>> ticket itself:
> >>>> https://issues.apache.org/jira/browse/MAHOUT-1490
> >>>> Thanks in advance.
> >>>>
> >>>>> Date: Wed, 30 Apr 2014 17:13:52 +0200
> >>>>> From: [email protected]
> >>>>> To: [email protected]; [email protected]
> >>>>> Subject: Re: Helping out on spark efforts
> >>>>>
> >>>>> I think getting the design right for MAHOUT-1490 is tough. Dmitriy
> >>>>> suggested updating the design example to Scala code and trying to
> >>>>> work in things that fit from dplyr in R and from MLTable. I'd love to
> >>>>> see such a design doc.
> >>>>>
> >>>>> --sebastian
> >>>>>
> >>>>> On 04/30/2014 05:02 PM, Ted Dunning wrote:
> >>>>>> +1 for foundations first.
> >>>>>>
> >>>>>> There are bunches of algorithms just behind that. K-means.
> >>>>>> SGD+Adagrad regression. Autoencoders. K-sparse encoding. Lots of
> >>>>>> stuff.
> >>>>>>
> >>>>>> On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I think you should concentrate on MAHOUT-1490; that is a highly
> >>>>>>> important task that will be the foundation for a lot of stuff to be
> >>>>>>> built on top. Let's focus on getting this thing right and then move
> >>>>>>> on to other things.
> >>>>>>>
> >>>>>>> --sebastian
> >>>>>>>
> >>>>>>>
> >>>>>>> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> >>>>>>>
> >>>>>>>> Sebastian/Dmitriy,
> >>>>>>>> In looking through the current list of issues I didn't see other
> >>>>>>>> algorithms in Mahout that are talked about being ported to Spark.
> >>>>>>>> I was wondering if there's any interest/need in porting or writing
> >>>>>>>> things like LR/KMeans/SVM to use Spark; I'd like to help out in
> >>>>>>>> this area while working on 1490. Also, are we planning to port the
> >>>>>>>> distributed versions of Taste to use Spark as well at some point?
> >>>>>>>> Thanks in advance.
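For reference on Sebastian's MAHOUT-1518 question above, the plumbing the
dataframe layer would have to provide boils down to: split the raw interaction
log by action, assign dense int ids to the user and item strings, and keep the
item dictionary so the int-keyed item-item results can be mapped back to
itemIdStrings. A rough sketch of that in plain Spark Scala follows; the names
are assumptions for illustration, only standard RDD operations are used, and
the final step of wrapping the int-keyed cells into actual DRMs is
deliberately left out.

// Sketch of the MAHOUT-1518 input handling, assuming raw text lines of the
// form "timestamp,userIdString,itemIdString,action". Only standard Spark RDD
// operations are used; building the actual DRMs from the int-keyed cells is
// left to Mahout's matrix machinery and is not shown here.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Interaction(user: String, item: String, action: String)

object RecommenderInput {

  def buildRecommenderInputs(sc: SparkContext, path: String) = {
    val raw: RDD[Interaction] = sc.textFile(path).map { line =>
      val Array(_, user, item, action) = line.split(",").map(_.trim)
      Interaction(user, item, action)
    }

    // dense int ids for users and items; the item dictionary is kept so the
    // int-keyed item-item matrices can be mapped back to itemIdStrings
    val userIds: RDD[(String, Int)] =
      raw.map(_.user).distinct().zipWithIndex().mapValues(_.toInt)
    val itemIds: RDD[(String, Int)] =
      raw.map(_.item).distinct().zipWithIndex().mapValues(_.toInt)

    // for one action, the (rowId, columnId) cells that should hold a 1
    def cells(action: String): RDD[(Int, Int)] =
      raw.filter(_.action == action)
        .map(i => (i.user, i.item))
        .join(userIds)                                  // (user, (item, uid))
        .map { case (_, (item, uid)) => (item, uid) }
        .join(itemIds)                                  // (item, (uid, iid))
        .map { case (_, (uid, iid)) => (uid, iid) }

    val viewCells = cells("view")            // cells of the "views" DRM
    val likeCells = cells("like")            // cells of the "likes" DRM
    val itemDictionary = itemIds.map(_.swap) // intId -> original itemIdString

    (viewCells, likeCells, itemDictionary)
  }
}

How the proposed DataFrame would surface these steps (e.g. a groupBy on the
action column followed by a pivot into a DRM) is exactly the design question
the thread leaves open.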

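On the error-handling question in the thread (a reusable error object carrying
a message and a code), one lightweight shape, sketched here purely as an
assumption rather than anything that exists in the Spark bindings, is a small
sealed hierarchy that the APIs return through Either instead of throwing:

// Illustrative only: one possible "generalized error object" reused across
// the dataframe APIs, carrying an error code and a message.
sealed trait DataFrameError {
  def code: Int
  def message: String
}
case class ParseError(code: Int, message: String)  extends DataFrameError
case class SchemaError(code: Int, message: String) extends DataFrameError

// APIs would then return Either[DataFrameError, A] rather than throwing, e.g.
//   def fromCsv(path: String): Either[DataFrameError, DataFrame]
// so callers can pattern match and keep the code and message intact:
object DataFrameErrors {
  def describe(e: DataFrameError): String = e match {
    case ParseError(c, m)  => s"parse error ($c): $m"
    case SchemaError(c, m) => s"schema error ($c): $m"
  }
}

Whether this or an existing mechanism in the Spark bindings is the right
choice is the open question raised in the thread.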