I'll add the example associated with MAHOUT-1518 to the integration API section. To be clear, per the initial feedback I tried to "narrow the scope" of this effort by adding more examples around the dplyr and mltables functionality that I felt would be relevant to the concept of a dataframe. Are there other things missing from the APIs I am suggesting? Would love to add them in one fell swoop :). I wouldn't necessarily say that introducing a connection to a remote datasource and manipulating its contents inside a dataframe is distracting; in fact dplyr does exactly that today, and I think it might be useful to take an RDD in the context of spark, apply a set of functions on top of it, and bring the resulting subset into a dataframe. Keep the feedback coming as you look through the API; a rough sketch of the 1518 example is below.
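To make that concrete, here is a rough first sketch of the 1518 input handling, written in Scala against plain Spark rather than the final DataFrame API. The file path, field order, and all names here are hypothetical, and the dictionaries are built driver-side just to keep the sketch short:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object Mahout1518Sketch {

  // raw schema from Sebastian's mail: timestamp, userIdString, itemIdString, action
  case class Interaction(timestamp: String, userId: String, itemId: String, action: String)

  def main(args: Array[String]) {
    val sc = new SparkContext("local", "mahout-1518-sketch")

    // parse the raw log into an RDD of interactions (path is hypothetical)
    val raw: RDD[Interaction] = sc.textFile("hdfs:///path/to/interactions.csv")
      .map(_.split(",").map(_.trim))
      .map(f => Interaction(f(0), f(1), f(2), f(3).replaceAll("\"", "")))

    // int dictionaries for the string keys (driver-side here to keep the
    // sketch short; the real thing should keep these distributed)
    val userDict = raw.map(_.userId).distinct().collect().zipWithIndex.toMap
    val itemDict = raw.map(_.itemId).distinct().collect().zipWithIndex.toMap
    val userDictB = sc.broadcast(userDict)
    val itemDictB = sc.broadcast(itemDict)

    // one matrix per action: a 1.0 cell per distinct (userInt, itemInt) pair
    def cellsFor(action: String): RDD[((Int, Int), Double)] =
      raw.filter(_.action == action)
         .map(i => ((userDictB.value(i.userId), itemDictB.value(i.itemId)), 1.0))
         .distinct()

    val viewCells = cellsFor("view")  // feeds the first DRM
    val likeCells = cellsFor("like")  // feeds the second DRM

    // inverted dictionary to map the int keys of the resulting item-item
    // matrices back to the original itemIdStrings
    val itemIdForInt = itemDict.map(_.swap)

    sc.stop()
  }
}

The cellsFor filter is also what I meant by applying functions to an RDD to pull a subset into a dataframe: the DataFrame-level filter/chain operators would just wrap steps like these, and itemIdForInt is the mapping that takes the int keys of the item-item matrices back to the original itemIdStrings.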
> Date: Sun, 4 May 2014 13:20:03 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: Helping out on spark efforts
>
> I think we should concentrate on getting the core functionality right,
> and test that on a few examples. We should narrow the scope of this and
> avoid getting distracted by thinking about adding something that
> generalizes NoSQL queries or so...
>
> One thing that I would like to see is an example of how to handle the
> input for a cooccurrence-based recommender in MAHOUT-1518.
>
> Say the raw data looks like this:
>
> timestamp1, userIdString1, itemIdString1, "view"
> timestamp2, userIdString2, itemIdString1, "like"
> ...
>
> What we want in the end is two DRMs with int keys having users as rows
> and items as columns. One DRM should contain all the views, the other
> all the likes (e.g. for every userIdString, itemIdString pair present,
> there is a 1 in the corresponding cell of the matrix).
>
> The result of the cooccurrence analysis is a set of int-keyed item-item
> matrices. We should be able to map the int keys back to the original
> itemIdStrings.
>
> Would love to see how that example looks in your proposed DataFrame.
>
> --sebastian
>
> On 05/04/2014 07:17 AM, Saikat Kanjilal wrote:
> > Me again :), added a subset of the definitions from the dplyr
> > functionality to the integration API section as promised; examples
> > include compute/filter/chain etc. My next steps will be adding
> > concrete examples underneath each of the newly created integration
> > APIs. At a high level, here are the domain objects I am thinking will
> > need to exist and be referenced in the DataFrame world:
> >
> > DataFrame (self-explanatory)
> > Query (a generalized abstraction around a query; could represent a
> > sql/nosql query or an hdfs query)
> > RDD (important domain object that could be returned by one or more of
> > our APIs)
> > Destination (a remote data source; could be a table / a location in
> > hdfs, etc.)
> > Connection (a remote database connection to use to perform
> > transactional operations between the dataFrame and a remote database)
> >
> > Had an additional thought: might we at some point want to operate on
> > matrices and mathematically perform operations with matrices and
> > dataFrames? Would love to hear from committers as to whether this may
> > be useful, and I can add in APIs around this as well.
> >
> > One thing that I've also been pondering is whether or how to handle
> > errors in any of these APIs. One thought I had was to introduce a
> > generalized error object that can be reused on all of the APIs, maybe
> > something that contains a message and an error code or something
> > similar; an alternative idea is to leverage something already existing
> > in the spark bindings if possible.
> >
> > Would love for folks to take a look through the APIs as I expand them
> > and add more examples, and leave comments on the JIRA ticket. Also I'm
> > thinking that since the stuff around performing slicing/CRUD
> > functionality on dataFrames is pretty commonly understood, I may take
> > those examples out and put more examples in around the APIs for dplyr
> > and mltables.
> > Blog:
> > http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> > JIRA: https://issues.apache.org/jira/browse/MAHOUT-1490
> >
> > Regards
> >
> >> From: [email protected]
> >> To: [email protected]
> >> Subject: RE: Helping out on spark efforts
> >> Date: Sat, 3 May 2014 10:09:51 -0700
> >>
> >> I've taken a stab at adding a subset of the functionality used by
> >> MLTable operators into the blog, on top of the R CRUD functionality I
> >> listed earlier in the integration API section. Please review and let
> >> me know your thoughts; I will be tackling the dplyr functionality
> >> next and adding that in. The blog is shown below, again please see
> >> the integration API section for details:
> >>
> >> http://mlefforts.blogspot.com/2014/04/introduction-this-proposal-will.html
> >>
> >> Look forward to hearing comments either on the list or on the jira
> >> ticket itself:
> >> https://issues.apache.org/jira/browse/MAHOUT-1490
> >> Thanks in advance.
> >>
> >>> Date: Wed, 30 Apr 2014 17:13:52 +0200
> >>> From: [email protected]
> >>> To: [email protected]; [email protected]
> >>> Subject: Re: Helping out on spark efforts
> >>>
> >>> I think getting the design right for MAHOUT-1490 is tough. Dmitriy
> >>> suggested to update the design example to Scala code and try to work
> >>> in things that fit from dplyr from R and MLTable. I'd love to see
> >>> such a design doc.
> >>>
> >>> --sebastian
> >>>
> >>> On 04/30/2014 05:02 PM, Ted Dunning wrote:
> >>>> +1 for foundations first.
> >>>>
> >>>> There are bunches of algorithms just behind that. K-means.
> >>>> SGD+Adagrad regression. Autoencoders. K-sparse encoding. Lots of
> >>>> stuff.
> >>>>
> >>>> On Wed, Apr 30, 2014 at 4:52 PM, Sebastian Schelter <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> I think you should concentrate on MAHOUT-1490, that is a highly
> >>>>> important task that will be the foundation for a lot of stuff to
> >>>>> be built on top. Let's focus on getting this thing right and then
> >>>>> move on to other things.
> >>>>>
> >>>>> --sebastian
> >>>>>
> >>>>> On 04/30/2014 04:44 PM, Saikat Kanjilal wrote:
> >>>>>
> >>>>>> Sebastian/Dmitriy, in looking through the current list of issues
> >>>>>> I didn't see other algorithms in mahout that are talked about
> >>>>>> being ported to spark. I was wondering if there's any
> >>>>>> interest/need in porting or writing things like LR/KMeans/SVM to
> >>>>>> use spark; I'd like to help out in this area while working on
> >>>>>> 1490. Also, are we planning to port the distributed versions of
> >>>>>> Taste to use spark as well at some point?
> >>>>>> Thanks in advance.
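P.S. Since it's buried in the quoted thread above: here is roughly the shape I'm picturing for the domain objects and the generalized error object from my earlier mail. Every name and signature below is hypothetical and very much up for debate:

object DataFrameApiSketch {

  // a row is just a named map of cells for the purposes of this sketch
  type Row = Map[String, Any]

  // the generalized, reusable error object: a code plus a message
  case class DataFrameError(code: Int, message: String)

  // a generalized query abstraction: could render to sql, a nosql
  // query, or an hdfs query
  trait Query { def render: String }

  // a remote data source: a table, a location in hdfs, etc.
  trait Destination { def uri: String }

  // a remote connection used for transactional operations between a
  // dataframe and a remote database; errors come back as Left values
  trait Connection {
    def execute(q: Query): Either[DataFrameError, DataFrame]
    def close(): Unit
  }

  trait DataFrame {
    def filter(p: Row => Boolean): DataFrame
    def select(cols: String*): DataFrame
    def writeTo(dest: Destination, conn: Connection): Either[DataFrameError, Unit]
  }
}

Returning Either[DataFrameError, T] is just one way to thread the error object through all of the APIs without exceptions; happy to swap in whatever the spark bindings already use if something exists there.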
