>> IMO we should wait on core DSL functionality if it’s not there, but if
>> you are doing something that is external, then full-blown dataframes may
>> not block you or even help you. DRMs are pretty mature. You’ll have to
>> decide that based on your own needs.
>>
>> Also wanted to say I agree completely; not trying to jump the gun on
>> this.
From: [email protected] To: [email protected] Subject: RE: Sketching out scala traits and 1.0 API Date: Fri, 30 May 2014 18:04:33 -0400 Just jumping in here real quick.. not trying to derail the conversation... I have a lot of catching up to do on the status of the Dataframe implementation, the DSL, Pat's ItemSimiliarity implementation so that i can better understand what's going on and. I'm going to try to take a look at this stuff over the weekend I think i see how my thinking of this has been wrong in terms of "Translating a Dataframe to a DRM". Also I think that NB was a bad example because it's kind of a special case classifier. I guess from my end what im wondering of in terms of laying out traits for classifiers is are we going to try to provide a kind of weka or R-like pluggable interface? and if so, how would that look? I guess I'm speaking specifically about about batch trained, supervised, classification algorithms at this point. (Which im not sure going forward if anybody is interested in, but I am). For example, I'm doing some work right that involves comparing results from some off the shelf algorithms. Working in R, with a small dense dataset- nothing really novel. Once my dataframe is all set up, switching classifiers looks like basically like this: # Train a random forest res.rf <- randomForest( formula=formula, data=d_train, nodesize=1, classwt=CLASSWT, sampsize=length(d_train[,1]), proximity=F, na.action=na.roughfix, ntree=1000) # Train an rPartTree res.rf <-rpart( formula=formula, data=d_train, method="class", control=rpart.control(minsplit=2, cp=0)) I know that this is not that useful to the typical Mahout user right now. But with a shell/script, a Linear Algebra DSL with a distributed back end and a bunch of algorithms in the library, i think that this will be, or will draw in new users. The reason I brought up the full NB pipeline is to ensure that if we are to lay out traits for new (classification) algorithms, it is done so in a the most robust way possible, and in a way that eases development from prototyping in the shell to deployment. > Date: Fri, 30 May 2014 14:54:20 -0700 > Subject: Re: Sketching out scala traits and 1.0 API > From: [email protected] > To: [email protected] > > Frankly, except for columnar organization and sine math summarization > functionality, i don't see much difference between these data frames and > e.g. scalding tuple-based manipulations. > > > On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov <[email protected]> wrote: > > > I am not sure i understand the question. It would possible to save results > > of rowSimilarityJob as a data frame. No, data frames do not support quick > > bidirectional indexing on demand in a sense if we wanted to bring full > > column or row to front-end process very quickly (e.g. row id -> row vector, > > or columnName -> column). They will support iterative filtering and > > mutating just like in dplyr package of R. (I hope). > > > > In general, i'd only say that data frames are called data frames because > > the scope of functionality and intent is that of R data frames (there's no > > other source for the term of "data frame", i.e. matlab doesn't have those i > > think) minus quick random individual cell access which is replaced by > > dplyr-style FP computations. > > > > So really i'd say one needs to look at dplyr and R to understand the scope > > of this at this point in my head. > > > > Filtering over rows (including there labels) is implied by dplyr and R. 
> > The column-selection pattern is a bit different, via %.% select() and
> > %.% mutate() (it assumes data frames are like tables: few attributes
> > but a lot of rows). Data frames therefore do not respond well to linalg
> > operations, which often require a lot of orientation change.
> >
> >
> >
> > On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:
> >
> >>
> >> >> Something that concerns me about dataframes is whether they will be
> >> >> useful for batch operations given D’s avowed lack of interest :-)
> >> >>
> >> >
> >> > Pat, please don't dump everything in one pile :)
> >> >
> >>
> >> Only kidding ——> :-)
> >>
> >> >
> >> > Every other stage here (up to training) is usually either batching
> >> > or streaming. Data frames are to be used primarily in featurization
> >> > and vectorization, which is either streaming (in the Spark/Storm
> >> > sense) or a batch. These stages can benefit from the fast columnar
> >> > organization of data frames, allowing fast multiple passes. I can
> >> > imagine some methodologies in training _may_ work better off data
> >> > frames too, rather than off the matrices.
> >> >
> >> > Hope that clarifies.
> >> >
> >>
> >> Well, that brings us to the real question: if we need to serialize a
> >> DRM with restored user-specified row and column IDs, do you expect some
> >> future dataframe will support this well? I’d guess this would be some
> >> kind of .map over rows. Like this, only getting ID values from the
> >> dataframe:
> >>
> >> matrix.rdd.map({ case (rowID, itemVector) =>
> >>   var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
> >>   for (item <- itemVector.nonZeroes()) {
> >>     line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
> >>       item.get + outDelim3
> >>   }
> >>   line.dropRight(1)
> >> })
> >>   .saveAsTextFile(dest)
> >>
> >> A similar question applies to deserializing or building a dataframe. I
> >> ask because IndexedDataset uses Guava HashBiMaps in memory on all
> >> cluster machines. That seems like a potential scaling issue, but then
> >> again a distributed HashMap is called a database.
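
To make the pluggable-interface question above concrete, here is a minimal
Scala sketch of what a batch-trained, supervised classifier trait could
look like, mirroring the R workflow where only the trainer changes while
the data and the call shape stay fixed. Every name in it (ClassifierModel,
ClassifierTrainer, the params map) is hypothetical, not existing Mahout
API, and plain Array[Double] stands in for whatever vector type the DSL
would actually use:

    // Hypothetical sketch only: none of these names exist in Mahout.
    // A trained model classifies a single feature vector.
    trait ClassifierModel {
      def classify(features: Array[Double]): Int // predicted class label
    }

    // A batch trainer builds a model from labeled examples, playing the
    // role that randomForest() or rpart() play in the R example above.
    trait ClassifierTrainer {
      // data: (label, feature vector) pairs; params: algorithm-specific
      // knobs, analogous to R's control arguments (ntree, minsplit, ...).
      def train(data: Seq[(Int, Array[Double])],
                params: Map[String, Any] = Map.empty): ClassifierModel
    }

    // Swapping algorithms is then a one-line change, as in the R snippet:
    def fit(trainer: ClassifierTrainer,
            dTrain: Seq[(Int, Array[Double])]): ClassifierModel =
      trainer.train(dTrain, Map("ntree" -> 1000))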

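As a toy illustration of the dplyr-style scope described in the quoted
thread (row-wise filter / select / mutate over a frame, rather than random
individual cell access), here is a small Scala sketch that runs in a plain
Scala shell. The Row and Frame types are invented for illustration and are
not a real Mahout data frame API:

    // Invented-for-illustration types; not an existing Mahout API.
    case class Row(cells: Map[String, Any])

    class Frame(val rows: Seq[Row]) {
      // keep only rows matching a predicate (dplyr: filter)
      def filter(p: Row => Boolean): Frame = new Frame(rows.filter(p))
      // keep only the named columns (dplyr: select)
      def select(cols: String*): Frame =
        new Frame(rows.map(r =>
          Row(r.cells.filter { case (k, _) => cols.contains(k) })))
      // add a derived column (dplyr: mutate)
      def mutate(col: String)(f: Row => Any): Frame =
        new Frame(rows.map(r => Row(r.cells + (col -> f(r)))))
    }

    // Usage mirrors dplyr's df %.% filter(...) %.% select(...) %.% mutate(...):
    val df = new Frame(Seq(
      Row(Map("user" -> "u1", "score" -> 3.0)),
      Row(Map("user" -> "u2", "score" -> 1.0))))
    val out = df
      .filter(_.cells("score").asInstanceOf[Double] > 2.0)
      .select("user", "score")
      .mutate("scaled")(r => r.cells("score").asInstanceOf[Double] / 5.0)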