Just jumping in here real quick, not trying to derail the conversation. I have
a lot of catching up to do on the status of the dataframe implementation, the
DSL, and Pat's ItemSimilarity implementation so that I can better understand
what's going on. I'm going to try to take a look at this stuff over the
weekend.
I think I see how my thinking about this has been wrong in terms of
"translating a dataframe to a DRM". Also, I think that NB was a bad example
because it's kind of a special-case classifier.
I guess what I'm wondering, in terms of laying out traits for classifiers, is:
are we going to try to provide a Weka- or R-like pluggable interface, and if
so, how would that look? I'm speaking specifically about batch-trained,
supervised classification algorithms at this point (which I'm not sure anybody
else is interested in going forward, but I am).
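To make the question concrete, here is a rough sketch of the kind of traits I
have in mind. All of these names are hypothetical, nothing like this exists in
Mahout today, and it assumes the DSL's DrmLike type for distributed matrices:

import org.apache.mahout.math.drm.DrmLike

// A fitted model that can score previously unseen rows.
trait ClassifierModel[K] {
  def classify(data: DrmLike[K]): DrmLike[K]
}

// A batch-trained, supervised learner: fit(features, labels) yields a model.
trait BatchClassifierLearner[K] {
  def fit(features: DrmLike[K], labels: DrmLike[K]): ClassifierModel[K]
}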
For example, I'm doing some work right now that involves comparing results
from some off-the-shelf algorithms, working in R with a small, dense dataset;
nothing really novel. Once my dataframe is all set up, switching classifiers
looks basically like this:
library(randomForest)
library(rpart)

# Train a random forest
res.rf <- randomForest(formula = formula, data = d_train, nodesize = 1,
                       classwt = CLASSWT, sampsize = nrow(d_train),
                       proximity = FALSE, na.action = na.roughfix, ntree = 1000)

# Train an rpart tree (in its own result variable, so it doesn't overwrite res.rf)
res.rpart <- rpart(formula = formula, data = d_train, method = "class",
                   control = rpart.control(minsplit = 2, cp = 0))
I know that this is not that useful to the typical Mahout user right now. But
with a shell/script environment, a linear algebra DSL with a distributed back
end, and a bunch of algorithms in the library, I think it will be, and it will
draw in new users.
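With traits like the sketch above, the same experiment in a Mahout shell
session might look something like this. Neither learner exists today; this is
only what I'd hope the usage could feel like:

// features, labels: DrmLike[Int] prepared once by the featurization stage

// Train a random forest (hypothetical learner)
val rfModel = new RandomForestLearner(ntree = 1000, nodesize = 1)
  .fit(features, labels)

// Switching algorithms is a one-line change (also hypothetical)
val treeModel = new DecisionTreeLearner(minsplit = 2)
  .fit(features, labels)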
The reason I brought up the full NB pipeline is to ensure that if we are to
lay out traits for new (classification) algorithms, it is done in the most
robust way possible, and in a way that eases development from prototyping in
the shell to deployment.
> Date: Fri, 30 May 2014 14:54:20 -0700
> Subject: Re: Sketching out scala traits and 1.0 API
> From: [email protected]
> To: [email protected]
>
> Frankly, except for columnar organization and some math summarization
> functionality, I don't see much difference between these data frames and
> e.g. scalding tuple-based manipulations.
>
>
> On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > I am not sure I understand the question. It would be possible to save the
> > results of rowSimilarityJob as a data frame. No, data frames do not support
> > quick bidirectional indexing on demand, in the sense that if we wanted to
> > bring a full column or row to a front-end process very quickly (e.g. row id
> > -> row vector, or columnName -> column). They will support iterative
> > filtering and mutating just like in the dplyr package of R. (I hope.)
> >
> > In general, I'd only say that data frames are called data frames because
> > the scope of functionality and intent is that of R data frames (there's no
> > other source for the term "data frame"; Matlab doesn't have them, I
> > think), minus quick random individual-cell access, which is replaced by
> > dplyr-style FP computations.
> >
> > So really, I'd say one needs to look at dplyr and R to understand the
> > scope of this, at this point, in my head.
> >
> > Filtering over rows (including their labels) is implied by dplyr and R.
> > Column selection is a bit different, via %.% select() and %.% mutate()
> > (it assumes data frames are like tables: few attributes but a lot of
> > rows). Data frames therefore do not respond well to linalg operations
> > that often require a lot of orientation change.
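> > To sketch what I mean, in purely hypothetical Scala (nothing like this
> > exists yet; types and method names are illustrative only):
> >
> > // Toy dataframe with dplyr-style filter/select/mutate. Illustration only.
> > case class Row(fields: Map[String, Any])
> >
> > class DataFrame(val rows: Seq[Row]) {
> >   def filter(p: Row => Boolean) = new DataFrame(rows.filter(p))
> >   def select(cols: String*) = new DataFrame(
> >     rows.map(r => Row(r.fields.filter { case (k, _) => cols.contains(k) })))
> >   def mutate(col: String)(f: Row => Any) = new DataFrame(
> >     rows.map(r => Row(r.fields + (col -> f(r)))))
> > }
> >
> > // dplyr-like chaining: filter rows, select columns, derive a column
> > val df = new DataFrame(Seq(Row(Map("userId" -> "u1", "clicks" -> 3))))
> > val result = df
> >   .filter(r => r.fields("clicks").asInstanceOf[Int] > 0)
> >   .select("userId", "clicks")
> >   .mutate("logClicks")(r => math.log(r.fields("clicks").asInstanceOf[Int]))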
> >
> >
> >
> > On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:
> >
> >>
> >> >> Something that concerns me about dataframes is whether they will be
> >> >> useful for batch operations given D’s avowed lack of interest :-)
> >> >>
> >> >
> >> > Pat, please don't dump everything in one pile :)
> >> >
> >>
> >> Only kidding ——> :-)
> >>
> >> >
> >> > Every other stage here (up to training) is usually either batch or
> >> > streaming. Data frames are to be used primarily in featurization and
> >> > vectorization, which is either streaming (in the Spark/Storm sense) or
> >> > batch. These stages can benefit from the fast columnar organization of
> >> > data frames, allowing fast multiple passes. I can imagine some
> >> > methodologies in training _may_ work better off data frames too, rather
> >> > than off the matrices.
> >> >
> >> > Hope that clarifies.
> >> >
> >>
> >> Well, that brings us to the real question: if we need to serialize a DRM
> >> with restored user-specified row and column IDs, do you expect some future
> >> dataframe will support this well? I’d guess this would be some kind of .map
> >> over rows. Like this, only getting ID values from the dataframe:
> >>
> >> matrix.rdd.map({ case (rowID, itemVector) =>
> >>   var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
> >>   for (item <- itemVector.nonZeroes()) {
> >>     line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
> >>       item.get + outDelim3
> >>   }
> >>   line.dropRight(1)
> >> })
> >> .saveAsTextFile(dest)
> >>
> >> A similar question applies to deserializing or building a dataframe. I
> >> ask because IndexedDataset uses Guava HashBiMaps in memory on all
> >> cluster machines. Seems like a potential scaling issue, but then a
> >> distributed HashMap is called a database.
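> >>
> >> For context, the bidirectional dictionary pattern in question is roughly
> >> this (Guava's HashBiMap; names and values here are illustrative, not
> >> IndexedDataset's actual fields):
> >>
> >> import com.google.common.collect.HashBiMap
> >>
> >> // One in-memory map per ID dictionary, replicated to every machine
> >> // that needs reverse lookups, hence the scaling concern.
> >> val rowIDDictionary = HashBiMap.create[String, Integer]()
> >> rowIDDictionary.put("user-42", 0) // external ID -> ordinal row index
> >>
> >> val idx = rowIDDictionary.get("user-42") // forward: ID -> index
> >> val id = rowIDDictionary.inverse.get(0)  // reverse: index -> ID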