Just jumping in here real quick, not trying to derail the conversation. I have
a lot of catching up to do on the status of the dataframe implementation, the
DSL, and Pat's ItemSimilarity implementation so that I can better understand
what's going on. I'm going to try to take a look at this stuff over the
weekend.
I think I see how my thinking about this has been wrong in terms of
"translating a dataframe to a DRM". Also, I think that NB was a bad example
because it's kind of a special-case classifier.
I guess what I'm wondering, in terms of laying out traits for classifiers, is:
are we going to try to provide a Weka- or R-like pluggable interface, and if
so, how would that look? I'm speaking specifically about batch-trained,
supervised classification algorithms at this point (which I'm not sure anybody
else is interested in going forward, but I am).
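To make the question concrete, here is a rough sketch of the kind of traits I
have in mind. All of these names are hypothetical, nothing like this exists in
Mahout today, and it assumes the DSL's DrmLike type for distributed matrices:

import org.apache.mahout.math.drm.DrmLike

// A fitted model that can score previously unseen rows.
trait ClassifierModel[K] {
  def classify(data: DrmLike[K]): DrmLike[K]
}

// A batch-trained, supervised learner: fit(features, labels) yields a model.
trait BatchClassifierLearner[K] {
  def fit(features: DrmLike[K], labels: DrmLike[K]): ClassifierModel[K]
}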
For example, I'm doing some work right now that involves comparing results
from some off-the-shelf algorithms, working in R with a small, dense dataset;
nothing really novel. Once my dataframe is all set up, switching classifiers
looks basically like this:
library(randomForest)
library(rpart)

# Train a random forest
res.rf <- randomForest(formula = formula, data = d_train, nodesize = 1,
                       classwt = CLASSWT, sampsize = nrow(d_train),
                       proximity = FALSE, na.action = na.roughfix, ntree = 1000)

# Train an rpart tree (in its own result variable, so it doesn't overwrite res.rf)
res.rpart <- rpart(formula = formula, data = d_train, method = "class",
                   control = rpart.control(minsplit = 2, cp = 0))
I know that this is not that useful to the typical Mahout user right now. But
with a shell/script environment, a linear algebra DSL with a distributed back
end, and a bunch of algorithms in the library, I think it will be, and it will
draw in new users.
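With traits like the sketch above, the same experiment in a Mahout shell
session might look something like this. Neither learner exists today; this is
only what I'd hope the usage could feel like:

// features, labels: DrmLike[Int] prepared once by the featurization stage

// Train a random forest (hypothetical learner)
val rfModel = new RandomForestLearner(ntree = 1000, nodesize = 1)
  .fit(features, labels)

// Switching algorithms is a one-line change (also hypothetical)
val treeModel = new DecisionTreeLearner(minsplit = 2)
  .fit(features, labels)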
The reason I brought up the full NB pipeline is to ensure that if we are to
lay out traits for new (classification) algorithms, it is done in the most
robust way possible, and in a way that eases development from prototyping in
the shell to deployment.
> Date: Fri, 30 May 2014 14:54:20 -0700
> Subject: Re: Sketching out scala traits and 1.0 API
> From: [email protected]
> To: [email protected]
>
> Frankly, except for columnar organization and some math summarization
> functionality, I don't see much difference between these data frames and
> e.g. scalding tuple-based manipulations.
>
>
> On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > I am not sure I understand the question. It would be possible to save the
> > results of rowSimilarityJob as a data frame. No, data frames do not support
> > quick bidirectional indexing on demand, in the sense that if we wanted to
> > bring a full column or row to a front-end process very quickly (e.g. row id
> > -> row vector, or columnName -> column). They will support iterative
> > filtering and mutating just like in the dplyr package of R. (I hope.)
> >
> > In general, I'd only say that data frames are called data frames because
> > the scope of functionality and intent is that of R data frames (there's no
> > other source for the term "data frame"; Matlab doesn't have them, I
> > think), minus quick random individual-cell access, which is replaced by
> > dplyr-style FP computations.
> >
> > So really, I'd say one needs to look at dplyr and R to understand the
> > scope of this, at this point, in my head.
> >
> > Filtering over rows (including their labels) is implied by dplyr and R.
> > Column selection is a bit different, via %.% select() and %.% mutate()
> > (it assumes data frames are like tables: few attributes but a lot of
> > rows). Data frames therefore do not respond well to linalg operations
> > that often require a lot of orientation change.
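> > To sketch what I mean, in purely hypothetical Scala (nothing like this
> > exists yet; types and method names are illustrative only):
> >
> > // Toy dataframe with dplyr-style filter/select/mutate. Illustration only.
> > case class Row(fields: Map[String, Any])
> >
> > class DataFrame(val rows: Seq[Row]) {
> >   def filter(p: Row => Boolean) = new DataFrame(rows.filter(p))
> >   def select(cols: String*) = new DataFrame(
> >     rows.map(r => Row(r.fields.filter { case (k, _) => cols.contains(k) })))
> >   def mutate(col: String)(f: Row => Any) = new DataFrame(
> >     rows.map(r => Row(r.fields + (col -> f(r)))))
> > }
> >
> > // dplyr-like chaining: filter rows, select columns, derive a column
> > val df = new DataFrame(Seq(Row(Map("userId" -> "u1", "clicks" -> 3))))
> > val result = df
> >   .filter(r => r.fields("clicks").asInstanceOf[Int] > 0)
> >   .select("userId", "clicks")
> >   .mutate("logClicks")(r => math.log(r.fields("clicks").asInstanceOf[Int]))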
> >
> >
> >
> > On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:
> >
> >>
> >> >> Something that concerns me about dataframes is whether they will be
> >> >> useful for batch operations given D’s avowed lack of interest :-)
> >> >>
> >> >
> >> > Pat, please don't dump everything in one pile :)
> >> >
> >>
> >> Only kidding ——> :-)
> >>
> >> >
> >> > Every other stage here (up to training) is usually either batch or
> >> > streaming. Data frames are to be used primarily in featurization and
> >> > vectorization, which is either streaming (in the Spark/Storm sense) or
> >> > batch. These stages can benefit from the fast columnar organization of
> >> > data frames, allowing fast multiple passes. I can imagine some
> >> > methodologies in training _may_ work better off data frames too, rather
> >> > than off the matrices.
> >> >
> >> > Hope that clarifies.
> >> >
> >>
> >> Well, that brings us to the real question: if we need to serialize a DRM
> >> with restored user-specified row and column IDs, do you expect some future
> >> dataframe will support this well? I’d guess this would be some kind of .map
> >> over rows. Like this, only getting ID values from the dataframe:
> >>
> >> matrix.rdd.map({ case (rowID, itemVector) =>
> >>   var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
> >>   for (item <- itemVector.nonZeroes()) {
> >>     line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
> >>       item.get + outDelim3
> >>   }
> >>   line.dropRight(1)
> >> })
> >> .saveAsTextFile(dest)
> >>
> >> A similar question applies to deserializing or building a dataframe. I
> >> ask because IndexedDataset uses Guava HashBiMaps in memory on all
> >> cluster machines. Seems like a potential scaling issue, but then a
> >> distributed HashMap is called a database.
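> >>
> >> For context, the bidirectional dictionary pattern in question is roughly
> >> this (Guava's HashBiMap; names and values here are illustrative, not
> >> IndexedDataset's actual fields):
> >>
> >> import com.google.common.collect.HashBiMap
> >>
> >> // One in-memory map per ID dictionary, replicated to every machine
> >> // that needs reverse lookups, hence the scaling concern.
> >> val rowIDDictionary = HashBiMap.create[String, Integer]()
> >> rowIDDictionary.put("user-42", 0) // external ID -> ordinal row index
> >>
> >> val idx = rowIDDictionary.get("user-42") // forward: ID -> index
> >> val id = rowIDDictionary.inverse.get(0)  // reverse: index -> ID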