I am not sure I understand the question. It would be possible to save the results of rowSimilarityJob as a data frame. But no, data frames do not support quick bidirectional indexing on demand, in the sense of bringing a full column or row to a front-end process very quickly (e.g. row id -> row vector, or column name -> column). They will support iterative filtering and mutation, just like R's dplyr package (I hope).
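To illustrate the distinction, here is a sketch using plain Scala collections as stand-ins (illustrative only; this is not a Mahout or Spark API, and the names are made up):

```scala
// Rows keyed by an id, standing in for a data frame's contents.
val rows: Seq[(Long, Vector[Double])] = Seq(
  1L -> Vector(0.0, 1.0),
  2L -> Vector(2.0, 0.0)
)

// dplyr-style iterative filtering and mutating: a pass over all rows.
val filtered = rows
  .filter { case (_, v) => v.sum > 0.5 }
  .map { case (id, v) => (id, v.map(_ * 2)) }

// Quick random access (row id -> row vector) requires an index built
// up front -- the capability the data-frame abstraction does not promise.
val byId: Map[Long, Vector[Double]] = rows.toMap
val row1 = byId(1L)
```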
In general, I'd only say that data frames are called data frames because the scope of functionality and intent is that of R data frames (there's no other source for the term "data frame"; MATLAB doesn't have those, I think), minus quick random individual-cell access, which is replaced by dplyr-style FP computations. So really, one needs to look at dplyr and R to understand the scope of this as it stands in my head. Filtering over rows (including their labels) is implied by dplyr and R. The column-selection pattern is a bit different, via %.% select() and %.% mutate() (it assumes data frames are like tables: few attributes but a lot of rows). Data frames therefore do not respond well to linalg operations, which often require a lot of orientation change.


On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:

> >> Something that concerns me about dataframes is whether they will be useful
> >> for batch operations given D’s avowed lack of interest :-)
> >
> > Pat, please don't dump everything in one pile :)
>
> Only kidding ——> :-)
>
> > Every other stage here (up to training) is usually either batching or
> > streaming. Data frames are to be used primarily in featurization and
> > vectorization, which is either streaming (in the Spark/Storm sense) or a
> > batch. These stages can benefit from the fast columnar organization of
> > data frames, allowing fast multiple passes. I can imagine some
> > methodologies in training _may_ work better off data frames too, rather
> > than off the matrices.
> >
> > Hope that clarifies.
>
> Well, that brings us to the real question: if we need to serialize a DRM
> with restored user-specified row and column IDs, do you expect some future
> dataframe will support this well? I’d guess this would be some kind of .map
> over rows.
> Like this, only getting ID values from the dataframe:
>
>   matrix.rdd.map({ case (rowID, itemVector) =>
>     var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
>     for (item <- itemVector.nonZeroes()) {
>       line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
>         item.get + outDelim3
>     }
>     line.dropRight(1)
>   })
>   .saveAsTextFile(dest)
>
> A similar question applies to deserializing or building a dataframe. I ask
> because IndexedDataset uses Guava HashBiMaps in memory on all cluster
> machines. Seems like a potential scaling issue, but then a distributed
> HashMap is called a database.
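For reference, the dictionary-translation step in the quoted snippet can be sketched standalone in plain Scala. This is a sketch only: serializeRow and the plain Maps standing in for the inverse views of the Guava HashBiMaps are hypothetical names, not actual Mahout code.

```scala
// Inverse dictionaries: internal integer ids -> external labels.
// (Stand-ins for rowIDDictionary.inverse / columnIDDictionary.inverse.)
val rowIdInverse = Map(0 -> "user-a", 1 -> "user-b")
val colIdInverse = Map(0 -> "item-x", 1 -> "item-y")
val (outDelim1, outDelim2, outDelim3) = ("\t", ":", ",")

// Translate one sparse row (id plus non-zero (column, value) pairs)
// into a delimited text line, as the quoted .map over rows does.
def serializeRow(rowId: Int, nonZeroes: Seq[(Int, Double)]): String = {
  val items = nonZeroes.map { case (col, v) =>
    colIdInverse(col) + outDelim2 + v
  }
  rowIdInverse(rowId) + outDelim1 + items.mkString(outDelim3)
}
```

Whether the dictionaries live as broadcast in-memory maps on every executor or as an external lookup store is exactly the scaling trade-off raised above.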
