Frankly, except for columnar organization and some math summarization functionality, I don't see much difference between these data frames and, e.g., Scalding tuple-based manipulations.
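To make that columnar distinction concrete, here is a minimal, self-contained sketch in plain Scala (no Mahout, data-frame, or Scalding API involved; all names are illustrative): a per-column summary scans a single attribute array, while the tuple-based style walks whole rows to reach one field.

    object ColumnarVsTuples extends App {
      // Tuple-based ("Scalding-style"): each record is a whole row.
      val rows: Seq[(String, Double)] = Seq(("u1", 0.9), ("u2", 0.4), ("u3", 0.7))
      // Summarizing one field still walks every tuple.
      val meanFromRows = rows.map(_._2).sum / rows.size

      // Columnar ("data-frame-style"): each attribute is stored contiguously.
      val userIds = Array("u1", "u2", "u3")
      val scores  = Array(0.9, 0.4, 0.7)
      // The same summary touches only the one column it needs.
      val meanFromColumn = scores.sum / scores.length

      println(f"$meanFromRows%.4f $meanFromColumn%.4f")  // both print 0.6667
    }

The two means are of course identical; the difference is what has to be read to compute them, which is the advantage claimed for data frames above.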
On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I am not sure I understand the question. It would be possible to save
> results of rowSimilarityJob as a data frame. No, data frames do not
> support quick bidirectional indexing on demand, in the sense of bringing
> a full column or row to a front-end process very quickly (e.g. row id ->
> row vector, or columnName -> column). They will support iterative
> filtering and mutating, just like in the dplyr package of R. (I hope.)
>
> In general, I'd only say that data frames are called data frames because
> the scope of functionality and intent is that of R data frames (there's
> no other source for the term "data frame"; Matlab doesn't have those, I
> think), minus quick random individual cell access, which is replaced by
> dplyr-style FP computations.
>
> So really, I'd say one needs to look at dplyr and R to understand the
> scope of this, at this point in my head.
>
> Filtering over rows (including their labels) is implied by dplyr and R.
> The column selection pattern is a bit different, via %.% select() and
> %.% mutate() (it assumes data frames are like tables: few attributes but
> a lot of rows). Data frames therefore do not respond well to linalg
> operations that often require a lot of orientation change.
>
>
> On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:
>
>> >> Something that concerns me about dataframes is whether they will be
>> >> useful for batch operations given D's avowed lack of interest :-)
>> >
>> > Pat, please don't dump everything in one pile :)
>>
>> Only kidding ——> :-)
>>
>> > Every other stage here (up to training) is usually either batching or
>> > streaming. Data frames are to be used primarily in featurization and
>> > vectorization, which is either streaming (in the Spark/Storm sense) or
>> > a batch. These stages can benefit from the fast columnar organization
>> > of data frames, allowing fast multiple passes. I can imagine some
>> > methodologies in training _may_ work better off data frames too,
>> > rather than off the matrices.
>> >
>> > Hope that clarifies.
>>
>> Well, that brings us to the real question: if we need to serialize a DRM
>> with restored user-specified row and column IDs, do you expect some
>> future dataframe will support this well? I'd guess this would be some
>> kind of .map over rows. Like this, only getting ID values from the
>> dataframe:
>>
>> matrix.rdd.map({ case (rowID, itemVector) =>
>>   var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
>>   for (item <- itemVector.nonZeroes()) {
>>     line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
>>       item.get + outDelim3
>>   }
>>   line.dropRight(1)
>> })
>> .saveAsTextFile(dest)
>>
>> A similar question applies to deserializing or building a dataframe. I
>> ask because IndexedDataset uses Guava HashBiMaps in memory on all
>> cluster machines. Seems like a potential scaling issue, but then a
>> distributed HashMap is called a database.
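On the HashBiMap scaling point: the usual Spark-side mitigation is to ship the two dictionaries as broadcast variables, so each executor holds one read-only copy instead of one copy per task closure. Below is a minimal sketch along the lines of Pat's snippet, assuming an RDD[(Int, Vector)] of Mahout vectors and Guava HashBiMap dictionaries; the delimiter names are carried over from the quoted code, and none of this is the actual IndexedDataset API.

    // Era-appropriate Java/Scala interop for Mahout's java.lang.Iterable.
    import scala.collection.JavaConversions._
    import com.google.common.collect.HashBiMap
    import org.apache.mahout.math.Vector
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def saveWithIds(sc: SparkContext,
                    matrix: RDD[(Int, Vector)],
                    rowIDDictionary: HashBiMap[String, Integer],
                    columnIDDictionary: HashBiMap[String, Integer],
                    dest: String,
                    outDelim1: String, outDelim2: String,
                    outDelim3: String): Unit = {
      // One serialized copy of each dictionary per executor, not per task.
      // HashBiMap is java.io.Serializable, so it can be broadcast directly.
      val rowIds = sc.broadcast(rowIDDictionary)
      val colIds = sc.broadcast(columnIDDictionary)

      matrix.map { case (rowID, itemVector) =>
        val cells = itemVector.nonZeroes().map { e =>
          // inverse.get: Int index -> original String column ID
          colIds.value.inverse.get(e.index) + outDelim2 + e.get
        }.mkString(outDelim3)  // mkString replaces the manual dropRight(1)
        rowIds.value.inverse.get(rowID) + outDelim1 + cells
      }.saveAsTextFile(dest)
    }

Broadcasting only bounds the number of copies; every worker still holds the full dictionaries, so it doesn't answer the underlying question. Past some dictionary size, the "distributed HashMap is called a database" quip is the real answer: an external key-value store rather than in-memory BiMaps.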
