I am not sure I understand the question. It would be possible to save the results of rowSimilarityJob as a data frame. But no, data frames do not support quick bidirectional indexing on demand, in the sense of bringing a full column or row to a front-end process very quickly (e.g. row id -> row vector, or column name -> column). They will support iterative filtering and mutation, just like R's dplyr package (I hope).
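To illustrate the distinction, here is a sketch using plain Scala collections as stand-ins (illustrative only; this is not a Mahout or Spark API, and the names are made up):

```scala
// Rows keyed by an id, standing in for a data frame's contents.
val rows: Seq[(Long, Vector[Double])] = Seq(
  1L -> Vector(0.0, 1.0),
  2L -> Vector(2.0, 0.0)
)

// dplyr-style iterative filtering and mutating: a pass over all rows.
val filtered = rows
  .filter { case (_, v) => v.sum > 0.5 }
  .map { case (id, v) => (id, v.map(_ * 2)) }

// Quick random access (row id -> row vector) requires an index built
// up front -- the capability the data-frame abstraction does not promise.
val byId: Map[Long, Vector[Double]] = rows.toMap
val row1 = byId(1L)
```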
In general, I'd only say that data frames are called data frames because the scope of functionality and intent is that of R data frames (there's no other source for the term "data frame"; MATLAB doesn't have those, I think), minus quick random individual-cell access, which is replaced by dplyr-style FP computations. So really, one needs to look at dplyr and R to understand the scope of this as it stands in my head. Filtering over rows (including their labels) is implied by dplyr and R. The column-selection pattern is a bit different, via %.% select() and %.% mutate() (it assumes data frames are like tables: few attributes but a lot of rows). Data frames therefore do not respond well to linalg operations, which often require a lot of orientation change.


On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:

> >> Something that concerns me about dataframes is whether they will be useful
> >> for batch operations given D’s avowed lack of interest :-)
> >
> > Pat, please don't dump everything in one pile :)
>
> Only kidding ——> :-)
>
> > Every other stage here (up to training) is usually either batching or
> > streaming. Data frames are to be used primarily in featurization and
> > vectorization, which is either streaming (in the Spark/Storm sense) or a
> > batch. These stages can benefit from the fast columnar organization of
> > data frames, allowing fast multiple passes. I can imagine some
> > methodologies in training _may_ work better off data frames too, rather
> > than off the matrices.
> >
> > Hope that clarifies.
>
> Well, that brings us to the real question: if we need to serialize a DRM
> with restored user-specified row and column IDs, do you expect some future
> dataframe will support this well? I’d guess this would be some kind of .map
> over rows.
> Like this, only getting ID values from the dataframe:
>
>   matrix.rdd.map({ case (rowID, itemVector) =>
>     var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
>     for (item <- itemVector.nonZeroes()) {
>       line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
>         item.get + outDelim3
>     }
>     line.dropRight(1)
>   })
>   .saveAsTextFile(dest)
>
> A similar question applies to deserializing or building a dataframe. I ask
> because IndexedDataset uses Guava HashBiMaps in memory on all cluster
> machines. Seems like a potential scaling issue, but then a distributed
> HashMap is called a database.
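For reference, the dictionary-translation step in the quoted snippet can be sketched standalone in plain Scala. This is a sketch only: serializeRow and the plain Maps standing in for the inverse views of the Guava HashBiMaps are hypothetical names, not actual Mahout code.

```scala
// Inverse dictionaries: internal integer ids -> external labels.
// (Stand-ins for rowIDDictionary.inverse / columnIDDictionary.inverse.)
val rowIdInverse = Map(0 -> "user-a", 1 -> "user-b")
val colIdInverse = Map(0 -> "item-x", 1 -> "item-y")
val (outDelim1, outDelim2, outDelim3) = ("\t", ":", ",")

// Translate one sparse row (id plus non-zero (column, value) pairs)
// into a delimited text line, as the quoted .map over rows does.
def serializeRow(rowId: Int, nonZeroes: Seq[(Int, Double)]): String = {
  val items = nonZeroes.map { case (col, v) =>
    colIdInverse(col) + outDelim2 + v
  }
  rowIdInverse(rowId) + outDelim1 + items.mkString(outDelim3)
}
```

Whether the dictionaries live as broadcast in-memory maps on every executor or as an external lookup store is exactly the scaling trade-off raised above.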
