>> Something that concerns me about dataframes is whether they will be useful
>> for batch operations given D’s avowed lack of interest :-)
>> 
> 
> Pat, please don't dump everything in one pile :)
> 

Only kidding ——> :-)

> 
> Every other stage here (up to training) is usually either batching or
> streaming. Data frames are to be used primarily in featurization and
> vectorization, which is either streaming (in the Spark/Storm sense) or a
> batch. These stages can benefit from the fast columnar organization of
> data frames, allowing fast multiple passes. I can imagine some
> methodologies in training _may_ work better off data frames too, rather
> than off the matrices.
> 
> hope that clarifies.
> 

Well, that brings us to the real question: if we need to serialize a DRM with 
restored, user-specified row and column IDs, do you expect some future 
dataframe will support this well? I’d guess this would be some kind of .map 
over rows. Something like this, only getting the ID values from the dataframe:

      matrix.rdd.map({ case (rowID, itemVector) =>
        // look up the original external row ID string for this row key
        var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
        for (item <- itemVector.nonZeroes()) {
          // look up the original external column ID for each non-zero element
          line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
            item.get + outDelim3
        }
        // drop the trailing delimiter
        line.dropRight(1)
      })
        .saveAsTextFile(dest)
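
For concreteness, here is a rough sketch of what the dataframe-backed version 
might look like, using a Spark SQL DataFrame as a stand-in for whatever Mahout 
dataframe eventually exists. It assumes a "long" layout with one row per 
non-zero element and columns rowID (String), itemID (String), value (Double); 
df and the delimiter/dest names are placeholders, not a real API:

      // hypothetical: df has one row per non-zero with the external IDs
      // already attached as strings, so no in-memory dictionary lookups
      df.rdd
        .map(r => (r.getString(0), r.getString(1) + outDelim2 + r.getDouble(2)))
        .groupByKey()
        .map { case (rowID, items) =>
          rowID + outDelim1 + items.mkString(outDelim3)
        }
        .saveAsTextFile(dest)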

A similar question applies to deserializing or building a dataframe. I ask 
because IndexedDataset uses Guava HashBiMaps held in memory on all cluster 
machines. That seems like a potential scaling issue, but then again a 
distributed HashMap is called a database.
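
To make the scaling concern concrete, this is roughly the pattern as I 
understand it (a sketch; rowIDStrings and sc are just placeholders): the whole 
ID dictionary is a Guava HashBiMap built on the driver and broadcast, so every 
executor holds the full map in memory just to do the Int <-> String lookups:

      import com.google.common.collect.HashBiMap

      // hypothetical: rowIDStrings is the full list of external row IDs
      val rowIDDictionary: HashBiMap[String, Integer] = HashBiMap.create()
      rowIDStrings.zipWithIndex.foreach { case (id, i) =>
        rowIDDictionary.put(id, i)   // String -> Int
      }

      // every task gets the entire map; memory use grows with the ID space
      val bcastRowIDs = sc.broadcast(rowIDDictionary)

      matrix.rdd.map { case (rowID, itemVector) =>
        // reverse lookup: Int key back to the original String ID
        bcastRowIDs.value.inverse.get(rowID)
      }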
