Frankly, except for the columnar organization and simple math-summarization
functionality, I don't see much difference between these data frames and,
e.g., Scalding's tuple-based manipulations.
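To make the comparison concrete, here is a minimal, hypothetical sketch (plain Scala collections; no Scalding or data-frame API involved, all names made up) of the same summarization done tuple-wise versus column-wise:

```scala
// Hypothetical example data: (key, value) records.
val rows = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0))

// Tuple-based style (Scalding-like): operate record by record.
val sumsByKey = rows.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// Columnar style: peel off a whole column, then summarize it.
val valueColumn = rows.map(_._2)
val total = valueColumn.sum
val mean  = total / valueColumn.size
```

The end results overlap; the difference is mainly in data layout and access pattern, not expressiveness.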


On Fri, May 30, 2014 at 2:50 PM, Dmitriy Lyubimov <[email protected]> wrote:

> I am not sure I understand the question. It would be possible to save the
> results of rowSimilarityJob as a data frame. No, data frames will not
> support quick bidirectional indexing on demand, in the sense of bringing a
> full column or row to a front-end process very quickly (e.g. row id -> row
> vector, or columnName -> column). They will support iterative filtering and
> mutating, just like R's dplyr package (I hope).
>
> In general, I'd only say that these data frames are called data frames
> because the scope of functionality and intent is that of R data frames
> (there's no other source for the term "data frame"; MATLAB doesn't have
> them, I think), minus quick random access to individual cells, which is
> replaced by dplyr-style FP computations.
>
> So really, I'd say one needs to look at dplyr and R to understand the scope
> of this as it currently stands in my head.
>
> Filtering over rows (including their labels) is implied by dplyr and R. The
> column selection pattern is a bit different, via %.% select() and %.%
> mutate() (it assumes data frames are table-like: few attributes but a lot
> of rows). Data frames therefore do not respond well to linalg operations,
> which often require a lot of orientation change.
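Roughly, the filter/select/mutate pattern described above could look like this in plain Scala collections (a hypothetical stand-in, not any actual data-frame API; all names are made up):

```scala
// Hypothetical row type: a few attributes, many rows.
case class Row(user: String, score: Double)

val frame = Seq(Row("u1", 1.0), Row("u2", 3.0), Row("u3", 2.0))

val result = frame
  .filter(_.score > 1.5)             // dplyr-like filter over rows
  .map(r => (r.user, r.score * 10))  // mutate-like derived column
```

Note the row-at-a-time orientation: transposing or reorienting such a table, as linalg often requires, is not a natural operation in this style.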
>
>
>
> On Fri, May 30, 2014 at 2:36 PM, Pat Ferrel <[email protected]> wrote:
>
>>
>> >> Something that concerns me about dataframes is whether they will be
>> >> useful for batch operations given D’s avowed lack of interest :-)
>> >>
>> >
>> > Pat, please don't dump everything in one  pile :)
>> >
>>
>> Only kidding ——> :-)
>>
>> >
>> > Every other stage here (up to training) is usually either batch or
>> > streaming. Data frames are to be used primarily in featurization and
>> > vectorization, which are either streaming (in the Spark/Storm sense) or
>> > batch. These stages can benefit from data frames' fast columnar
>> > organization, which allows fast multiple passes. I can imagine some
>> > methodologies in training _may_ work better off data frames too, rather
>> > than off matrices.
>> >
>> > Hope that clarifies.
>> >
>> >
>>
>> Well, that brings us to the real question: if we need to serialize a DRM
>> with user-specified row and column IDs restored, do you expect some future
>> dataframe will support this well? I'd guess this would be some kind of
>> .map over rows. Like this, only getting ID values from the dataframe:
>>
>>       matrix.rdd.map({ case (rowID, itemVector) =>
>>         var line: String = rowIDDictionary.inverse.get(rowID) + outDelim1
>>         for (item <- itemVector.nonZeroes()) {
>>           line += columnIDDictionary.inverse.get(item.index) + outDelim2 +
>>             item.get + outDelim3
>>         }
>>         // drop the trailing delimiter (assumes outDelim3 is one character)
>>         line.dropRight(1)
>>       }).saveAsTextFile(dest)
>>
>> A similar question applies to deserializing or building a dataframe. I
>> ask because IndexedDataset uses Guava HashBiMaps in memory on all
>> cluster machines. That seems like a potential scaling issue, but then a
>> distributed HashMap is called a database.
>
>
>
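For reference, the serialization pattern in the quoted snippet can be sketched self-contained, with plain Scala collections standing in for the RDD and the Mahout vector; the dictionaries, delimiters, and data here are all hypothetical:

```scala
// Hypothetical ID dictionaries (playing the role of HashBiMap.inverse).
val rowNames = Map(0 -> "user1", 1 -> "user2")
val colNames = Map(0 -> "itemA", 1 -> "itemB")

// A tiny sparse matrix: rowID -> non-zero (column, value) pairs.
val matrix = Seq(
  0 -> Seq((0, 1.5), (1, 2.0)),
  1 -> Seq((1, 3.0)))

val outDelim1 = "\t"
val outDelim2 = ":"
val outDelim3 = ","

// Build one text line per row; mkString joins with the delimiter between
// elements, so no trailing-delimiter dropRight(1) is needed.
val lines = matrix.map { case (rowID, vector) =>
  rowNames(rowID) + outDelim1 +
    vector.map { case (col, v) => colNames(col) + outDelim2 + v }
      .mkString(outDelim3)
}
// On a real RDD this would end with .saveAsTextFile(dest).
```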
