[
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005049#comment-14005049
]
Dmitriy Lyubimov commented on MAHOUT-1490:
------------------------------------------
OK, thank you all; it was helpful.
First, a few clarifications on the questions.
bq. Naive question - Are these "Data frame" bindings really for just
interactive use case? Or do we expect ML algos to be implemented on top of Data
frames (instead of just DRM/matrix)?
I don't know -- what are matrices vs. data frames in R? Same here. There are
algorithms that run on Data Frames. There are algorithms that run on matrices.
I can tell you what I need data frames for.
I need them for business-rule data manipulation per the dplyr/MLTable APIs, since
matrices do not support those.
I also need them to represent feature data such as text or categories, since
matrices do not support anything but real values.
I need DF for the so-called standardization (vectorization) of such features.
I need DF to build hashing-trick vectorization.
I probably need DF for outlier detection.
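
To make the hashing-trick item above concrete, here is a minimal sketch in plain Python. It is not Mahout code: the helper names (hash_index, hash_vectorize), the bucket count, and the example row are all hypothetical, and MD5 is used only to get a hash that is stable across runs.

```python
import hashlib


def hash_index(token, num_buckets):
    """Map a string token to a bucket using a hash that is stable across runs."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_vectorize(row, num_buckets=16):
    """Turn one data frame row (column name -> value) into a fixed-width
    real-valued vector. Numeric values are added to the bucket of their
    column name; text/category values add 1.0 to the bucket of "column=value".
    Colliding features simply sum into the same bucket."""
    vec = [0.0] * num_buckets
    for column, value in row.items():
        if isinstance(value, (int, float)):
            vec[hash_index(column, num_buckets)] += float(value)
        else:
            vec[hash_index(f"{column}={value}", num_buckets)] += 1.0
    return vec


# Hypothetical row mixing category and numeric features.
row = {"country": "US", "device": "mobile", "age": 34}
vec = hash_vectorize(row)
```

The point of the sketch is that the input has to carry non-numeric values (here strings), which is what a data frame can hold and a matrix cannot; the output is an ordinary real-valued vector.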
bq. It's just plain good Physics as to why the world works that way. <...?
Compression/Decompression ain't gonna matter here; it'
OK, I think we can agree that if we rewrite the entire vector, compression is not
just not going to matter; it is simply extra work on top of what is already being
done. In the business-rules code I wrote in R this past week for a new model feed,
I found that a more than trivial amount of it was me replacing a column with a
completely new column.
Here is what I think:
(1) we need both compressed and uncompressed representations of dense
beyond-numeric vectors.
(2) we should use compression whenever I/O serialization is involved.
(3) we should use compression whenever a cached checkpoint is created (as this
almost always implies repeated read re-use).
(4) Otherwise, a lazy compression policy by default: we don't compress a result
unless a specific API is invoked that instructs us to perform such a transformation for
requested columns explicitly (except for cases mentioned above).
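
The four points above could be sketched roughly as follows. This is an illustration in plain Python, not a proposed Mahout API: the Column and Frame classes, their method names, dictionary encoding as the compression scheme, and JSON as the serialization format are all assumptions made for the example.

```python
import json


class Column:
    """A string column held either raw or dictionary-encoded (point 1)."""

    def __init__(self, values):
        self._raw = list(values)
        self._dictionary = None
        self._codes = None

    @property
    def compressed(self):
        return self._codes is not None

    def compress(self):
        """Dictionary-encode in place; a no-op if already compressed."""
        if not self.compressed:
            self._dictionary = sorted(set(self._raw))
            index = {v: i for i, v in enumerate(self._dictionary)}
            self._codes = [index[v] for v in self._raw]
            self._raw = None
        return self

    def values(self):
        """Return the decoded values regardless of representation."""
        if self.compressed:
            return [self._dictionary[c] for c in self._codes]
        return list(self._raw)


class Frame:
    def __init__(self, columns):
        self.columns = dict(columns)

    def with_column(self, name, values):
        # Point 4: a replaced column stays uncompressed by default.
        new_columns = dict(self.columns)
        new_columns[name] = Column(values)
        return Frame(new_columns)

    def compress(self, *names):
        # Point 4: compression only when explicitly requested per column.
        for name in names:
            self.columns[name].compress()
        return self

    def checkpoint(self):
        # Point 3: a cached checkpoint implies re-reads, so compress everything.
        return self.compress(*self.columns)

    def serialize(self):
        # Point 2: always compress before I/O serialization.
        self.compress(*self.columns)
        return json.dumps({
            name: {"dictionary": col._dictionary, "codes": col._codes}
            for name, col in self.columns.items()
        })


frame = Frame({"city": Column(["NYC", "SF", "NYC"])})
replaced = frame.with_column("city", ["LA", "LA", "SF"])  # still uncompressed
replaced.checkpoint()                                     # now compressed
payload = frame.serialize()
```

In this sketch a whole-column replacement never pays for encoding unless the frame is later checkpointed, serialized, or explicitly compressed, which matches the column-replacement workload described earlier.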
> Data frame R-like bindings
> --------------------------
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
> Issue Type: New Feature
> Reporter: Saikat Kanjilal
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Original Estimate: 20h
> Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark