[
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003903#comment-14003903
]
Anand Avati commented on MAHOUT-1490:
-------------------------------------
[~dlyubimov], Compression does not make it read-only, certainly not read-only
in the sense of Spark's RDDs. Data in a Frame is mutable. Depending on the type
of update, the update is either cheap (if the new value can replace the old
value in place) or expensive (inflate, update1, update2, update3 .. deflate),
but in either case it happens transparently behind the scenes; the user just
calls set(). However, for the DSL backend I intend to _not_ mutate Frames and
to treat them as read-only, to stay compatible with the Spark RDD model (even
though that might not be the most efficient choice in certain cases,
performance-wise). A sketch of that update path follows below.
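For concreteness, a minimal sketch of that transparent update path. The names
(ScaledByteChunk, mem, wide) and the byte-with-scale/bias encoding are assumptions
made for illustration, not the actual Frame/Chunk format:
{code:java}
// Hypothetical sketch only: a chunk of byte cells encoded as
// stored = (value - bias) / scale, with set() hiding cheap vs. expensive updates.
public class ScaledByteChunk {
  private byte[] mem;        // compressed cells
  private double scale, bias;
  private double[] wide;     // uncompressed fallback after an "expensive" update

  public ScaledByteChunk(byte[] mem, double scale, double bias) {
    this.mem = mem; this.scale = scale; this.bias = bias;
  }

  public double get(int i) {
    // dense decode: one multiply and one add
    return wide != null ? wide[i] : mem[i] * scale + bias;
  }

  // set() hides the cheap/expensive distinction from the caller.
  public void set(int i, double value) {
    if (wide != null) { wide[i] = value; return; }
    double stored = (value - bias) / scale;
    if (stored == Math.rint(stored)
        && stored >= Byte.MIN_VALUE && stored <= Byte.MAX_VALUE) {
      mem[i] = (byte) stored;  // cheap: new value fits the current encoding in place
    } else {
      wide = inflate();        // expensive: inflate, apply the update(s) ...
      wide[i] = value;         // ... a real chunk would then deflate to a wider encoding
    }
  }

  private double[] inflate() {
    double[] out = new double[mem.length];
    for (int i = 0; i < mem.length; i++) out[i] = mem[i] * scale + bias;
    return out;
  }
}
{code}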
Access to dense compressed data is constant time, with negligible
decompression overhead (one multiplication and one addition instruction with
operands in registers). The chunk header knows the compression's scale-down
factor, so fetching the compressed value is also a deterministic offset
lookup. For sparse data, however, the worst case is a binary search to find
the physical offset within a Chunk, though there are optimizations that make
further accesses in the same vicinity constant time.
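A similarly hypothetical sketch of the sparse access path: worst case is a
binary search over the stored row ids, with a cached position so that nearby
accesses afterwards are constant time. All names here are illustrative, not
the real sparse-chunk layout:
{code:java}
import java.util.Arrays;

public class SparseChunk {
  private final int[] rows;       // sorted row ids of the non-zero cells
  private final double[] values;  // values parallel to rows
  private int hint = 0;           // last physical offset found

  public SparseChunk(int[] rows, double[] values) {
    this.rows = rows; this.values = values;
  }

  public double get(int row) {
    // Fast path: the requested row is at or just after the previous hit.
    if (hint < rows.length && rows[hint] == row) return values[hint];
    if (hint + 1 < rows.length && rows[hint + 1] == row) { hint++; return values[hint]; }
    // Slow path: binary search for the physical offset within the chunk.
    int at = Arrays.binarySearch(rows, row);
    if (at < 0) return 0.0;       // row not stored => implicit zero
    hint = at;
    return values[at];
  }

  public static void main(String[] args) {
    SparseChunk c = new SparseChunk(new int[] {3, 17, 42}, new double[] {1.5, -2.0, 7.0});
    System.out.println(c.get(17)); // -2.0 via binary search
    System.out.println(c.get(42)); // 7.0 via the cached-vicinity fast path
  }
}
{code}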
> Data frame R-like bindings
> --------------------------
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
> Issue Type: New Feature
> Reporter: Saikat Kanjilal
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Original Estimate: 20h
> Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark
--
This message was sent by Atlassian JIRA
(v6.2#6252)