[
https://issues.apache.org/jira/browse/MAHOUT-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004783#comment-14004783
]
Cliff Click commented on MAHOUT-1490:
-------------------------------------
I didn't see any obvious errors in your responses.
These complaints about inflation/deflation on random access are "true" but
generally groundless worries.
If you are doing truly random access, then performance of ALL algos is gonna
suck; ALL modern hardware, and certainly every x86, is heavily optimized around
re-use in space and time, especially the obvious linear-access case. It's just
plain good Physics as to why the world works that way. There's an easy 10x to
100x or better, going in a straight line over all the data, vs randomly popping
about. Compression/Decompression ain't gonna matter here; it's all about
Physics and trading off latency vs bandwidth.
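
A rough way to see the linear-vs-random gap for yourself is sketched below. It
is an illustration only: the class and method names are made up for this
example, the array size is an assumption chosen to exceed typical cache sizes,
and the timing is a single unwarmed pass rather than a proper benchmark. Note
that the random pass also reads the index array linearly, so it slightly
understates the pure cost of random access.

```java
import java.util.Random;

public class AccessOrderDemo {
    // Sum the array front to back; the hardware prefetcher can stream this.
    public static long sumLinear(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    // Sum the same elements, visited in the order given by idx.
    public static long sumByIndex(long[] data, int[] idx) {
        long sum = 0;
        for (int i = 0; i < idx.length; i++) sum += data[idx[i]];
        return sum;
    }

    // Fisher-Yates shuffle of 0..n-1, seeded so runs are repeatable.
    public static int[] shuffledIndices(int n, long seed) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        return idx;
    }

    public static void main(String[] args) {
        int n = 1 << 23; // 8M longs = 64 MB (assumed larger than the caches)
        long[] data = new long[n];
        for (int i = 0; i < n; i++) data[i] = i;
        int[] idx = shuffledIndices(n, 42L);

        long t0 = System.nanoTime();
        long a = sumLinear(data);
        long t1 = System.nanoTime();
        long b = sumByIndex(data, idx);
        long t2 = System.nanoTime();

        // Both sums are identical; only the visiting order differs.
        System.out.println("linear ms: " + (t1 - t0) / 1_000_000 + " sum=" + a);
        System.out.println("random ms: " + (t2 - t1) / 1_000_000 + " sum=" + b);
    }
}
```

The actual ratio depends on the machine, the JIT, and the data size, so treat
any number it prints as indicative only.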
I think physics is gonna dictate that random-access algos are gonna lose out
to bulk algos, just because you can get so much more work done in the same
period of time. Perhaps there's a middle ground where, at the cost of 1
random access, you grab the 100 nearby neighbors and do work with 100
elements instead of 1.
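
The middle ground could look something like the sketch below: each random
jump is followed by a short linear run over the neighbors. The names
(NeighborhoodBatch, sumWindows, width) are hypothetical and exist only for
this illustration; summing stands in for whatever per-element work the
algorithm really does.

```java
public class NeighborhoodBatch {
    // For each randomly chosen start position, process a contiguous window
    // of up to `width` elements, clamped at the end of the array. One random
    // access is paid per window, then the rest of the window is linear.
    public static double sumWindows(double[] data, int[] starts, int width) {
        double total = 0;
        for (int s : starts) {
            int end = Math.min(data.length, s + width);
            for (int i = s; i < end; i++) total += data[i];
        }
        return total;
    }
}
```

With width = 100 this pays for one scattered access per 100 elements of
work, instead of one per element.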
The inflate/deflate cycle only kicks in if you're dramatically changing the
"shape" of the data. That is rarely the case.
Example: hacking tree or array-indices; the indices are always small integers
even as they change constantly. They compress handily back into any of the
small-integer formats.
Example: hacking regression values in an iterative algo; always the predictors
are "floats" or "doubles", and handily get stored back into the standard
float/double "not really compressed" formats.
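
To make the tree/array-index example concrete, here is a minimal sketch of
picking the narrowest integer width for a set of values. It is NOT the actual
set of compressed formats discussed in this thread; the class and method
names are hypothetical and only standard Java integer widths are considered.
The point is that updating small indices to other small indices leaves the
chosen width unchanged, so nothing has to be re-inflated.

```java
public class SmallIntWidth {
    // Narrowest signed width, in bytes, that can hold every value.
    // An empty array trivially fits in 1 byte.
    public static int bytesNeeded(long[] vals) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : vals) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        if (min >= Byte.MIN_VALUE && max <= Byte.MAX_VALUE) return 1;
        if (min >= Short.MIN_VALUE && max <= Short.MAX_VALUE) return 2;
        if (min >= Integer.MIN_VALUE && max <= Integer.MAX_VALUE) return 4;
        return 8;
    }

    public static void main(String[] args) {
        long[] idx = {0, 5, 100};
        int before = bytesNeeded(idx);
        idx[1] = 90; // the index changes, but it is still a small integer
        int after = bytesNeeded(idx);
        System.out.println("width before=" + before + " after=" + after);
    }
}
```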
What IS true, and expensive, is the open/close cycle we use around Chunks to
track when changes happen (visibility of changes & coherence around the
cluster). Random reads don't pay this, but random writes do. Normally this
cost is amortized over visiting the entire Chunk, but it's very real if you are
reading only a few elements and writing at least once.
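
The amortization argument can be illustrated with a toy model. Everything
here is an assumption made for the sketch: ToyChunk is a hypothetical class,
not the real Chunk API, and a simple counter stands in for the real cost of
publishing changes around the cluster.

```java
public class ToyChunk {
    private final double[] vals;
    private boolean open = false;
    private int closeCount = 0; // stand-in for the cost of publishing changes

    public ToyChunk(int n) { vals = new double[n]; }

    public void open() { open = true; }

    public void set(int i, double v) {
        if (!open) throw new IllegalStateException("chunk not open for writing");
        vals[i] = v;
    }

    public void close() { open = false; closeCount++; }

    public int closeCount() { return closeCount; }

    public int length() { return vals.length; }

    // Bulk pattern: one open/close cycle covers a write to every element.
    public static int bulkWrite(ToyChunk c) {
        c.open();
        for (int i = 0; i < c.length(); i++) c.set(i, i);
        c.close();
        return c.closeCount();
    }

    // Scattered pattern: each isolated write pays for its own cycle.
    public static int scatteredWrite(ToyChunk c, int[] positions) {
        for (int p : positions) {
            c.open();
            c.set(p, 1.0);
            c.close();
        }
        return c.closeCount();
    }
}
```

In this model a bulk pass over 1000 elements pays for one cycle, while three
scattered writes pay for three, which is the per-element cost difference
described above.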
Cliff
> Data frame R-like bindings
> --------------------------
>
> Key: MAHOUT-1490
> URL: https://issues.apache.org/jira/browse/MAHOUT-1490
> Project: Mahout
> Issue Type: New Feature
> Reporter: Saikat Kanjilal
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Original Estimate: 20h
> Remaining Estimate: 20h
>
> Create Data frame R-like bindings for spark