Wow, the Mahout mailing list is busy. Pardon me if I don't keep up with all that flows through....

The compression/decompression stuff is tightly coupled with a few parts of H2O. The CSV file parser does compression on data-inhale, and this is crucial in parsing datasets that fit in DRAM - after compression, but not before compression - without requiring an intermediate spill-to-disk. (Also "CSV" is loosely defined to cover any line/field-oriented text file, including e.g. Hive files, tab-separated, etc.)

The data is aligned across the cluster in a way that allows fast parallel access in the common cases of e.g. R data-frame-like access. The column count is expected to be modest - a few thousand up to a max of perhaps a million. There is no limit on rows (other than what fits in DRAM). The compression follows the alignment in DRAM.

If you want "just the compression", you'll have to take the column-alignment and distribution also.
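
To make the "compression follows the alignment" point concrete, here is a rough sketch of the idea - my own illustration, not H2O's actual chunk classes: each node holds column slices compressed in DRAM, and values are decompressed element-by-element on access rather than ever being expanded wholesale.

    // Illustration only (my sketch, not H2O's actual classes): a column
    // slice held compressed in DRAM.  Doubles that are really small
    // integers get packed as one byte each plus an offset/scale, and are
    // decompressed on access.
    final class CompressedDoubleColumn {
      private final byte[] packed;  // 1 byte per row instead of 8
      private final double offset;  // bias applied on read
      private final double scale;   // multiplier applied on read

      CompressedDoubleColumn(byte[] packed, double offset, double scale) {
        this.packed = packed;
        this.offset = offset;
        this.scale  = scale;
      }

      // Decompress one element: a couple of arithmetic ops, no branching.
      double at(int row) {
        return offset + scale * (packed[row] & 0xFF);
      }

      int len() { return packed.length; }
    }

An 8x reduction like this is exactly why a dataset can fit in DRAM after compression but not before.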

Ted mentioned byte-code injection - H2O does that for serialization of POJO's (Plain Olde Java Objects). I'll warrant that H2O has the fastest serialize/deserialize path on the planet (with heavy compression of POJO primitive arrays specifically for ML algorithms)... but it's not directly tied to the compression of Big Data in RAM. The next step up in speed for decompressing the Big Data WOULD be byte-code injection, however. So far we're outrunning memory bandwidth and don't need BCI for the Big Data (but clearly need it for the POJOs).
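
For flavor, here is roughly what injected serialization code amounts to - hypothetical names, my own sketch, not H2O's actual generated code: a per-class serializer with straight-line field reads/writes and no reflection in the hot path.

    // Hypothetical sketch: the kind of per-class serializer a byte-code
    // weaver would generate for a POJO.  Straight-line field writes,
    // no reflection in the hot path.  Not H2O's real generated code.
    import java.nio.ByteBuffer;

    final class Point {             // example POJO
      int id;
      double x, y;
    }

    final class PointSerializer {   // what a weaver might emit for Point
      static void write(Point p, ByteBuffer buf) {
        buf.putInt(p.id);
        buf.putDouble(p.x);
        buf.putDouble(p.y);
      }
      static Point read(ByteBuffer buf) {
        Point p = new Point();
        p.id = buf.getInt();
        p.x  = buf.getDouble();
        p.y  = buf.getDouble();
        return p;
      }
    }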

Cliff


On 4/30/2014 4:49 PM, Ted Dunning wrote:
I couldn't say.

Let's invite 0xdata to show us what can happen.




On Thu, May 1, 2014 at 1:39 AM, Dmitriy Lyubimov <[email protected]> wrote:

This is interesting. And this happens in one node. Can it be decoupled from parallelization concerns and re-used? (proposal D)


On Wed, Apr 30, 2014 at 4:09 PM, Ted Dunning <[email protected]> wrote:

I should add that the way the compression is done is pretty cool for speed. The basic idea is that byte code engineering is used to directly inject the decompression and compression code into the user code. This allows format conditionals to be hoisted outside the parallel loop entirely. This drops decompression overhead to just a few cycles. This is necessary because the point is to allow the inner loop to proceed at L1 speeds instead of L3 speeds (really L3 / compression ratio).
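
A toy illustration of that hoisting (my sketch, not the actual generated code): instead of testing the compression format per element inside the hot loop, a specialized branch-free loop exists per format, which is what the byte-code injection effectively bakes into the user code.

    // Toy sketch of hoisting a format conditional out of the inner loop.
    final class HoistingSketch {
      // Naive: the format test sits inside the hot loop.
      static double sumNaive(byte[] packed, int format, double scale) {
        double sum = 0;
        for (int i = 0; i < packed.length; i++) {
          if (format == 1) sum += (packed[i] & 0xFF) * scale;  // scaled bytes
          else             sum += (packed[i] & 0xFF);          // raw bytes
        }
        return sum;
      }

      // Hoisted: the format is chosen once, and each case runs a
      // branch-free inner loop that decompresses in a few cycles.
      static double sumHoisted(byte[] packed, int format, double scale) {
        double sum = 0;
        if (format == 1) {
          for (int i = 0; i < packed.length; i++) sum += (packed[i] & 0xFF) * scale;
        } else {
          for (int i = 0; i < packed.length; i++) sum += (packed[i] & 0xFF);
        }
        return sum;
      }
    }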



On Thu, May 1, 2014 at 12:35 AM, Dmitriy Lyubimov <[email protected]> wrote:

On Wed, Apr 30, 2014 at 3:24 PM, Ted Dunning <[email protected]> wrote:

Inline


On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <[email protected]> wrote:

On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <[email protected]> wrote:

My motivation to accept comes from the fact that they have machine learning codes that are as fast as what Google has internally. They completely crush all of the Spark efforts on speed.

Correct me if I am wrong: H2O's performance strengths come from the speed of in-core computations and efficient compression (that's what I heard, at least).

Those two factors are key. In addition, the ability to dispatch parallel computations with microsecond latencies is also important, as is the ability to transparently communicate at high speeds between processes, both local and remote.

This is kind of old news - they've all done this for years now. I've been building a system that does real-time distributed pipelines (~30 ms to start all steps in the pipeline, plus in-core complexity) for years. Note that node-to-node hops in clouds usually average about 10 ms, so microseconds are kind of out of the question for network-performance reasons in real life, except on private racks.

The only thing that doesn't do this is the MR variety of Hadoop.

