Wow, the Mahout mailing list is busy. Pardon me if I don't keep up with all that flows through....

The compression/decompression stuff is tightly coupled with a few parts of H2O. The CSV file parser does compression on data-inhale, and this is crucial in parsing datasets that fit in DRAM - after compression, but not before compression - without requiring an intermediate spill-to-disk. (Also "CSV" is loosely defined to cover any line/field-oriented text file, including e.g. Hive files, tab-separated, etc.)

The data is aligned across the cluster in a way that allows fast parallel access in the common cases of e.g. R data-frame-like access. The column count is expected to be modest - a few thousand up to a max of perhaps a million. There is no limit on rows (other than what fits in DRAM). The compression follows the alignment in DRAM.

If you want "just the compression", you'll have to take the column-alignment and distribution also.
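
To make the "compression follows the alignment" point concrete, here is a rough sketch of the idea - my own illustration, not H2O's actual chunk classes: each node holds column slices compressed in DRAM, and values are decompressed element-by-element on access rather than ever being expanded wholesale.

    // Illustration only (my sketch, not H2O's actual classes): a column
    // slice held compressed in DRAM.  Doubles that are really small
    // integers get packed as one byte each plus an offset/scale, and are
    // decompressed on access.
    final class CompressedDoubleColumn {
      private final byte[] packed;  // 1 byte per row instead of 8
      private final double offset;  // bias applied on read
      private final double scale;   // multiplier applied on read

      CompressedDoubleColumn(byte[] packed, double offset, double scale) {
        this.packed = packed;
        this.offset = offset;
        this.scale  = scale;
      }

      // Decompress one element: a couple of arithmetic ops, no branching.
      double at(int row) {
        return offset + scale * (packed[row] & 0xFF);
      }

      int len() { return packed.length; }
    }

An 8x reduction like this is exactly why a dataset can fit in DRAM after compression but not before.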

Ted mentioned byte-code injection - H2O does that for serialization of POJO's (Plain Olde Java Objects). I'll warrant that H2O has the fastest serialize/deserialize path on the planet (with heavy compression of POJO primitive arrays specifically for ML algorithms)... but it's not directly tied to the compression of Big Data in RAM. The next step up in speed for decompressing the Big Data WOULD be byte-code injection, however. So far we're outrunning memory bandwidth and don't need BCI for the Big Data (but clearly need it for the POJOs).
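
For flavor, here is roughly what injected serialization code amounts to - hypothetical names, my own sketch, not H2O's actual generated code: a per-class serializer with straight-line field reads/writes and no reflection in the hot path.

    // Hypothetical sketch: the kind of per-class serializer a byte-code
    // weaver would generate for a POJO.  Straight-line field writes,
    // no reflection in the hot path.  Not H2O's real generated code.
    import java.nio.ByteBuffer;

    final class Point {             // example POJO
      int id;
      double x, y;
    }

    final class PointSerializer {   // what a weaver might emit for Point
      static void write(Point p, ByteBuffer buf) {
        buf.putInt(p.id);
        buf.putDouble(p.x);
        buf.putDouble(p.y);
      }
      static Point read(ByteBuffer buf) {
        Point p = new Point();
        p.id = buf.getInt();
        p.x  = buf.getDouble();
        p.y  = buf.getDouble();
        return p;
      }
    }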

Cliff


On 4/30/2014 4:49 PM, Ted Dunning wrote:
I couldn't say.

Let's invite 0xdata to show us what can happen.




On Thu, May 1, 2014 at 1:39 AM, Dmitriy Lyubimov <[email protected]> wrote:

This is interesting. And this happens in one node. Can it be decoupled from parallelization concerns and re-used? (proposal D)


On Wed, Apr 30, 2014 at 4:09 PM, Ted Dunning <[email protected]> wrote:

I should add that the way the compression is done is pretty cool for speed. The basic idea is that byte code engineering is used to directly inject the decompression and compression code into the user code. This allows format conditionals to be hoisted outside the parallel loop entirely. This drops decompression overhead to just a few cycles. This is necessary because the point is to allow the inner loop to proceed at L1 speeds instead of L3 speeds (really L3 / compression ratio).
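
A toy illustration of that hoisting (my sketch, not the actual generated code): instead of testing the compression format per element inside the hot loop, a specialized branch-free loop exists per format, which is what the byte-code injection effectively bakes into the user code.

    // Toy sketch of hoisting a format conditional out of the inner loop.
    final class HoistingSketch {
      // Naive: the format test sits inside the hot loop.
      static double sumNaive(byte[] packed, int format, double scale) {
        double sum = 0;
        for (int i = 0; i < packed.length; i++) {
          if (format == 1) sum += (packed[i] & 0xFF) * scale;  // scaled bytes
          else             sum += (packed[i] & 0xFF);          // raw bytes
        }
        return sum;
      }

      // Hoisted: the format is chosen once, and each case runs a
      // branch-free inner loop that decompresses in a few cycles.
      static double sumHoisted(byte[] packed, int format, double scale) {
        double sum = 0;
        if (format == 1) {
          for (int i = 0; i < packed.length; i++) sum += (packed[i] & 0xFF) * scale;
        } else {
          for (int i = 0; i < packed.length; i++) sum += (packed[i] & 0xFF);
        }
        return sum;
      }
    }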



On Thu, May 1, 2014 at 12:35 AM, Dmitriy Lyubimov <[email protected]> wrote:

On Wed, Apr 30, 2014 at 3:24 PM, Ted Dunning <[email protected]> wrote:

Inline


On Wed, Apr 30, 2014 at 8:25 PM, Dmitriy Lyubimov <[email protected]> wrote:

On Wed, Apr 30, 2014 at 7:06 AM, Ted Dunning <[email protected]> wrote:

My motivation to accept comes from the fact that they have machine learning codes that are as fast as what Google has internally. They completely crush all of the Spark efforts on speed.

Correct me if I am wrong: H2O's performance strengths come from the speed of in-core computations and efficient compression (that's what I heard, at least).

Those two factors are key. In addition, the ability to dispatch parallel computations with microsecond latencies is also important, as is the ability to transparently communicate at high speeds between processes, both local and remote.

This is kind of old news - they've all done this for years now. I've been building a system that does real-time distributed pipelines (~30 ms to start all steps in the pipeline, plus in-core complexity) for years. Note that node-to-node hops in clouds usually average about 10 ms, so microseconds are kind of out of the question for network-performance reasons in real life, except on private racks.

The only thing that doesn't do this is the MR variety of Hadoop.

