H2O will launch an internal Task in the single-digit microsecond range.
Because of this, we can launch hundreds of thousands (millions?) of Tasks
a second, leading to fine-grained data parallelism and high CPU
utilization. This is a big piece of our single-node speed. Some other
distributed Task-launching solutions I've seen tend to require a network
hop per Task. That means roughly 10ms to launch each Task, which caps you
at a few thousand Tasks/sec and forces Tasks that are much larger and
coarser than H2O's, leading to much lower CPU utilization.
Also, I'm getting 200-microsecond pings between my datacenter
machines, down from 10msec. It's decent commodity hardware, nothing
special. Meaning: H2O can launch a Task on an entire 32-node cluster in
about 1msec, starting from a single driving node (log-tree fanout, depth
5, 200 microseconds per single-UDP-packet launch hop, 1 microsecond for
the internal Task launch).
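The 1msec figure can be checked with a small Python sketch (again, not H2O code). The function names are hypothetical. It assumes a binary log-tree in which every node that already has the Task forwards it to one new node per round, so the reached set doubles each hop, and it counts the internal Task launch once at the end of the chain.

```python
def fanout_depth(nodes: int) -> int:
    """Rounds needed to reach `nodes` machines when the reached set doubles each round."""
    depth, reached = 0, 1  # start from the single driving node
    while reached < nodes:
        reached *= 2
        depth += 1
    return depth


def cluster_launch_latency_us(nodes: int, hop_us: int, task_launch_us: int) -> int:
    """Time until the last node starts its Task, in microseconds.

    Assumes hops on the critical path are sequential and the internal
    Task launch is paid once on the final node.
    """
    return fanout_depth(nodes) * hop_us + task_launch_us


if __name__ == "__main__":
    depth = fanout_depth(32)
    total = cluster_launch_latency_us(32, hop_us=200, task_launch_us=1)
    print("depth:", depth, "total microseconds:", total)
```

With 32 nodes, 200 microseconds per hop and 1 microsecond for the internal launch, the depth is 5 and the total is 5 * 200 + 1 = 1001 microseconds, i.e. about 1msec.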
And this latency matters when the work itself is lots and lots of "small"
jobs, as is common when a DSL such as Mahout or Spark/Scala or R is
driving simple operators over bulk data.
Cliff
On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
This is kind of old news. They all do, and have for years now. I've been
building a system that does real-time distributed pipelines (~30 ms to
start all steps in a pipeline + in-core complexity) for years. Note that
node-to-node hops in clouds usually average about 10ms, so
microseconds are kind of out of the question for network performance
reasons in real life, except for private racks. The only thing that
doesn't do this is the MR variety of Hadoop.