Thank you, Cliff, this is excellent. Obviously you guys have invested a lot of time and private money in the engineering aspects of vector serialization and bytecode instrumentation.
What I really want to ask you is this. I asked this question before and did not get an answer, but I assume you now know a bit more about Mahout. What is your opinion/vision on actually _integrating_ with Mahout?

Integration effort, in my definition, would be (1) reusing some of Mahout's implementations, and/or (2) helping some of Mahout's algorithms/components do their job better. What you have been doing to date roughly amounts to building (porting) Mahout (or non-Mahout) algorithms for H2O. This, by definition, is not an integration effort and could happily run forever without ever requiring a Mahout commit.

I would be interested to hear your thoughts again on what you think it means to _integrate_ with Mahout.

On Thu, May 1, 2014 at 8:40 AM, Cliff Click <[email protected]> wrote:

> H2O will launch an internal Task in the single-digit microsecond range.
> Because of this, we can launch 100,000's (millions?) a second... leading
> to fine-grained data parallelism, and high CPU utilization. This is a big
> piece of our single-node speed. Some other distributed Task-launching
> solutions I've seen tend to require a network-hop per-task... leading to
> your 10ms to launch as task requirement, leading to a limit of a few 1000
> Tasks/sec requiring tasks that are much larger and coarser than H2O's...
> leading to much lower CPU utilization.
>
> Also, I'm getting 200micro-second ping's between my datacenter
> machines.... down from 10msec. It's decent commodity hardware, nothing
> special. Meaning: H2O can launch task on an entire 32-node cluster in
> about 1msec, starting from a single driving node (log-tree fanout, depth 5,
> 200micro-second single UDP packet launch, 1micro-second internal task
> launch).
>
> And this latency matters when the work itself is lots and lots "small"
> jobs, as is common when a DSL such as Mahout or Spark/Scala or R is driving
> simple operators over bulk data.
>
> Cliff
>
> On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
>
>> This is kind of an old news. They all do, for years now. I've been
>> building a system that does real time distributed pipelines (~30 ms to
>> start all steps in pipeline + in-core complexity) for years. Note that
>> node-to-node hop in clouds are usually mean at about 10ms so microseconds
>> are kind of out of question for network performance reasons in real life
>> except for private racks. The only thing that doesn't do this is the MR
>> variety of Hadoop.
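
For reference, the launch-latency figures quoted above (32 nodes, log-tree fanout of depth 5, 200-microsecond hop, 1-microsecond internal task launch) can be sanity-checked with a few lines of Python. This is only a back-of-envelope sketch, not H2O code: the function names are made up for illustration, and it assumes a binary fanout tree (which is what depth 5 for 32 nodes implies) where each level costs one network hop plus one local task launch.

```python
def tree_depth(nodes, fanout=2):
    """Levels needed for a fanout tree to reach `nodes` machines.

    Assumption: binary fanout by default, matching "depth 5" for 32 nodes.
    """
    depth, reached = 0, 1
    while reached < nodes:
        reached *= fanout
        depth += 1
    return depth


def launch_latency_us(nodes, hop_us, local_launch_us, fanout=2):
    """Estimated microseconds until the task is launched on every node.

    Assumption: each tree level costs one network hop plus one local launch.
    """
    return tree_depth(nodes, fanout) * (hop_us + local_launch_us)


if __name__ == "__main__":
    # Figures from the quoted message: 200 us hop, 1 us internal launch.
    print(launch_latency_us(32, 200, 1))     # 1005 us, i.e. about 1 ms
    # Same tree with the ~10 ms cloud hop mentioned earlier in the thread.
    print(launch_latency_us(32, 10_000, 1))  # 50005 us, i.e. about 50 ms
```

Under these assumptions the network hop dominates in both cases; the 1-microsecond internal launch only matters for the fine-grained tasks launched within a single node.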
