Thank you Cliff, this is excellent. Obviously you guys invested a lot of
time and private money in the engineering aspects of vector serialization
and bytecode instrumentation.

What I really want to ask you is this. I asked this question before and
did not get an answer. But I assume you now know a bit more about Mahout.

What is your opinion/vision on actually _integrating_ with Mahout?

An integration effort, in my definition, would mean (1) reusing some of
Mahout's implementations, and/or (2) helping some of Mahout's
algorithms/components do their job better.

What you have been doing to date amounts roughly to building (porting)
Mahout (or non-Mahout) algorithms for H2O. This, by definition, is not an
integration effort and could happily run forever without ever requiring a
Mahout commit.

I would be interested to hear your thoughts again on what you think it
means to _integrate_ with Mahout.
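As a side note, the fanout-latency numbers quoted below are easy to sanity-check. Here is a quick back-of-the-envelope sketch (my own, not from the H2O code; it assumes a binary fanout tree and uses the per-hop and per-launch costs quoted in the thread):

```python
import math

nodes = 32
udp_hop_s = 200e-6     # quoted datacenter ping (single UDP packet hop)
local_launch_s = 1e-6  # quoted internal task-launch cost

# Binary log-tree fanout: each level doubles the reached nodes,
# so 32 nodes need ceil(log2(32)) = 5 levels of hops.
depth = math.ceil(math.log2(nodes))

# Hops dominate; the local launch cost is noise by comparison.
total_s = depth * udp_hop_s + local_launch_s

print(depth)                # 5
print(total_s * 1e3, "ms")  # ~1.0 ms, consistent with "about 1msec"
```

So the "depth 5, about 1msec for 32 nodes" figure is internally consistent with the 200-microsecond ping claim.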



On Thu, May 1, 2014 at 8:40 AM, Cliff Click <[email protected]> wrote:

> H2O will launch an internal Task in the single-digit microsecond range.
>  Because of this, we can launch 100,000's (millions?) per second... leading
> to fine-grained data parallelism, and high CPU utilization.  This is a big
> piece of our single-node speed.  Some other distributed Task-launching
> solutions I've seen tend to require a network hop per task... leading to
> your 10ms-to-launch-a-task requirement, leading to a limit of a few thousand
> Tasks/sec, requiring tasks that are much larger and coarser than H2O's...
> leading to much lower CPU utilization.
>
> Also, I'm getting 200-microsecond pings between my datacenter
> machines.... down from 10msec.  It's decent commodity hardware, nothing
> special.  Meaning: H2O can launch a task on an entire 32-node cluster in
> about 1msec, starting from a single driving node (log-tree fanout, depth 5,
> 200-microsecond single-UDP-packet launch, 1-microsecond internal task
> launch).
>
> And this latency matters when the work itself is lots and lots of "small"
> jobs, as is common when a DSL such as Mahout or Spark/Scala or R is driving
> simple operators over bulk data.
>
> Cliff
>
>
>
> On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
>
>> This is kind of old news. They all do, for years now. I've been
>> building a system that does real-time distributed pipelines (~30 ms to
>> start all steps in a pipeline + in-core complexity) for years. Note that
>> node-to-node hops in clouds usually average about 10ms, so microseconds
>> are kind of out of the question for network-performance reasons in real
>> life, except on private racks. The only thing that doesn't do this is the
>> MR variety of Hadoop.
>>
>
>
