Odd that the KMeans implementation isn't used as a way to demonstrate performance. 
Seems like anyone could grab it, try it with the same data on MLlib, and 
perform a principled analysis. Or just run the same data through both H2O and MLlib. 
This seems like a good way to look at the forest instead of the trees.
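A principled comparison along those lines mostly needs a shared harness: identical input, warm-up runs, and a median over repeated timings. A minimal sketch, assuming a hypothetical harness where the Runnable stands in for "run KMeans on engine X with the shared data set" (nothing here is from MLlib or H2O itself):

```java
import java.util.Arrays;

public class Bench {
    // Time a job several times and report the median, after warm-up runs.
    // The Runnable is a stand-in for invoking one engine's KMeans on the
    // shared input; swap in MLlib and H2O calls to compare like for like.
    static long medianNanos(Runnable job, int warmups, int reps) {
        for (int i = 0; i < warmups; i++) job.run();
        long[] samples = new long[reps];
        for (int i = 0; i < reps; i++) {
            long t0 = System.nanoTime();
            job.run();
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        return samples[reps / 2];
    }

    public static void main(String[] args) {
        // Dummy workload in place of a real KMeans run.
        long ns = medianNanos(() -> {
            double acc = 0;
            for (int i = 0; i < 1_000_000; i++) acc += Math.sqrt(i);
            if (acc < 0) throw new IllegalStateException();
        }, 3, 11);
        System.out.println("median ns: " + ns);
    }
}
```

Using a median over several repetitions, rather than a single timing, keeps JIT warm-up and GC pauses from dominating the comparison.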

BTW, any generalization effort to support two execution engines will have to 
abstract away the SparkContext. That is where IO, job control, and engine 
tuning happen. Abstracting the DSL is not sufficient. Any hypothetical 
MahoutContext (a good idea for sure), if it deviated significantly from 
SparkContext, would have broad impact.

http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
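To make the point concrete, whatever engine-neutral context emerges has to own the same responsibilities SparkContext owns today. A hypothetical sketch (the interface and all names below are invented for illustration, not from any actual Mahout proposal):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical engine-neutral context: the surface where IO, job control,
// and engine tuning would have to live, regardless of the backing engine.
interface DistributedContext extends AutoCloseable {
    void setEngineProperty(String key, String value); // engine tuning knobs
    String engineProperty(String key);
    String engineName();                              // e.g. "spark" or "h2o"
    void close();                                     // job/resource control
}

// Trivial in-memory stub standing in for a Spark- or H2O-backed implementation.
class StubContext implements DistributedContext {
    private final String engine;
    private final Map<String, String> props = new HashMap<>();

    StubContext(String engine) { this.engine = engine; }

    public void setEngineProperty(String k, String v) { props.put(k, v); }
    public String engineProperty(String k) { return props.get(k); }
    public String engineName() { return engine; }
    public void close() { props.clear(); }
}
```

The naming doesn't matter: if the DSL compiles down to engine-neutral operators but IO and tuning still reach for a raw SparkContext, a second engine cannot plug in.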


On May 1, 2014, at 8:40 AM, Cliff Click <[email protected]> wrote:

H2O will launch an internal Task in the single-digit microsecond range.  
Because of this, we can launch 100,000's (millions?) a second... leading to 
fine-grained data parallelism, and high CPU utilization.  This is a big piece 
of our single-node speed.  Some other distributed Task-launching solutions I've 
seen tend to require a network hop per task... leading to your 10ms-to-launch-a-task 
requirement, leading to a limit of a few 1000 Tasks/sec, requiring tasks 
that are much larger and coarser than H2O's... leading to much lower CPU 
utilization.

Also, I'm getting 200-microsecond pings between my datacenter machines... 
down from 10msec.  It's decent commodity hardware, nothing special.  Meaning: 
H2O can launch a task on an entire 32-node cluster in about 1msec, starting from 
a single driving node (log-tree fanout, depth 5, 200-microsecond single-UDP-packet 
launch, 1-microsecond internal task launch).
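The cluster-launch figure follows directly from the tree fanout: with a binary fanout over 32 nodes the broadcast tree is 5 levels deep, so the critical path is 5 sequential 200-microsecond packet sends plus the 1-microsecond local launch. A quick check of that arithmetic, using only the numbers quoted in the message above:

```java
public class LaunchLatency {
    // Critical-path latency (microseconds) for a log-tree broadcast launch:
    // depth = log_fanout(nodes) sequential hops, plus one local task launch.
    static double totalMicros(int nodes, int fanout,
                              double hopMicros, double localLaunchMicros) {
        int depth = 0;
        for (long covered = 1; covered < nodes; covered *= fanout) depth++;
        return depth * hopMicros + localLaunchMicros;
    }

    public static void main(String[] args) {
        // 32 nodes, binary fanout, 200us UDP hop, 1us internal launch.
        double us = totalMicros(32, 2, 200.0, 1.0);
        System.out.printf("total launch latency: %.0f microseconds%n", us);
        // 5 * 200us + 1us = 1001us, i.e. about 1 msec
    }
}
```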

And this latency matters when the work itself is lots and lots of "small" jobs, as 
is common when a DSL such as Mahout or Spark/Scala or R is driving simple 
operators over bulk data.

Cliff


On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
> This is kind of old news. They all do, and have for years now. I've been building a 
> system that does real-time distributed pipelines (~30 ms to start all steps 
> in a pipeline + in-core complexity) for years. Note that node-to-node hops in 
> clouds usually average about 10ms, so microseconds are kind of out of the 
> question for network-performance reasons in real life, except on private 
> racks. The only thing that doesn't do this is the MR variety of Hadoop. 

