Odd that the k-means implementation isn’t being used to demonstrate performance. Anyone could grab it, run it on the same data as MLlib, and do a principled analysis. Or just run the same data through H2O and MLlib. That seems like a good way to look at the forest instead of the trees.
BTW any generalization effort to support two execution engines will have to abstract away the SparkContext. This is where IO, job control, and engine tuning happen. Abstracting the DSL is not sufficient. Any hypothetical MahoutContext (a good idea for sure), if it deviated significantly from a SparkContext, would have broad impact.

http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext

On May 1, 2014, at 8:40 AM, Cliff Click <[email protected]> wrote:

H2O will launch an internal Task in the single-digit microsecond range. Because of this, we can launch 100,000s (millions?) a second... leading to fine-grained data parallelism, and high CPU utilization. This is a big piece of our single-node speed.

Some other distributed Task-launching solutions I've seen tend to require a network hop per task... leading to your 10 ms-to-launch-a-task requirement, leading to a limit of a few thousand Tasks/sec, requiring tasks that are much larger and coarser than H2O's... leading to much lower CPU utilization.

Also, I'm getting 200-microsecond pings between my datacenter machines... down from 10 msec. It's decent commodity hardware, nothing special.

Meaning: H2O can launch a task on an entire 32-node cluster in about 1 msec, starting from a single driving node (log-tree fanout, depth 5, 200-microsecond single UDP packet launch, 1-microsecond internal task launch). And this latency matters when the work itself is lots and lots of "small" jobs, as is common when a DSL such as Mahout or Spark/Scala or R is driving simple operators over bulk data.

Cliff

On 4/30/2014 3:35 PM, Dmitriy Lyubimov wrote:
> This is kind of old news. They all do this, and have for years now. I've
> been building a system that does real-time distributed pipelines (~30 ms
> to start all steps in pipeline + in-core complexity) for years. Note that
> node-to-node hops in clouds usually average about 10 ms, so microseconds
> are kind of out of the question for network performance reasons in real
> life, except for private racks. The only thing that doesn't do this is the
> MR variety of Hadoop.
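
For anyone who wants to sanity-check the "about 1 msec for 32 nodes" figure, here is a rough back-of-envelope sketch in Python. It is not H2O code: the function names are made up for illustration, and it assumes a simple doubling fanout (every node that already has the task forwards it to one new node per round), which is one way to read "log-tree fanout, depth 5". The 200-microsecond hop and 1-microsecond internal launch are the numbers quoted in the thread; the 10 ms hop is the cloud figure Dmitriy mentions.

```python
def tree_depth(nodes, fanout=2):
    """Rounds needed for a launch to reach `nodes` machines when the
    number of machines holding the task multiplies by `fanout` each round.
    (Assumed model, not taken from H2O's implementation.)"""
    depth, reached = 0, 1
    while reached < nodes:
        reached *= fanout
        depth += 1
    return depth


def tree_launch_latency_us(nodes, hop_us, task_launch_us, fanout=2):
    """Estimated microseconds until the last node has started its task:
    each round costs one network hop plus one internal task launch."""
    return tree_depth(nodes, fanout) * (hop_us + task_launch_us)


if __name__ == "__main__":
    # Figures quoted in the thread: 32 nodes, 200 us hop, 1 us internal launch.
    private_rack = tree_launch_latency_us(32, hop_us=200, task_launch_us=1)
    # Same tree, but with the ~10 ms cloud hop mentioned in the quoted reply.
    cloud = tree_launch_latency_us(32, hop_us=10_000, task_launch_us=1)
    print(f"depth: {tree_depth(32)}")
    print(f"200 us hops: {private_rack} us")
    print(f"10 ms hops:  {cloud} us")
```

Under these assumptions the network hop dominates: the internal launch cost barely registers, and the total scales with hop latency times tree depth, which is why the two hop figures in the thread lead to such different conclusions.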
