Naw - the KMeans was strictly a "sample" attempt to blend the H2O & Mahout coding styles.

Goal really is to get feedback from this group on how well that attempt is working.
Is there a better API?  What is it?
What can be improved?
How clumsy is the current marriage of H2OMatrix vs Matrix?
What's the mental cost of H2O's "tall skinny data" vs Mahout's All-The-World's-A-(squarish)-Matrix model?
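To make the H2OMatrix-vs-Matrix question concrete, here is a minimal sketch of the adapter idea under discussion. All names here (SimpleMatrix, DenseSimpleMatrix, rowSum) are hypothetical and invented for illustration - they are not the actual Mahout Matrix or H2OMatrix APIs - but they show the shape of the blend: algorithms written against one small matrix interface, with the backing store (local dense memory, or an H2O-style distributed frame) swappable underneath.

```java
// Hypothetical, simplified sketch -- NOT the real Mahout Matrix or H2OMatrix API.
// Illustrates the "marriage": one matrix interface, multiple backing stores.
interface SimpleMatrix {
    int rowSize();
    int columnSize();
    double get(int row, int col);
    void set(int row, int col, double value);
}

// A plain in-memory backing, standing in for a Mahout-style DenseMatrix.
// An H2O-style backing would implement the same interface over distributed,
// column-compressed "tall skinny" data.
class DenseSimpleMatrix implements SimpleMatrix {
    private final double[][] data;
    DenseSimpleMatrix(int rows, int cols) { data = new double[rows][cols]; }
    public int rowSize() { return data.length; }
    public int columnSize() { return data[0].length; }
    public double get(int r, int c) { return data[r][c]; }
    public void set(int r, int c, double v) { data[r][c] = v; }
}

public class MatrixBlendSketch {
    // An algorithm written only against the interface: it cannot tell
    // whether the rows live in local memory or in a distributed store.
    static double rowSum(SimpleMatrix m, int row) {
        double sum = 0;
        for (int c = 0; c < m.columnSize(); c++) sum += m.get(row, c);
        return sum;
    }

    public static void main(String[] args) {
        SimpleMatrix m = new DenseSimpleMatrix(2, 3);
        for (int c = 0; c < 3; c++) m.set(0, c, c + 1);  // row 0 = [1, 2, 3]
        System.out.println(rowSum(m, 0));  // prints 6.0
    }
}
```

The cost question above is exactly about how leaky this abstraction gets once the backing store is distributed and cell-at-a-time access is no longer cheap.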

Right now we're working on cleaning up the H2O internal DSL so it better supports Spark/Scala and/or Dmitriy's DSL - plus our commitment to running R. I'm hoping Mahout volunteers will peek at it

https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java

and comment before we go further down this path.

If - ultimately - Mahout decides to drop the current Matrix API for something more bulk/scale/parallel-friendly - we're happy to go along with that as well.

Cliff


On 5/1/2014 9:01 AM, Pat Ferrel wrote:
Odd that the KMeans implementation isn't a way to demonstrate performance.
Seems like anyone could grab it and try it with the same data on MLlib and
perform a principled analysis. Or just run the same data through H2O and MLlib.
This seems like a good way to look at the forest instead of the trees.

BTW any generalization effort to support two execution engines will have to
abstract away the SparkContext. This is where IO, job control, and engine
tuning happen. Abstracting the DSL is not sufficient. Any hypothetical
MahoutContext (a good idea for sure), if it deviates significantly from a
SparkContext, will have broad impact.

http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
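
A rough sketch of what such an engine abstraction might look like. Every name here (EngineContext, LocalContext, the method signatures) is hypothetical and invented for illustration - this is not an existing Mahout or Spark API. The point it tries to show is the one above: IO, job control, and engine tuning must all sit behind the context interface, not just the DSL's matrix operations, or driver code stays coupled to one engine.

```java
// Hypothetical sketch of an engine-abstraction layer; none of these names
// are a real Mahout or Spark API. The point: IO, job control, and tuning
// knobs live behind the context, not just the matrix DSL.
import java.util.HashMap;
import java.util.Map;

interface EngineContext {
    void setTuning(String key, String value);   // engine tuning (parallelism, memory, ...)
    String readPath(String path);               // IO entry point (stubbed here)
    void shutdown();                            // job control
}

// A trivial in-process implementation, standing in for a Spark- or H2O-backed one.
class LocalContext implements EngineContext {
    private final Map<String, String> tuning = new HashMap<>();
    private boolean running = true;
    public void setTuning(String key, String value) { tuning.put(key, value); }
    public String readPath(String path) { return "contents-of:" + path; }
    public void shutdown() { running = false; }
    boolean isRunning() { return running; }
    String getTuning(String key) { return tuning.get(key); }
}

public class ContextSketch {
    // Driver code depends only on EngineContext, so swapping the backing
    // engine does not ripple through the algorithm layer.
    static String loadAndDescribe(EngineContext ctx, String path) {
        ctx.setTuning("parallelism", "8");
        return ctx.readPath(path);
    }

    public static void main(String[] args) {
        LocalContext ctx = new LocalContext();
        System.out.println(loadAndDescribe(ctx, "data.csv"));  // prints contents-of:data.csv
        ctx.shutdown();
    }
}
```

The "broad impact" worry above is that if the real SparkContext's semantics (lazy RDDs, broadcast variables, accumulators) leak through such an interface, every other engine must either emulate them or break the abstraction.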


