Naw - the KMeans was strictly a "sample" attempt to blend the H2O &
Mahout coding.
Goal really is to get feedback from this group on how well that attempt
is working.
Is there a better API? What is it?
What can be improved?
How clumsy is the current marriage of H2OMatrix vs Matrix?
What's the mental cost of H2O's "tall skinny data" vs Mahout's
All-The-Worlds-A-(squarish)-Matrix model?
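To make that last question concrete, here is a toy sketch of the mismatch. The class names and layout are hypothetical, not the actual H2O or Mahout APIs: H2O frames are "tall skinny" - many rows, few columns, partitioned into row chunks - while a squarish-Matrix API assumes cheap random access to any (row, col) cell.

```java
import java.util.ArrayList;
import java.util.List;

public class TallSkinnyDemo {
    // Toy row-chunked store, loosely modeled on H2O's tall-skinny layout.
    static class ChunkedFrame {
        final List<double[][]> chunks = new ArrayList<>(); // each chunk: rows x cols
        final int cols;
        ChunkedFrame(int cols) { this.cols = cols; }
        void addChunk(double[][] rows) { chunks.add(rows); }

        // Column-wise passes over whole chunks are the natural, cheap
        // access pattern for this layout.
        double columnMean(int col) {
            double sum = 0;
            long n = 0;
            for (double[][] chunk : chunks)
                for (double[] row : chunk) { sum += row[col]; n++; }
            return sum / n;
        }

        // Random cell access - what a squarish Matrix API hands out freely -
        // must first walk the chunk list to find the owning chunk. That
        // lookup is the "mental cost" in miniature.
        double get(long row, int col) {
            for (double[][] chunk : chunks) {
                if (row < chunk.length) return chunk[(int) row][col];
                row -= chunk.length;
            }
            throw new IndexOutOfBoundsException("row " + row);
        }
    }

    public static void main(String[] args) {
        ChunkedFrame f = new ChunkedFrame(2);
        f.addChunk(new double[][] {{1, 10}, {2, 20}});
        f.addChunk(new double[][] {{3, 30}});
        System.out.println(f.columnMean(0)); // prints 2.0
        System.out.println(f.get(2, 1));     // prints 30.0
    }
}
```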
Right now we're working on cleaning up the H2O internal DSL to make it
better support Spark/Scala and/or Dmitriy's DSL, plus our
commitment to running R. I'm hoping Mahout volunteers will peek at it
https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java
and comment before we go further down this path.
If - ultimately - Mahout decides to drop the current Matrix API for
something more bulk/scale/parallel friendly - we're happy to go along
with that also.
Cliff
On 5/1/2014 9:01 AM, Pat Ferrel wrote:
Odd that the KMeans implementation isn’t being used as a way to demonstrate performance.
Seems like anyone could grab that and try it with the same data on MLlib and
perform a principled analysis. Or just run the same data through h2o and MLlib.
This seems like a good way to look at the forest instead of the trees.
BTW any generalization effort to support two execution engines will have to
abstract away the SparkContext. This is where IO, job control, and engine
tuning happens. Abstracting the DSL is not sufficient. Any hypothetical
MahoutContext (a good idea for sure), if it deviates significantly from
SparkContext, will have broad impact.
http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
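A minimal sketch of the point above - entirely hypothetical, since no MahoutContext interface exists today: if the engine context abstracts not just the DSL but also IO, job control, and engine tuning, then SparkContext becomes one implementation among several. The names below are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

public class MahoutContextSketch {
    // Hypothetical engine-agnostic context. A real version would also
    // cover IO (reading/writing distributed matrices) and job control,
    // the concerns Pat notes live on SparkContext today.
    interface MahoutContext {
        void setEngineOption(String key, String value); // engine tuning
        String engineName();                            // which backend runs
    }

    // Toy in-memory stand-in; a Spark-backed implementation would wrap
    // a SparkContext behind the same interface.
    static class LocalContext implements MahoutContext {
        final Map<String, String> opts = new HashMap<>();
        public void setEngineOption(String key, String value) {
            opts.put(key, value);
        }
        public String engineName() { return "local"; }
    }

    public static void main(String[] args) {
        MahoutContext ctx = new LocalContext();
        ctx.setEngineOption("executor.memory", "4g"); // tuning flows through
        System.out.println(ctx.engineName());         // prints local
    }
}
```

The design choice being debated is exactly where this interface boundary sits: too close to SparkContext and other engines can't implement it; too far and Spark users lose tuning knobs.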