OK, then how about H2O KMeans vs. MLlib?

On May 1, 2014, at 9:20 AM, Cliff Click <[email protected]> wrote:

Naw - the KMeans was strictly a "sample" attempt to blend the H2O & Mahout 
coding.

The goal, really, is to get feedback from this group on how well that attempt 
is working.
Is there a better API?  What is it?
What can be improved?
How clumsy is the current marriage of H2OMatrix and Matrix?
What's the mental cost of H2O's "tall skinny data" vs Mahout's 
All-The-Worlds-A-(squarish)-Matrix model?
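
To make that concrete, here's a minimal sketch (not the actual h2o-matrix 
code) of the style the blend implies: algorithm code written only against 
Mahout's Matrix/Vector interfaces, which an H2OMatrix could then back with 
distributed, tall-skinny data. The per-row viewRow() pattern below is natural 
for an in-core DenseMatrix, and it's exactly the access pattern a chunked, 
distributed backing store has to work hardest to support - which is where the 
mental cost shows up.

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// Sketch only: nearest-centroid assignment written purely against the
// Mahout Matrix/Vector API.  Whether `data` is an in-core DenseMatrix or a
// (hypothetical) H2OMatrix over a distributed frame is invisible here.
public class NearestCentroid {
  /** Returns, for each row of data, the index of the closest centroid row. */
  public static int[] assign(Matrix data, Matrix centroids) {
    int[] owner = new int[data.numRows()];
    for (int r = 0; r < data.numRows(); r++) {
      Vector row = data.viewRow(r);
      double best = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.numRows(); c++) {
        double d = row.getDistanceSquared(centroids.viewRow(c));
        if (d < best) { best = d; owner[r] = c; }
      }
    }
    return owner;
  }
}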

Right now we're working on cleaning up the H2O internal DSL so it better 
supports Spark/Scala and/or Dmitriy's DSL, plus our commitment to running R.  
I'm hoping Mahout volunteers will peek at it

 
https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java

and comment before we go further down this path.

If, ultimately, Mahout decides to drop the current Matrix API for something 
more bulk/scale/parallel-friendly, we're happy to go along with that as well.
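
For the sake of discussion, "bulk/scale/parallel friendly" might mean 
something like the following - purely a hypothetical sketch, not a proposal 
or an existing interface - where work is expressed as whole-row operations 
the engine can distribute, instead of per-element get()/set() calls:

import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// Hypothetical only: names and shapes invented for illustration.
public interface BulkMatrix {
  /** Apply f to every row; the engine decides where and how
      (chunk-at-a-time, node-local, etc.). */
  BulkMatrix mapRows(UnaryOperator<Vector> f);

  /** Summarize each row and combine summaries - the engine can tree-reduce. */
  <T> T aggregateRows(Function<Vector, T> perRow, BinaryOperator<T> combine);

  /** Only small results (e.g. a k x d centroid matrix) come back in-core. */
  Matrix collectSmall();
}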

Cliff


On 5/1/2014 9:01 AM, Pat Ferrel wrote:
> Odd that the KMeans implementation isn't a way to demonstrate performance. 
> Seems like anyone could grab it, try the same data on MLlib, and perform a 
> principled analysis. Or just run the same data through H2O and MLlib. This 
> seems like a good way to look at the forest instead of the trees.
> 
> BTW any generalization effort to support two execution engines will have to 
> abstract away the SparkContext. This is where IO, job control, and engine 
> tuning happen. Abstracting the DSL is not sufficient. Any hypothetical 
> MahoutContext (a good idea for sure), if it deviates significantly from a 
> SparkContext, will have broad impact.
> 
> http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
> 
> 
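
Picking up Pat's MahoutContext point above: purely as a hypothetical sketch 
(none of these names exist in Mahout, Spark, or H2O today), an 
engine-abstracting context would have to cover the IO, job-control, and 
tuning surface that SparkContext owns, not just the DSL operators. Something 
along these lines:

import org.apache.mahout.math.Matrix;

// Hypothetical illustration only - not an existing Mahout or H2O API.
// The point: IO, job control, and engine tuning all need a home that is
// not the DSL itself.
public interface MahoutContext extends AutoCloseable {
  /** Load a (possibly distributed) matrix from engine-managed storage. */
  Matrix read(String path);

  /** Persist a matrix to engine-managed storage. */
  void write(Matrix m, String path);

  /** Engine tuning knobs that SparkContext/SparkConf normally owns; an H2O
      backend would map these onto its own cloud configuration. */
  void setOption(String key, String value);

  /** Release engine resources (e.g. SparkContext.stop(), H2O shutdown). */
  @Override
  void close();
}

A Spark-backed implementation would wrap a SparkContext and an H2O-backed one 
would wrap the H2O cloud, with algorithms handed a MahoutContext rather than 
a SparkContext directly.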

