Naw - the KMeans was strictly a "sample" attempt to blend the H2O &
Mahout coding.
Goal really is to get feedback from this group on how well that attempt
is working.
Is there a better API? What is it?
What can be improved?
How clumsy is the current marriage of H2OMatrix vs Matrix?
What's the mental cost of H2O's "tall skinny data" vs Mahout's
All-The-Worlds-A-(squarish)-Matrix model?
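To make that last question concrete, here is a toy sketch of the mismatch. The class names and layout are hypothetical, not the actual H2O or Mahout APIs: H2O frames are "tall skinny" - many rows, few columns, partitioned into row chunks - while a squarish-Matrix API assumes cheap random access to any (row, col) cell.

```java
import java.util.ArrayList;
import java.util.List;

public class TallSkinnyDemo {
    // Toy row-chunked store, loosely modeled on H2O's tall-skinny layout.
    static class ChunkedFrame {
        final List<double[][]> chunks = new ArrayList<>(); // each chunk: rows x cols
        final int cols;
        ChunkedFrame(int cols) { this.cols = cols; }
        void addChunk(double[][] rows) { chunks.add(rows); }

        // Column-wise passes over whole chunks are the natural, cheap
        // access pattern for this layout.
        double columnMean(int col) {
            double sum = 0;
            long n = 0;
            for (double[][] chunk : chunks)
                for (double[] row : chunk) { sum += row[col]; n++; }
            return sum / n;
        }

        // Random cell access - what a squarish Matrix API hands out freely -
        // must first walk the chunk list to find the owning chunk. That
        // lookup is the "mental cost" in miniature.
        double get(long row, int col) {
            for (double[][] chunk : chunks) {
                if (row < chunk.length) return chunk[(int) row][col];
                row -= chunk.length;
            }
            throw new IndexOutOfBoundsException("row " + row);
        }
    }

    public static void main(String[] args) {
        ChunkedFrame f = new ChunkedFrame(2);
        f.addChunk(new double[][] {{1, 10}, {2, 20}});
        f.addChunk(new double[][] {{3, 30}});
        System.out.println(f.columnMean(0)); // prints 2.0
        System.out.println(f.get(2, 1));     // prints 30.0
    }
}
```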
Right now we're working on cleaning up the H2O internal DSL to make it
better support Spark/Scala and/or Dmitriy's DSL, plus our
commitment to running R. I'm hoping Mahout volunteers will peek at it
https://github.com/tdunning/h2o-matrix/blob/master/src/main/java/ai/h2o/algo/KMeans.java
and comment before we go further down this path.
If - ultimately - Mahout decides to drop the current Matrix API for
something more bulk/scale/parallel friendly - we're happy to go along
with that also.
Cliff
On 5/1/2014 9:01 AM, Pat Ferrel wrote:
Odd that the KMeans implementation isn’t being used as a way to demonstrate performance.
Seems like anyone could grab that and try it with the same data on MLlib and
perform a principled analysis. Or just run the same data through h2o and MLlib.
This seems like a good way to look at the forest instead of the trees.
BTW any generalization effort to support two execution engines will have to
abstract away the SparkContext. This is where IO, job control, and engine
tuning happens. Abstracting the DSL is not sufficient. Any hypothetical
MahoutContext (a good idea for sure), if it deviates significantly from
SparkContext, will have broad impact.
http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkContext
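A minimal sketch of the point above - entirely hypothetical, since no MahoutContext interface exists today: if the engine context abstracts not just the DSL but also IO, job control, and engine tuning, then SparkContext becomes one implementation among several. The names below are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

public class MahoutContextSketch {
    // Hypothetical engine-agnostic context. A real version would also
    // cover IO (reading/writing distributed matrices) and job control,
    // the concerns Pat notes live on SparkContext today.
    interface MahoutContext {
        void setEngineOption(String key, String value); // engine tuning
        String engineName();                            // which backend runs
    }

    // Toy in-memory stand-in; a Spark-backed implementation would wrap
    // a SparkContext behind the same interface.
    static class LocalContext implements MahoutContext {
        final Map<String, String> opts = new HashMap<>();
        public void setEngineOption(String key, String value) {
            opts.put(key, value);
        }
        public String engineName() { return "local"; }
    }

    public static void main(String[] args) {
        MahoutContext ctx = new LocalContext();
        ctx.setEngineOption("executor.memory", "4g"); // tuning flows through
        System.out.println(ctx.engineName());         // prints local
    }
}
```

The design choice being debated is exactly where this interface boundary sits: too close to SparkContext and other engines can't implement it; too far and Spark users lose tuning knobs.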