On Tue, Nov 12, 2013 at 4:02 PM, Manuel Blechschmidt <[email protected]> wrote:
> It would be nice if Cloudera could publish some benchmarks. Cloudera vs.
> Mahout vs. SAP HANA PAL vs. SPSS to give somebody the chances to enhance
> Mahout in a way that it can catch up.

Does this need to be a "versus" thing? I and other engineers here did a fair bit of work to keep the Mahout code working in CDH5 / Hadoop 2.2, and contributed that back. For a company apparently trying to undermine Mahout, we're not very good at it...

I like the benchmark sentiment. The two projects actually have little overlap in functionality, which is the essence of the reason why it's a different project. Oryx has nothing but RDF, kmeans++, and ALS. No visualization, no text processing tools. No library-like interfaces. On the other hand, the piece of the puzzle Oryx is trying to add (model serving) has no counterpart in this project, with the possible exception of Taste. So there's not much to compare with a benchmark.

In-memory pretty well always beats Hadoop. I can tell you that I think the ALS in Mahout is *faster*, I'm pretty sure mostly for loading a bunch into memory. But the in-memory ALS in Oryx of course is faster by an order of magnitude than both. How do you want to benchmark that?

I have never used SPSS or HANA's offering here, but am willing to bet it's wicked fast without even bothering to measure. I'm not even sure speed is the only or main point? Things like usability out of the box top my list. And being open source and working with data in HDFS.
