Greetings, 

I worked with the theory of SVMs during my graduate studies and I’m relatively 
new to existing ML software. Assuming that I want to create new scalable ML 
algorithms starting from the math, my question is: how do scikit-learn, Mahout 
Samsara and SystemML compare to each other?

I see interesting Python-based frameworks such as scikit-learn, but then I read 
SystemML's article on Wikipedia, which made me question how well ("pure") 
Python scales in a distributed setting for large amounts of data:

"[...] It was observed that data scientists would write machine learning 
algorithms in languages such as R and Python for small data. When it came time 
to scale to big data, a systems programmer would be needed to scale the 
algorithm in a language such as Scala. This process typically involved days or 
weeks per iteration, and errors would occur translating the algorithms to 
operate on big data." (https://en.wikipedia.org/wiki/Apache_SystemML)

The article also opens by stating that Apache SystemML has "algorithm 
customizability via [...] Python-like languages".

Mahout Samsara is based on Scala. PredictionIO 
(predictionio.incubator.apache.org) algorithms are based on Mahout Samsara and 
Scala. I asked Mr. Matthias Boehm at a conference how one could compare Mahout 
Samsara to SystemML. From what I understood, there are two differences. First, 
Samsara needs "explicit declarations" in expressions for distributed 
computing, while SystemML doesn’t. Second, SystemML optimizes the entire 
script, while Samsara optimizes individual expressions. Please correct me if 
I’m wrong on either point.

While my main criterion is scalability (cluster, GPU support, etc.), other 
criteria for evaluating these frameworks may be: a) public adoption, b) active 
dev community, c) quality of development tools, d) backing of big companies, 
e) simplicity of working with clusters (delegating the complexities of cluster 
management to the framework, "hiding" them from the user), f) quality of 
documentation, g) quality of the software itself.

(My question was deleted from stats.stackexchange.com for being off-topic and 
deleted from Stack Overflow for being bound to get answers with "opinions 
rather than facts" [sic]. I’m very much interested in hearing balanced and 
insightful comments from the list.)

Thank you,

Gustavo
