On a related note, I recently heard about Distributed R <https://github.com/vertica/DistributedR>, which is coming out of HP/Vertica and seems to be their proposition for machine learning at scale.
It would be interesting to see some kind of comparison between that and MLlib (and perhaps also SparkR <https://github.com/amplab-extras/SparkR-pkg>?), especially since Distributed R has a concept of distributed arrays and works on data in-memory. Docs are here: <https://github.com/vertica/DistributedR/tree/master/doc/platform>

Nick

On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin <r...@databricks.com> wrote:

> They only compared their own implementations of a couple of algorithms on
> different platforms rather than comparing the different platforms
> themselves (in the case of Spark -- PySpark). I can write two variants of
> an algorithm on Spark and make them perform drastically differently.
>
> I have no doubt that if you implement an ML algorithm in Python itself
> without any native libraries, the performance will be sub-optimal.
>
> What PySpark really provides is:
>
> - Using Spark transformations in Python
> - ML algorithms implemented in Scala (leveraging native numerical libraries
>   for high performance), and callable in Python
>
> The paper claims "Python is now one of the most popular languages for
> ML-oriented programming", and that's why they went ahead with Python.
> However, as I understand, very few people actually implement algorithms in
> Python directly because of the sub-optimal performance. Most people
> implement algorithms in other languages (e.g. C / Java), and expose APIs in
> Python for ease-of-use. This is what we are trying to do with PySpark as
> well.
>
>
> On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas <ignacio.zendejas...@gmail.com> wrote:
>
> > Has anyone had a chance to look at this paper (with title in subject)?
> > http://www.cs.rice.edu/~lp6/comparison.pdf
> >
> > Interesting that they chose to use Python alone. Do we know how much faster
> > Scala is vs. Python in general, if at all?
> >
> > As with any and all benchmarks, I'm sure there are caveats, but it'd be
> > nice to have a response to the question above for starters.
> >
> > Thanks,
> > Ignacio