They only compared their own implementations of a couple of algorithms on different platforms, rather than comparing the platforms themselves (in the case of Spark, PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently.
I have no doubt that if you implement an ML algorithm in pure Python, without any native libraries, the performance will be sub-optimal. What PySpark really provides is:

- Using Spark transformations from Python
- ML algorithms implemented in Scala (leveraging native numerical libraries for high performance), callable from Python

The paper claims "Python is now one of the most popular languages for ML-oriented programming", and that's why they went ahead with Python. However, as I understand it, very few people actually implement algorithms directly in Python because of the sub-optimal performance. Most people implement algorithms in other languages (e.g., C / Java) and expose APIs in Python for ease of use. This is what we are trying to do with PySpark as well (a minimal sketch follows below the quoted message).

On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas <
ignacio.zendejas...@gmail.com> wrote:

> Has anyone had a chance to look at this paper (with title in subject)?
> http://www.cs.rice.edu/~lp6/comparison.pdf
>
> Interesting that they chose to use Python alone. Do we know how much faster
> Scala is vs. Python in general, if at all?
>
> As with any and all benchmarks, I'm sure there are caveats, but it'd be
> nice to have a response to the question above for starters.
>
> Thanks,
> Ignacio
>
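For reference, a minimal sketch of that division of labor, assuming the RDD-based MLlib API of this era (the app name and toy data are made up for illustration): the transformations are driven from Python, while the train() call dispatches to the Scala implementation on the JVM.

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="mllib-from-python")  # hypothetical app name

    # Spark transformations written in Python: build a toy training set.
    points = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
    ])

    # train() is a thin Python wrapper; the iterative optimization itself
    # runs in Scala on the JVM, not in the Python interpreter.
    model = LogisticRegressionWithSGD.train(points, iterations=10)
    print(model.predict([1.0, 0.0]))

The point being that only data preparation touches Python here; the numerical work never runs in the Python interpreter, which is exactly what a pure-Python benchmark would miss.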