They only compared their own implementations of a couple of algorithms on
different platforms, rather than comparing the platforms themselves (in
the case of Spark, PySpark). I could write two variants of the same
algorithm on Spark and make them perform drastically differently.
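
For example (a contrived sketch of my own, not from the paper; the input
path is a placeholder), these two PySpark word counts compute the same
result but shuffle very different amounts of data:

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-variants")
words = sc.textFile("hdfs:///some/input.txt").flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))

# Variant 1: groupByKey ships every (word, 1) pair across the network
# before summing, which is slow and memory-hungry on skewed data.
counts_slow = pairs.groupByKey().mapValues(sum)

# Variant 2: reduceByKey pre-aggregates counts within each partition
# before the shuffle, so far less data moves for the same answer.
counts_fast = pairs.reduceByKey(lambda a, b: a + b)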

I have no doubt that if you implement an ML algorithm in pure Python,
without any native libraries, the performance will be suboptimal.

What PySpark really provides is:

- Using Spark transformations in Python
- ML algorithms implemented in Scala (leveraging native numerical libraries
for high performance), and callable from Python (both sketched below)
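
To make both points concrete, here is a rough sketch (the file path and
data format are placeholders): the parsing code runs in Python, while
LogisticRegressionWithSGD.train hands the actual optimization off to the
Scala implementation in MLlib:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="pyspark-mllib-sketch")

# Spark transformations written in Python: parse "label,f1 f2 f3" lines.
def parse(line):
    label, features = line.split(",")
    return LabeledPoint(float(label), [float(x) for x in features.split()])

points = sc.textFile("hdfs:///some/training.txt").map(parse)

# The training loop itself runs on the JVM in Scala; Python only invokes it.
model = LogisticRegressionWithSGD.train(points, iterations=100)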

The paper claims "Python is now one of the most popular languages for
ML-oriented programming", and that's why they went ahead with Python.
However, as I understand it, very few people actually implement algorithms
directly in Python because of the suboptimal performance. Most people
implement algorithms in other languages (e.g., C or Java) and expose
Python APIs for ease of use. This is what we are trying to do with PySpark
as well.


On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas <
ignacio.zendejas...@gmail.com> wrote:

> Has anyone had a chance to look at this paper (with title in subject)?
> http://www.cs.rice.edu/~lp6/comparison.pdf
>
> Interesting that they chose to use Python alone. Do we know how much faster
> Scala is vs. Python in general, if at all?
>
> As with any and all benchmarks, I'm sure there are caveats, but it'd be
> nice to have a response to the question above for starters.
>
> Thanks,
> Ignacio
>
