Yeah, I worked on Distributed R while I was an intern at HP Labs, but it has evolved a lot since then. I don't think it's a direct comparison, as Distributed R is a pure R implementation in a distributed setting, while SparkR is a wrapper around the Scala / Java implementations in Spark.
That said, it would be an interesting exercise to compare them, and I hope to do it at some point.

Shivaram

On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin <r...@databricks.com> wrote:
> Actually I believe the same person started both projects.
>
> The Distributed R project from HP was started by Shivaram Venkataraman when
> he was there. He has since moved to the Berkeley AMPLab to pursue a PhD, and
> SparkR is his latest project.
>
> On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > On a related note, I recently heard about Distributed R
> > <https://github.com/vertica/DistributedR>, which is coming out of
> > HP/Vertica and seems to be their proposition for machine learning at
> > scale.
> >
> > It would be interesting to see some kind of comparison between that and
> > MLlib (and perhaps also SparkR
> > <https://github.com/amplab-extras/SparkR-pkg>?), especially since
> > Distributed R has a concept of distributed arrays and works on data
> > in-memory. Docs are here:
> > <https://github.com/vertica/DistributedR/tree/master/doc/platform>
> >
> > Nick
> >
> > On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin <r...@databricks.com> wrote:
> > > They only compared their own implementations of a couple of algorithms
> > > on different platforms rather than comparing the different platforms
> > > themselves (in the case of Spark -- PySpark). I can write two variants
> > > of an algorithm on Spark and make them perform drastically differently.
> > >
> > > I have no doubt that if you implement an ML algorithm in Python itself
> > > without any native libraries, the performance will be sub-optimal.
> > >
> > > What PySpark really provides is:
> > >
> > > - Using Spark transformations in Python
> > > - ML algorithms implemented in Scala (leveraging native numerical
> > >   libraries for high performance), and callable from Python
> > >
> > > The paper claims "Python is now one of the most popular languages for
> > > ML-oriented programming", and that's why they went ahead with Python.
> > > However, as I understand it, very few people actually implement
> > > algorithms in Python directly because of the sub-optimal performance.
> > > Most people implement algorithms in other languages (e.g. C / Java) and
> > > expose APIs in Python for ease of use. This is what we are trying to do
> > > with PySpark as well.
> > >
> > > On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas
> > > <ignacio.zendejas...@gmail.com> wrote:
> > > > Has anyone had a chance to look at this paper (with title in subject)?
> > > > http://www.cs.rice.edu/~lp6/comparison.pdf
> > > >
> > > > Interesting that they chose to use Python alone. Do we know how much
> > > > faster Scala is vs. Python in general, if at all?
> > > >
> > > > As with any and all benchmarks, I'm sure there are caveats, but it'd
> > > > be nice to have a response to the question above for starters.
> > > >
> > > > Thanks,
> > > > Ignacio
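[Editor's note: the pattern described above — implement the hot path in a compiled language and expose a thin Python API — can be illustrated without a Spark installation using plain CPython, where the C-implemented built-in `sum` stands in for the "native" implementation and a hand-written loop stands in for a pure-Python algorithm. This is a minimal sketch of the general trade-off, not Spark code; exact timings will vary by machine.]

```python
import timeit

def py_sum(xs):
    """Pure-Python loop: every iteration runs through the interpreter,
    analogous to implementing an ML algorithm directly in Python."""
    total = 0.0
    for x in xs:
        total += x
    return total

data = list(range(100_000))

# Both compute the same result...
assert py_sum(data) == float(sum(data))

# ...but the C-implemented built-in is typically several times faster,
# which is the same reason PySpark exposes Scala/native implementations
# rather than reimplementing algorithms in Python.
t_py = timeit.timeit(lambda: py_sum(data), number=20)
t_c = timeit.timeit(lambda: sum(data), number=20)
print(f"pure Python loop: {t_py:.3f}s, C built-in: {t_c:.3f}s")
```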