Yeah, I worked on Distributed R while I was an intern at HP Labs, but it has evolved a lot since then. I don't think it's a direct comparison, as Distributed R is a pure R implementation in a distributed setting, while SparkR is a wrapper around the Scala / Java implementations in Spark.
That said, it would be an interesting exercise to compare them, and I hope to do it at some point.

Shivaram

On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin <r...@databricks.com> wrote:
> Actually I believe the same person started both projects.
>
> The Distributed R project from HP was started by Shivaram Venkataraman when
> he was there. He has since moved to the Berkeley AMPLab to pursue a PhD, and
> SparkR is his latest project.
>
> On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > On a related note, I recently heard about Distributed R
> > <https://github.com/vertica/DistributedR>, which is coming out of
> > HP/Vertica and seems to be their proposition for machine learning at
> > scale.
> >
> > It would be interesting to see some kind of comparison between that and
> > MLlib (and perhaps also SparkR
> > <https://github.com/amplab-extras/SparkR-pkg>?), especially since
> > Distributed R has a concept of distributed arrays and works on data
> > in-memory. Docs are here:
> > <https://github.com/vertica/DistributedR/tree/master/doc/platform>
> >
> > Nick
> >
> > On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin <r...@databricks.com> wrote:
> > > They only compared their own implementations of a couple of algorithms
> > > on different platforms rather than comparing the different platforms
> > > themselves (in the case of Spark -- PySpark). I can write two variants
> > > of an algorithm on Spark and make them perform drastically differently.
> > >
> > > I have no doubt that if you implement an ML algorithm in Python itself
> > > without any native libraries, the performance will be sub-optimal.
> > >
> > > What PySpark really provides is:
> > >
> > > - Using Spark transformations in Python
> > > - ML algorithms implemented in Scala (leveraging native numerical
> > >   libraries for high performance), and callable from Python
> > >
> > > The paper claims "Python is now one of the most popular languages for
> > > ML-oriented programming", and that's why they went ahead with Python.
> > > However, as I understand it, very few people actually implement
> > > algorithms in Python directly because of the sub-optimal performance.
> > > Most people implement algorithms in other languages (e.g. C / Java) and
> > > expose APIs in Python for ease of use. This is what we are trying to do
> > > with PySpark as well.
> > >
> > > On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas
> > > <ignacio.zendejas...@gmail.com> wrote:
> > > > Has anyone had a chance to look at this paper (with title in subject)?
> > > > http://www.cs.rice.edu/~lp6/comparison.pdf
> > > >
> > > > Interesting that they chose to use Python alone. Do we know how much
> > > > faster Scala is vs. Python in general, if at all?
> > > >
> > > > As with any and all benchmarks, I'm sure there are caveats, but it'd
> > > > be nice to have a response to the question above for starters.
> > > >
> > > > Thanks,
> > > > Ignacio
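[Editor's note: the pattern described above — implement the hot path in a compiled language and expose a thin Python API — can be illustrated without a Spark installation using plain CPython, where the C-implemented built-in `sum` stands in for the "native" implementation and a hand-written loop stands in for a pure-Python algorithm. This is a minimal sketch of the general trade-off, not Spark code; exact timings will vary by machine.]

```python
import timeit

def py_sum(xs):
    """Pure-Python loop: every iteration runs through the interpreter,
    analogous to implementing an ML algorithm directly in Python."""
    total = 0.0
    for x in xs:
        total += x
    return total

data = list(range(100_000))

# Both compute the same result...
assert py_sum(data) == float(sum(data))

# ...but the C-implemented built-in is typically several times faster,
# which is the same reason PySpark exposes Scala/native implementations
# rather than reimplementing algorithms in Python.
t_py = timeit.timeit(lambda: py_sum(data), number=20)
t_c = timeit.timeit(lambda: sum(data), number=20)
print(f"pure Python loop: {t_py:.3f}s, C built-in: {t_c:.3f}s")
```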