Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-09-20 Thread Seraph
I'm also one of the authors of this paper, and I was responsible for the Spark
experiments. Thank you all for the discussion!

(1)

Ignacio Zendejas wrote
 I should rephrase my question as it was poorly phrased: on average, how 
 much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? 
 I've only used Spark and don't have a chance to test this at the moment so 
 if anybody has these numbers or general estimates (10x, etc), that'd be 
 great. 


Davies Liu wrote
 A quick comparison by word count on 4.3G text file (local mode), 
 
 Spark:  40 seconds 
 PySpark: 2 minutes and 16 seconds 
 
 So PySpark is 3.4x slower than Spark. 

From my perspective, it is difficult to compare the speed of Scala vs. Python,
or Spark vs. PySpark. Simple examples may not be enough to draw a conclusion on
such a comparison; we may need more complex models to get a more comprehensive
picture. It is possible that Spark is faster in some applications but slower
than PySpark in others, so the speed question is application specific. That is
also one of the purposes of our paper: to shed some light on such benchmarks
for platform/performance comparison.

(2)

Matei Zaharia wrote
 Just as a note on this paper, apart from implementing the algorithms in
 naive Python, they also run it in a fairly inefficient way. In particular
 their implementations send the model out with every task closure, which is
 really expensive for a large model, and bring it back with collectAsMap().
 It would be much more efficient to send it e.g. with
 SparkContext.broadcast() or keep it distributed on the cluster throughout
 the computation, instead of making the driver node a bottleneck for
 communication. 

We tried our best to write several implementations of each model, in order to
pick the optimal one. Some approaches seemed promising but failed in our
experiments. Broadcast() is a good idea, and we can try it with our models to
see whether it makes much difference. But as cjermaine said, broadcasting the
model should not be a bottleneck, because the models are small in all of our
experiments. Also, we may change the model parameters in each iteration, so a
one-time broadcast may not help as much as expected. Moreover, we are really
careful when we use collect() or collectAsMap(): in our experiments we do not
collect large sets of data with collectAsMap(), and it does not consume much
time either.
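
For anyone who wants to see the two patterns side by side, here is a minimal
PySpark sketch (the model and job are made up for illustration, not taken from
our paper):

# Hypothetical sketch: shipping a model in the task closure vs. broadcasting it.
from pyspark import SparkContext

sc = SparkContext("local[*]", "closure-vs-broadcast")
data = sc.parallelize(range(1000))
model = {i: float(i) for i in range(100)}  # stand-in for a small parameter table

# Pattern 1: the model is captured in the closure and re-serialized with every task.
score_closure = data.map(lambda x: model.get(x % 100, 0.0) * x).sum()

# Pattern 2: the model is broadcast once and read via .value on the executors.
b_model = sc.broadcast(model)
score_broadcast = data.map(lambda x: b_model.value.get(x % 100, 0.0) * x).sum()

# If the parameters change in every iteration, the broadcast has to be re-issued
# each time, which is why a one-time broadcast may not help as much as hoped.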

Overall, I have no doubt that Spark developers can write more efficient code
for our models, and we would very much welcome better implementations of our
experiments from the Spark community.

Thanks!







Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-09-07 Thread cjermaine

I'm one of the authors of this paper, and I just came across this thread.
I'm glad that Ignacio Zendejas noticed our paper!

First off, let me post a link to the published version of the paper, which is
likely slightly different from the version linked above:

http://cmj4.web.rice.edu/performance.pdf

Next, I just want to quickly address a couple of comments made here.

rxin says:

 They only compared their own implementations of couple algorithms on 
 different platforms rather than comparing the different platforms 
 themselves (in the case of Spark -- PySpark). I can write two variants of 
 an algorithm on Spark and make them perform drastically differently. 

It's a bit misleading to say that we just tried a “couple” of algorithms;
the paper describes five different algorithms, along with multiple
implementations of each; we tried a LOT of variants of each algorithm, as we
detail in the paper.

Also, it is true that we did use our own implementations; the point was to
compare each platform as a programming and execution platform. The paper is
clear that the benchmark was directed towards “a user who wants to run a
specific ML inference algorithm over a large data set, but cannot find an
existing implementation and thus must 'roll her own' ML code.” We
specifically state that we are not interested in comparing canned libraries,
which is a very different task. Both ease of use and speed were considered
equally important. If you read the paper, you'll see that we generally gave
PySpark high marks as a programming platform.

Several of the posts here imply that all Spark experiments were using
PySpark. This is not true. Matei Zaharia says:

 Just as a note on this paper, apart from implementing the algorithms in
 naive Python…

And Ignacio Zendejas says:

 Interesting that they chose to use Python alone…

In reality, the paper also describes pure Java implementations that ran on
Spark, with no Python, for two models: a Gaussian mixture model and LDA. 

For the GMM, Java ran in 40-50% of the time compared to Python (though to be
fair, a lot of that is due to the fact that GMM inference is
linear-algebra-intensive; it’s not easy to do linear algebra in the JVM for
reasons I’ll not get into here… it’s possible someone else could do a lot
better). On LDA, Java was less than 10% of the Python time (that is, much
faster).

Matei Zaharia also says:

 they also run it in a fairly inefficient way. In particular 
 their implementations send the model out with every task closure, which is 
 really expensive for a large model, and bring it back with collectAsMap(). 

It’s certainly possible our implementations were sub-optimal. All I can say
is that again, the goal of the paper was to chronicle our experiences using
each platform as both a programming and execution platform. While doubtless
Spark’s developers could have done a better job of writing code for Spark, I
hope we didn’t do too badly! And again, evaluating ease-of-programming was
at least half of our goal.

That said, I doubt that sending out the model should be much of a bottleneck,
as Matei Zaharia implies, at least in the cases we tested. And even if it were
a bottleneck, one could argue that it probably shouldn't be. In the very worst
case (LDA) the model is 100 components x 10^4 dictionary size x 8 bytes per FP
number, or 8 MB in all. Not too large. The smallest model is the GMM at 10
dimensions. In this case, the model has 10 components x (10 x 10 covariance
matrix + 10-dim mean vector) x 8 bytes, or roughly 10 KB. Tiny, even!
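
Spelled out as a quick calculation (8 bytes per double; the counts are the ones
quoted above, not new figures):

# LDA: 100 topics x 10^4 dictionary entries x 8 bytes -> 8,000,000 B, about 8 MB
lda_bytes = 100 * 10**4 * 8
# GMM at 10 dimensions: 10 components x (10x10 covariance + 10-dim mean) x 8 bytes
# -> 8,800 B, i.e. roughly the 10 KB quoted above
gmm_bytes = 10 * (10 * 10 + 10) * 8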

Matei Zaharia also says:

 Implementing ML algorithms well by hand is unfortunately difficult, and
 this 
 is why we have MLlib. The hope is that you either get your desired
 algorithm 
 out of the box or get a higher-level primitive (e.g. stochastic gradient 
 descent) that you can plug some functions into, without worrying about 
 the communication. 

I couldn’t agree more. But again, the idea of our benchmark was specifically
to consider the case of an expert who is facing just such an implementation
challenge: he/she needs a model for which a canned implementation does not
exist. 

Finally, let me say that we’d absolutely LOVE it if someone who is an active
Spark developer would take the time to implement one or more of these
algorithms and replicate our experiments (I’d be happy to help anyone out
who wants to do this—send me a message). It’s already been a year since we
did all of this, and for that reason alone the results might be quite
different.




Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-14 Thread Jeremy Freeman
@Ignacio, happy to share, here's a link to a library we've been developing 
(https://github.com/freeman-lab/thunder). As just a couple of examples, we have
pipelines that use Fourier transforms and other signal processing from SciPy,
and others that do massively parallel model fitting via scikit-learn functions,
etc. That should give you some idea of how such libraries can be usefully
integrated into a PySpark project. By the way, a couple of things we do overlap
with functionality now available in MLlib via the Python API, which we're
working on integrating.
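
As a rough illustration of that kind of integration (a made-up example, not
code from thunder), here is a SciPy routine applied across an RDD from PySpark:

import numpy as np
from scipy.fftpack import fft
from pyspark import SparkContext

sc = SparkContext("local[*]", "scipy-in-pyspark")
# one time series per record; run a SciPy FFT on each record in parallel
series = sc.parallelize([np.random.randn(128) for _ in range(1000)])
spectra = series.map(lambda ts: np.abs(fft(ts)))
print(spectra.first()[:5])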

On Aug 13, 2014, at 5:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com 
wrote:

 Yep, I thought it was a bogus comparison.
 
 I should rephrase my question as it was poorly phrased: on average, how
 much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
 I've only used Spark and don't have a chance to test this at the moment so
 if anybody has these numbers or general estimates (10x, etc), that'd be
 great.
 
 @Jeremy, if you can discuss this, what's an example of a project you
 implemented using these libraries + PySpark?
 
 Thanks everyone!
 
 
 
 
 On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 
 On a related note, I recently heard about Distributed R
 https://github.com/vertica/DistributedR, which is coming out of
 HP/Vertica and seems to be their proposition for machine learning at scale.
 
 It would be interesting to see some kind of comparison between that and
 MLlib (and perhaps also SparkR
 https://github.com/amplab-extras/SparkR-pkg?), especially since
 Distributed R has a concept of distributed arrays and works on data
 in-memory. Docs are here.
 https://github.com/vertica/DistributedR/tree/master/doc/platform
 
 Nick
 
 
 On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote:
 
 They only compared their own implementations of couple algorithms on
 different platforms rather than comparing the different platforms
 themselves (in the case of Spark -- PySpark). I can write two variants of
 an algorithm on Spark and make them perform drastically differently.
 
 I have no doubt if you implement a ML algorithm in Python itself without
 any native libraries, the performance will be sub-optimal.
 
 What PySpark really provides is:
 
 - Using Spark transformations in Python
 - ML algorithms implemented in Scala (leveraging native numerical
 libraries
 for high performance), and callable in Python
 
 The paper claims Python is now one of the most popular languages for
 ML-oriented programming, and that's why they went ahead with Python.
 However, as I understand, very few people actually implement algorithms in
 Python directly because of the sub-optimal performance. Most people
 implement algorithms in other languages (e.g. C / Java), and expose APIs
 in
 Python for ease-of-use. This is what we are trying to do with PySpark as
 well.
 
 
 On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
 ignacio.zendejas...@gmail.com wrote:
 
 Has anyone had a chance to look at this paper (with title in subject)?
 http://www.cs.rice.edu/~lp6/comparison.pdf
 
 Interesting that they chose to use Python alone. Do we know how much
 faster
 Scala is vs. Python in general, if at all?
 
 As with any and all benchmarks, I'm sure there are caveats, but it'd be
 nice to have a response to the question above for starters.
 
 Thanks,
 Ignacio
 
 
 
 



Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-14 Thread Ignacio Zendejas
Thanks, Jeremy! That's awesome. There's a group at Facebook that is
considering using Spark, so having more projects to refer to is great.

And Matei, I completely agree. MLlib is very exciting. I respect how well
you guys are managing the project for quality. This will set the Spark
ecosystem apart beyond the already impressive gains in performance and
productivity.

cheers,
Ignacio



On Thu, Aug 14, 2014 at 12:21 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Just as a note on this paper, apart from implementing the algorithms in
 naive Python, they also run it in a fairly inefficient way. In particular
 their implementations send the model out with every task closure, which is
 really expensive for a large model, and bring it back with collectAsMap().
 It would be much more efficient to send it e.g. with
 SparkContext.broadcast() or keep it distributed on the cluster throughout
 the computation, instead of making the driver node a bottleneck for
 communication.

 Implementing ML algorithms well by hand is unfortunately difficult, and
 this is why we have MLlib. The hope is that you either get your desired
 algorithm out of the box or get a higher-level primitive (e.g. stochastic
 gradient descent) that you can plug some functions into, without worrying
 about the communication.

 Matei

 On August 13, 2014 at 11:10:02 AM, Ignacio Zendejas (
 ignacio.zendejas...@gmail.com) wrote:

 Has anyone had a chance to look at this paper (with title in subject)?
 http://www.cs.rice.edu/~lp6/comparison.pdf

 Interesting that they chose to use Python alone. Do we know how much
 faster
 Scala is vs. Python in general, if at all?

 As with any and all benchmarks, I'm sure there are caveats, but it'd be
 nice to have a response to the question above for starters.

 Thanks,
 Ignacio




Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
They only compared their own implementations of couple algorithms on
different platforms rather than comparing the different platforms
themselves (in the case of Spark -- PySpark). I can write two variants of
an algorithm on Spark and make them perform drastically differently.

I have no doubt if you implement a ML algorithm in Python itself without
any native libraries, the performance will be sub-optimal.

What PySpark really provides is:

- Using Spark transformations in Python
- ML algorithms implemented in Scala (leveraging native numerical libraries
for high performance), and callable in Python

The paper claims Python is now one of the most popular languages for
ML-oriented programming, and that's why they went ahead with Python.
However, as I understand, very few people actually implement algorithms in
Python directly because of the sub-optimal performance. Most people
implement algorithms in other languages (e.g. C / Java), and expose APIs in
Python for ease-of-use. This is what we are trying to do with PySpark as
well.
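
As one concrete (hypothetical) sketch of that division of labor: the driver
logic below is Python, but the clustering itself runs in the JVM via MLlib's
Python API.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "mllib-from-python")
# toy 2-D points; KMeans.train dispatches to the Scala implementation
points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)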


On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
ignacio.zendejas...@gmail.com wrote:

 Has anyone had a chance to look at this paper (with title in subject)?
 http://www.cs.rice.edu/~lp6/comparison.pdf

 Interesting that they chose to use Python alone. Do we know how much faster
 Scala is vs. Python in general, if at all?

 As with any and all benchmarks, I'm sure there are caveats, but it'd be
 nice to have a response to the question above for starters.

 Thanks,
 Ignacio



Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
Our experience matches Reynold's comments; pure-Python implementations of
anything are generally sub-optimal compared to pure Scala implementations, or
Scala versions exposed to Python (which are faster than pure Python, but still
slower than pure Scala). It also seems at first glance that some of the
implementations in the paper themselves might not have been optimal
(regardless of Python vs. Scala).

All that said, we have found it useful to implement some workflows purely in
Python, mainly when we want to exploit libraries like NumPy, SciPy, or
scikit-learn, or to incorporate existing Python code bases, in which case the
flexibility is worth a drop in performance, at least for us! This might also
make more sense for specialized routines as opposed to core, low-level
algorithms.
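
A sketch of what such a Python-side workflow can look like (a hypothetical
example, not ours): fitting one small scikit-learn model per group, in parallel.

import numpy as np
from sklearn.linear_model import LinearRegression
from pyspark import SparkContext

sc = SparkContext("local[*]", "sklearn-per-group")

def fit_group(records):
    # records: iterable of (x, y) pairs for one group
    data = np.array(list(records))
    model = LinearRegression().fit(data[:, :1], data[:, 1])
    return float(model.coef_[0])

# 10 groups of 100 synthetic points each, y = 2x + group_id
pairs = sc.parallelize([(g, (float(x), 2.0 * x + g))
                        for g in range(10) for x in range(100)])
slopes = pairs.groupByKey().mapValues(fit_group).collect()
print(slopes[:3])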






Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Nicholas Chammas
On a related note, I recently heard about Distributed R
https://github.com/vertica/DistributedR, which is coming out of
HP/Vertica and seems to be their proposition for machine learning at scale.

It would be interesting to see some kind of comparison between that and
MLlib (and perhaps also SparkR https://github.com/amplab-extras/SparkR-pkg?),
especially since Distributed R has a concept of distributed arrays and
works on data in-memory. Docs are here.
https://github.com/vertica/DistributedR/tree/master/doc/platform

Nick


On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote:

 They only compared their own implementations of couple algorithms on
 different platforms rather than comparing the different platforms
 themselves (in the case of Spark -- PySpark). I can write two variants of
 an algorithm on Spark and make them perform drastically differently.

 I have no doubt if you implement a ML algorithm in Python itself without
 any native libraries, the performance will be sub-optimal.

 What PySpark really provides is:

 - Using Spark transformations in Python
 - ML algorithms implemented in Scala (leveraging native numerical libraries
 for high performance), and callable in Python

 The paper claims Python is now one of the most popular languages for
 ML-oriented programming, and that's why they went ahead with Python.
 However, as I understand, very few people actually implement algorithms in
 Python directly because of the sub-optimal performance. Most people
 implement algorithms in other languages (e.g. C / Java), and expose APIs in
 Python for ease-of-use. This is what we are trying to do with PySpark as
 well.


 On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
 ignacio.zendejas...@gmail.com wrote:

  Has anyone had a chance to look at this paper (with title in subject)?
  http://www.cs.rice.edu/~lp6/comparison.pdf
 
  Interesting that they chose to use Python alone. Do we know how much
 faster
  Scala is vs. Python in general, if at all?
 
  As with any and all benchmarks, I'm sure there are caveats, but it'd be
  nice to have a response to the question above for starters.
 
  Thanks,
  Ignacio
 



Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
Actually I believe the same person started both projects.

The Distributed R project from HP was started by Shivaram Venkataraman when
he was there. He since moved to Berkeley AMPLab to pursue a PhD and SparkR
was his latest project.



On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 On a related note, I recently heard about Distributed R
 https://github.com/vertica/DistributedR, which is coming out of
 HP/Vertica and seems to be their proposition for machine learning at scale.

 It would be interesting to see some kind of comparison between that and
 MLlib (and perhaps also SparkR
 https://github.com/amplab-extras/SparkR-pkg?), especially since
 Distributed R has a concept of distributed arrays and works on data
 in-memory. Docs are here.
 https://github.com/vertica/DistributedR/tree/master/doc/platform

 Nick


 On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote:

 They only compared their own implementations of couple algorithms on
 different platforms rather than comparing the different platforms
 themselves (in the case of Spark -- PySpark). I can write two variants of
 an algorithm on Spark and make them perform drastically differently.

 I have no doubt if you implement a ML algorithm in Python itself without
 any native libraries, the performance will be sub-optimal.

 What PySpark really provides is:

 - Using Spark transformations in Python
 - ML algorithms implemented in Scala (leveraging native numerical
 libraries
 for high performance), and callable in Python

 The paper claims Python is now one of the most popular languages for
 ML-oriented programming, and that's why they went ahead with Python.
 However, as I understand, very few people actually implement algorithms in
 Python directly because of the sub-optimal performance. Most people
 implement algorithms in other languages (e.g. C / Java), and expose APIs
 in
 Python for ease-of-use. This is what we are trying to do with PySpark as
 well.


 On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
 ignacio.zendejas...@gmail.com wrote:

  Has anyone had a chance to look at this paper (with title in subject)?
  http://www.cs.rice.edu/~lp6/comparison.pdf
 
  Interesting that they chose to use Python alone. Do we know how much
 faster
  Scala is vs. Python in general, if at all?
 
  As with any and all benchmarks, I'm sure there are caveats, but it'd be
  nice to have a response to the question above for starters.
 
  Thanks,
  Ignacio
 





Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Reynold Xin
BTW you can find the original Presto (rebranded as Distributed R) paper
here:
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf


On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin r...@databricks.com wrote:

 Actually I believe the same person started both projects.

 The Distributed R project from HP was started by Shivaram Venkataraman
 when he was there. He since moved to Berkeley AMPLab to pursue a PhD and
 SparkR was his latest project.



 On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 On a related note, I recently heard about Distributed R
 https://github.com/vertica/DistributedR, which is coming out of
 HP/Vertica and seems to be their proposition for machine learning at scale.

 It would be interesting to see some kind of comparison between that and
 MLlib (and perhaps also SparkR
 https://github.com/amplab-extras/SparkR-pkg?), especially since
 Distributed R has a concept of distributed arrays and works on data
 in-memory. Docs are here.
 https://github.com/vertica/DistributedR/tree/master/doc/platform

 Nick


 On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com wrote:

 They only compared their own implementations of couple algorithms on
 different platforms rather than comparing the different platforms
 themselves (in the case of Spark -- PySpark). I can write two variants of
 an algorithm on Spark and make them perform drastically differently.

 I have no doubt if you implement a ML algorithm in Python itself without
 any native libraries, the performance will be sub-optimal.

 What PySpark really provides is:

 - Using Spark transformations in Python
 - ML algorithms implemented in Scala (leveraging native numerical
 libraries
 for high performance), and callable in Python

 The paper claims Python is now one of the most popular languages for
 ML-oriented programming, and that's why they went ahead with Python.
 However, as I understand, very few people actually implement algorithms
 in
 Python directly because of the sub-optimal performance. Most people
 implement algorithms in other languages (e.g. C / Java), and expose APIs
 in
 Python for ease-of-use. This is what we are trying to do with PySpark as
 well.


 On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
 ignacio.zendejas...@gmail.com wrote:

  Has anyone had a chance to look at this paper (with title in subject)?
  http://www.cs.rice.edu/~lp6/comparison.pdf
 
  Interesting that they chose to use Python alone. Do we know how much
 faster
  Scala is vs. Python in general, if at all?
 
  As with any and all benchmarks, I'm sure there are caveats, but it'd be
  nice to have a response to the question above for starters.
 
  Thanks,
  Ignacio
 






Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Shivaram Venkataraman
Yeah I worked on DistributedR while I was an intern at HP Labs, but it has
evolved a lot since then. I don't think it's a direct comparison, as
DistributedR is a pure R implementation in a distributed setting while
SparkR is a wrapper around the Scala / Java implementations in Spark.

That said, it would be an interesting exercise to compare them and I hope
to do it at some point.

Shivaram


On Wed, Aug 13, 2014 at 2:16 PM, Reynold Xin r...@databricks.com wrote:

 Actually I believe the same person started both projects.

 The Distributed R project from HP was started by Shivaram Venkataraman when
 he was there. He since moved to Berkeley AMPLab to pursue a PhD and SparkR
 was his latest project.



 On Wed, Aug 13, 2014 at 1:04 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  On a related note, I recently heard about Distributed R
  https://github.com/vertica/DistributedR, which is coming out of
  HP/Vertica and seems to be their proposition for machine learning at
 scale.
 
  It would be interesting to see some kind of comparison between that and
  MLlib (and perhaps also SparkR
  https://github.com/amplab-extras/SparkR-pkg?), especially since
  Distributed R has a concept of distributed arrays and works on data
  in-memory. Docs are here.
  https://github.com/vertica/DistributedR/tree/master/doc/platform
 
  Nick
 
 
  On Wed, Aug 13, 2014 at 3:29 PM, Reynold Xin r...@databricks.com
 wrote:
 
  They only compared their own implementations of couple algorithms on
  different platforms rather than comparing the different platforms
  themselves (in the case of Spark -- PySpark). I can write two variants
 of
  an algorithm on Spark and make them perform drastically differently.
 
  I have no doubt if you implement a ML algorithm in Python itself without
  any native libraries, the performance will be sub-optimal.
 
  What PySpark really provides is:
 
  - Using Spark transformations in Python
  - ML algorithms implemented in Scala (leveraging native numerical
  libraries
  for high performance), and callable in Python
 
  The paper claims Python is now one of the most popular languages for
  ML-oriented programming, and that's why they went ahead with Python.
  However, as I understand, very few people actually implement algorithms
 in
  Python directly because of the sub-optimal performance. Most people
  implement algorithms in other languages (e.g. C / Java), and expose APIs
  in
  Python for ease-of-use. This is what we are trying to do with PySpark as
  well.
 
 
  On Wed, Aug 13, 2014 at 11:09 AM, Ignacio Zendejas 
  ignacio.zendejas...@gmail.com wrote:
 
   Has anyone had a chance to look at this paper (with title in subject)?
   http://www.cs.rice.edu/~lp6/comparison.pdf
  
   Interesting that they chose to use Python alone. Do we know how much
  faster
   Scala is vs. Python in general, if at all?
  
   As with any and all benchmarks, I'm sure there are caveats, but it'd
 be
   nice to have a response to the question above for starters.
  
   Thanks,
   Ignacio
  
 
 
 



Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
ignacio.zendejas...@gmail.com wrote:
 Yep, I thought it was a bogus comparison.

 I should rephrase my question as it was poorly phrased: on average, how
 much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
 I've only used Spark and don't have a chance to test this at the moment so
 if anybody has these numbers or general estimates (10x, etc), that'd be
 great.

A quick comparison by word count on 4.3G text file (local mode),

Spark:  40 seconds
PySpark: 2 minutes and 16 seconds

So PySpark is 3.4x slower than Spark.
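
For context, the script itself isn't shown in the thread; a PySpark word count
of roughly this shape (the path is a placeholder, not the actual file) is the
kind of job being timed:

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")   # local mode, as in the timings above
counts = (sc.textFile("/path/to/input.txt")  # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))
print(counts.count())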




Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu dav...@databricks.com wrote:
 On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
 ignacio.zendejas...@gmail.com wrote:
 Yep, I thought it was a bogus comparison.

 I should rephrase my question as it was poorly phrased: on average, how
 much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
 I've only used Spark and don't have a chance to test this at the moment so
 if anybody has these numbers or general estimates (10x, etc), that'd be
 great.

 A quick comparison by word count on 4.3G text file (local mode),

 Spark:  40 seconds
 PySpark: 2 minutes and 16 seconds

 So PySpark is 3.4x slower than Spark.

I also tried DPark, which is a pure Python clone of Spark:

DPark: 53 seconds

so it's more than twice as fast as PySpark, because it does not have
the overhead of passing data between the JVM and Python.
