Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
I’m also one of the authors of this paper, and I was responsible for the Spark experiments. Thank you all for the discussion!

(1) Ignacio Zendejas wrote: I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? I've only used Spark and don't have a chance to test this at the moment so if anybody has these numbers or general estimates (10x, etc), that'd be great.

Davies Liu wrote: A quick comparison by word count on a 4.3 GB text file (local mode): Spark: 40 seconds; PySpark: 2 minutes and 16 seconds. So PySpark is 3.4x slower than Spark.

From my perspective, it is difficult to compare the speed of Scala versus Python, or Spark versus PySpark. Simple examples may not be enough to draw conclusions from such a comparison; we may need more complex models to get a more comprehensive picture. Spark may be faster in some applications but slower than PySpark in others, so the speed question is application-specific. That is one of the purposes of our paper: to shed some light on such benchmarks for platform/performance comparison.

(2) Matei Zaharia wrote: Just as a note on this paper, apart from implementing the algorithms in naive Python, they also run it in a fairly inefficient way. In particular their implementations send the model out with every task closure, which is really expensive for a large model, and bring it back with collectAsMap(). It would be much more efficient to send it e.g. with SparkContext.broadcast() or keep it distributed on the cluster throughout the computation, instead of making the driver node a bottleneck for communication.

We tried our best to write several implementations of each model in order to pick the optimal one. Some approaches seemed promising but failed in our experiments. broadcast() is a good idea, and we can try it with our models to see whether it makes much difference. But as cjermaine said, shipping the model should not be a bottleneck, because the models are small in all of our experiments. Also, we change the model parameters in each iteration, so a one-time broadcast may not help as much as expected. Moreover, we are careful when we use collect() or collectAsMap(); in our experiments we do not collect large sets of data with collectAsMap(), and it does not consume much time either.

Overall, I have no doubt that Spark developers can write more efficient code for our models, and better implementations of our experiments from Spark developers would be very welcome. Thanks!
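For concreteness, here is a minimal PySpark sketch of what Matei suggests (hypothetical names, sizes, and a toy update rule; this is not our actual experiment code): re-broadcasting a small model each iteration instead of capturing it in every task closure.

    # Sketch: per-iteration broadcast of a small model (hypothetical example).
    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext(appName="BroadcastSketch")
    data = sc.parallelize([np.random.randn(10) for _ in range(100000)]).cache()
    n = data.count()

    model = np.zeros(10)                 # hypothetical parameter vector
    for _ in range(10):
        b = sc.broadcast(model)          # shipped once per executor, not per task
        # Workers read the broadcast value; only a small aggregate returns.
        grad = data.map(lambda x: x - b.value).sum()
        model = model + 0.01 * grad / n  # toy driver-side update
        b.unpersist()                    # drop the stale copy before re-broadcasting

Even when the model changes every iteration, each executor receives one copy per iteration rather than one per task, which is the savings Matei describes.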
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
I’m one of the authors of this paper, and I just came across this thread. I’m glad that Ignacio Zendejas noticed our paper! First off, let me post a link to the published version of the paper, which is likely slightly different from the version linked above: http://cmj4.web.rice.edu/performance.pdf

Next, I just want to quickly address a couple of comments made here.

rxin says: They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently.

It’s a bit misleading to say that we just tried a “couple” of algorithms; the paper describes five different algorithms, with multiple implementations of each; we tried a LOT of variants of each algorithm, as we detail in the paper. Also, it is true that we used our own implementations; the point was to compare each platform as a programming and execution platform. The paper is clear that the benchmark was directed towards “a user who wants to run a specific ML inference algorithm over a large data set, but cannot find an existing implementation and thus must 'roll her own' ML code.” We specifically state that we are not interested in comparing canned libraries, which is a very different task. Ease-of-use and speed were considered equally important. If you read the paper, you’ll see that we generally gave PySpark high marks as a programming platform.

Several of the posts here imply that all Spark experiments used PySpark. This is not true. Matei Zaharia says: Just as a note on this paper, apart from implementing the algorithms in naive Python… And Ignacio Zendejas says: Interesting that they chose to use Python alone… In reality, the paper also describes pure Java implementations that ran on Spark, with no Python, for two models: a Gaussian mixture model and LDA. For the GMM, Java ran in 40-50% of the Python time (though to be fair, a lot of that is due to the fact that GMM inference is linear-algebra-intensive; it’s not easy to do linear algebra in the JVM for reasons I’ll not get into here… it’s possible someone else could do a lot better). On LDA, Java took less than 10% of the Python time (that is, it was much faster).

Matei Zaharia also says: they also run it in a fairly inefficient way. In particular their implementations send the model out with every task closure, which is really expensive for a large model, and bring it back with collectAsMap().

It’s certainly possible our implementations were sub-optimal. All I can say is that, again, the goal of the paper was to chronicle our experiences using each platform as both a programming and execution platform. While doubtless Spark’s developers could have done a better job of writing code for Spark, I hope we didn’t do too badly! And again, evaluating ease-of-programming was at least half of our goal. That said, I doubt that sending out the model was much of a bottleneck, whatever Matei Zaharia implies (at least in the cases we tested). And even if it were a bottleneck, one could argue that it shouldn't be. In the very worst case (LDA), the model is 100 components x 10^4 dictionary size x 8 bytes per floating-point number, or 8 MB in all. Not too large. The smallest model is the GMM at 10 dimensions; there the model has 10 components x (a 10 x 10 covariance matrix + a 10-dimensional mean vector) x 8 bytes, or roughly 10 KB. Tiny, even!
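The arithmetic behind those two figures, as a quick back-of-the-envelope check using only the numbers above:

    # Back-of-the-envelope model sizes for the two cases discussed above.
    lda_bytes = 100 * 10**4 * 8          # components x dictionary size x bytes per float
    gmm_bytes = 10 * (10 * 10 + 10) * 8  # components x (covariance entries + mean) x bytes
    print(lda_bytes)                     # 8000000 bytes, i.e. 8 MB
    print(gmm_bytes)                     # 8800 bytes, i.e. roughly 10 KB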
Matei Zaharia also says: Implementing ML algorithms well by hand is unfortunately difficult, and this is why we have MLlib. The hope is that you either get your desired algorithm out of the box or get a higher-level primitive (e.g. stochastic gradient descent) that you can plug some functions into, without worrying about the communication.

I couldn’t agree more. But again, the idea of our benchmark was specifically to consider the case of an expert who faces just such an implementation challenge: he/she needs a model for which no canned implementation exists.

Finally, let me say that we’d absolutely LOVE it if someone who is an active Spark developer would take the time to implement one or more of these algorithms and replicate our experiments (I’d be happy to help anyone who wants to do this; send me a message). It’s already been a year since we did all of this, and for that reason alone the results might be quite different.
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
@Ignacio, happy to share. Here's a link to a library we've been developing (https://github.com/freeman-lab/thunder). As just a couple of examples, we have pipelines that use Fourier transforms and other signal processing from SciPy, and others that do massively parallel model fitting via scikit-learn functions, etc. That should give you some idea of how such libraries can be usefully integrated into a PySpark project. Btw, a couple of things we do overlap with functionality now available in MLlib via the Python API, which we're working on integrating.

On Aug 13, 2014, at 5:16 PM, Ignacio Zendejas (ignacio.zendejas...@gmail.com) wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? I've only used Spark and don't have a chance to test this at the moment so if anybody has these numbers or general estimates (10x, etc), that'd be great. @Jeremy, if you can discuss this, what's an example of a project you implemented using these libraries + PySpark? Thanks everyone!
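To illustrate the pattern Jeremy describes, here is a generic sketch of massively parallel model fitting with scikit-learn inside PySpark (made-up data and names; this is not thunder's actual API). Each record is an independent fitting problem; Spark only handles distribution, and the numerical work runs in scikit-learn on the workers.

    # Sketch: many small, independent scikit-learn fits in parallel (made-up data).
    import numpy as np
    from pyspark import SparkContext
    from sklearn.linear_model import LinearRegression

    sc = SparkContext(appName="ParallelFitSketch")

    # One (key, (X, y)) record per independent fitting problem, e.g. per channel.
    problems = sc.parallelize(
        [(k, (np.random.randn(50, 3), np.random.randn(50))) for k in range(1000)])

    def fit(record):
        key, (X, y) = record
        return key, LinearRegression().fit(X, y).coef_  # runs on the workers

    coefficients = problems.map(fit).collectAsMap()     # small results to the driver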
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
Thanks, Jeremy! That's awesome. There's a group at Facebook that is considering using Spark, so having more projects to refer to is great. And Matei, I completely agree. MLlib is very exciting, and I respect how well you guys are managing the project's quality. This will set the Spark ecosystem apart, beyond the already impressive gains in performance and productivity. cheers, Ignacio

On Thu, Aug 14, 2014 at 12:21 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Just as a note on this paper, apart from implementing the algorithms in naive Python, they also run it in a fairly inefficient way. In particular their implementations send the model out with every task closure, which is really expensive for a large model, and bring it back with collectAsMap(). It would be much more efficient to send it e.g. with SparkContext.broadcast() or keep it distributed on the cluster throughout the computation, instead of making the driver node a bottleneck for communication. Implementing ML algorithms well by hand is unfortunately difficult, and this is why we have MLlib. The hope is that you either get your desired algorithm out of the box or get a higher-level primitive (e.g. stochastic gradient descent) that you can plug some functions into, without worrying about the communication. Matei

On August 13, 2014 at 11:10:02 AM, Ignacio Zendejas (ignacio.zendejas...@gmail.com) wrote: Has anyone had a chance to look at this paper (with title in subject)? http://www.cs.rice.edu/~lp6/comparison.pdf Interesting that they chose to use Python alone. Do we know how much faster Scala is vs. Python in general, if at all? As with any and all benchmarks, I'm sure there are caveats, but it'd be nice to have a response to the question above for starters. Thanks, Ignacio
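As a concrete illustration of the "out of the box" path Matei mentions, here is a minimal sketch using MLlib's SGD-based logistic regression from Python (API roughly as of Spark 1.x; the input file and its format are hypothetical):

    # Sketch: MLlib's SGD-based logistic regression, driven from Python.
    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="MLlibSGDSketch")

    def parse(line):
        vals = [float(x) for x in line.split(",")]
        return LabeledPoint(vals[0], vals[1:])   # first column is the label

    data = sc.textFile("labeled_data.csv").map(parse).cache()  # hypothetical path
    model = LogisticRegressionWithSGD.train(data, iterations=100)
    print(model.weights)  # communication and optimization handled inside MLlib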
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
They only compared their own implementations of a couple of algorithms on different platforms rather than comparing the different platforms themselves (in the case of Spark -- PySpark). I can write two variants of an algorithm on Spark and make them perform drastically differently. I have no doubt that if you implement an ML algorithm in Python itself without any native libraries, the performance will be sub-optimal.

What PySpark really provides is: - Using Spark transformations in Python - ML algorithms implemented in Scala (leveraging native numerical libraries for high performance), and callable in Python

The paper claims Python is now one of the most popular languages for ML-oriented programming, and that's why they went ahead with Python. However, as I understand it, very few people actually implement algorithms in Python directly because of the sub-optimal performance. Most people implement algorithms in other languages (e.g. C / Java) and expose APIs in Python for ease of use. This is what we are trying to do with PySpark as well.
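Reynold's contrast can be made concrete. Below, the same clustering task done two ways: a naive pure-Python distance computation that runs entirely in Python worker processes, and MLlib's KMeans, where Python is only the driver-side API and the iterations run on the JVM (a sketch with made-up data; API details as of roughly Spark 1.x):

    # Sketch: pure-Python vs. Scala-backed clustering in PySpark (made-up data).
    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="TwoWaysSketch")
    points = sc.parallelize([np.random.randn(3) for _ in range(10000)]).cache()

    # 1) Naive pure Python: every distance is computed in Python workers.
    centers = [np.zeros(3), np.ones(3)]
    labels = points.map(
        lambda p: int(np.argmin([np.linalg.norm(p - c) for c in centers])))
    print(labels.take(5))

    # 2) Scala-backed MLlib: Python is just the front end; the heavy lifting
    #    runs on the JVM with native numerical libraries.
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)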
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
Our experience matches Reynold's comments; pure-Python implementations of anything are generally sub-optimal compared to pure Scala implementations or Scala versions exposed to Python (which are faster, but still slower than pure Scala). It also seems on first glance that some of the implementations in the paper might not have been optimal themselves (regardless of Python vs. Scala). All that said, we have found it useful to implement some workflows purely in Python, mainly when we want to exploit libraries like NumPy, SciPy, or scikit-learn, or incorporate existing Python code bases; in those cases the flexibility is worth a drop in performance, at least for us! This might also make more sense for specialized routines as opposed to core, low-level algorithms.
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
On a related note, I recently heard about Distributed R (https://github.com/vertica/DistributedR), which is coming out of HP/Vertica and seems to be their proposition for machine learning at scale. It would be interesting to see some kind of comparison between that and MLlib (and perhaps also SparkR, https://github.com/amplab-extras/SparkR-pkg), especially since Distributed R has a concept of distributed arrays and works on data in-memory. Docs are here: https://github.com/vertica/DistributedR/tree/master/doc/platform Nick
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
Actually, I believe the same person started both projects. The Distributed R project at HP was started by Shivaram Venkataraman when he was there. He has since moved to the Berkeley AMPLab to pursue a PhD, and SparkR is his latest project.
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
BTW you can find the original Presto (rebranded as Distributed R) paper here: http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Venkataraman.pdf
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
Yeah, I worked on DistributedR while I was an intern at HP Labs, but it has evolved a lot since then. I don't think it's a direct comparison, as DistributedR is a pure R implementation in a distributed setting while SparkR is a wrapper around the Scala / Java implementations in Spark. That said, it would be an interesting exercise to compare them, and I hope to do it at some point. Shivaram
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas (ignacio.zendejas...@gmail.com) wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? I've only used Spark and don't have a chance to test this at the moment so if anybody has these numbers or general estimates (10x, etc), that'd be great.

A quick comparison by word count on a 4.3 GB text file (local mode): Spark: 40 seconds; PySpark: 2 minutes and 16 seconds. So PySpark is 3.4x slower than Spark.
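For reference, a word count along these lines looks like the following in PySpark (a generic sketch; the actual benchmark code wasn't posted, and the paths are made up):

    # Sketch: the classic PySpark word count (paths are hypothetical).
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile("input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("counts_out")

The Scala version is structurally identical; the gap Davies measured comes from running the lambdas in Python worker processes and serializing records between the JVM and Python.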
Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu (dav...@databricks.com) wrote: A quick comparison by word count on a 4.3 GB text file (local mode): Spark: 40 seconds; PySpark: 2 minutes and 16 seconds. So PySpark is 3.4x slower than Spark.

I also tried DPark, which is a pure-Python clone of Spark: DPark: 53 seconds. So it's more than 2x faster than PySpark, because it does not have the overhead of passing data between the JVM and Python.