I'm also one of the authors of this paper, and I was responsible for the Spark
experiments. Thank you all for the discussion!

(1)

Ignacio Zendejas wrote
> I should rephrase my question as it was poorly phrased: on average, how 
> much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? 
> I've only used Spark and don't have a chance to test this at the moment so 
> if anybody has these numbers or general estimates (10x, etc), that'd be 
> great. 


Davies Liu wrote
> A quick comparison by word count on a 4.3 GB text file (local mode), 
> 
> Spark:  40 seconds 
> PySpark: 2 minutes and 16 seconds 
> 
> So PySpark is 3.4x slower than Spark. 

From my perspective, it is difficult to compare the speed of "Scala vs.
Python" or "Spark vs. PySpark" in general. Simple examples may not be enough
to draw a conclusion on such a comparison; we may need more complex models to
get a more comprehensive picture. It is possible that Spark is faster in some
applications but slower than PySpark in others, so the speed question is
application-specific. That is also one of the purposes of our paper: to shed
some light on such benchmarks for platform/performance comparison.
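
For reference, the kind of quick word-count benchmark Davies describes might
look like the following in Spark's Scala API. Davies did not post his code, so
this is only a sketch: the paths and the local-mode setup are my assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // needed for reduceByKey on pre-1.3 Spark

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local mode, matching the quoted benchmark setup.
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    sc.textFile("/path/to/input.txt")     // hypothetical large input file
      .flatMap(_.split(" "))              // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts per word
      .saveAsTextFile("/path/to/output")  // write results; forces evaluation

    sc.stop()
  }
}

The PySpark version is line-for-line analogous, which is what makes word count
a convenient, if simple, probe of the per-record overhead of the Python
workers.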

(2)

Matei Zaharia wrote
> Just as a note on this paper, apart from implementing the algorithms in
> naive Python, they also run it in a fairly inefficient way. In particular
> their implementations send the model out with every task closure, which is
> really expensive for a large model, and bring it back with collectAsMap().
> It would be much more efficient to send it e.g. with
> SparkContext.broadcast() or keep it distributed on the cluster throughout
> the computation, instead of making the driver node a bottleneck for
> communication. 

We tried our best to write several implementations of each model in order to
pick the best one. Some functions seemed promising but failed in our
experiments. Broadcast() is a good idea, and we can try it on our models to
see whether it makes much of a difference. However, as cjermaine said,
broadcasting the models should not be a bottleneck, because the models are
small in all of our experiments. Also, we may change the model parameters in
each iteration, so a one-time broadcast may not help as much as expected.
Moreover, we were very careful when using collect() or collectAsMap(): in our
experiments we do not collect large data sets with collectAsMap(), and it does
not consume much time either.
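
Still, for readers following this thread, here is a minimal sketch of the
per-iteration broadcast pattern Matei describes, in Scala. Everything here is
illustrative rather than the paper's actual code: the paths, the
dimensionality, and the model (a linear model updated with a squared-loss
gradient) are assumptions. Note that a new broadcast variable can be created
whenever the parameters change, so iterative updates do not rule the pattern
out.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("BroadcastSketch").setMaster("local[*]"))

    // Hypothetical input: CSV rows of numeric features with the label last.
    val data = sc.textFile("/path/to/data.csv")
      .map(_.split(",").map(_.toDouble))
      .cache()
    val n = data.count()

    val numFeatures = 100                  // hypothetical dimensionality
    var model = Array.fill(numFeatures)(0.0)

    for (iter <- 1 to 10) {                // arbitrary iteration count
      // Re-broadcast the updated model each iteration: each executor fetches
      // it once, instead of receiving a copy inside every task closure.
      val bcModel = sc.broadcast(model)

      val gradient = data.map { point =>
        val w = bcModel.value              // read the broadcast copy
        val features = point.dropRight(1)
        val label = point.last
        val pred = w.zip(features).map { case (wi, xi) => wi * xi }.sum
        features.map(xi => (pred - label) * xi)  // squared-loss gradient
      }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

      model = model.zip(gradient).map { case (w, g) => w - 0.01 * g / n }
      bcModel.unpersist()                  // drop the stale broadcast
    }

    sc.stop()
  }
}

With this pattern, only the summed gradient flows back through the driver each
iteration, so the driver never has to ship or collect anything proportional to
the data size.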

Overall, I have no doubt that Spark developers can write more efficient code
for our models, and we would very much welcome better implementations of our
experiments from the Spark community.

Thanks!



