RE: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
Hi Tom, That's an outdated document from four or five years ago. Spark currently uses a BitTorrent-like mechanism that's been tuned for datacenter environments. Mosharaf -Original Message- From: Tom thubregt...@gmail.com Sent: 3/11/2015 4:58 PM To: user@spark.apache.org

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
The current broadcast algorithm in Spark approximates the one described in Section 5 of this paper: http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf. It is expected to scale sub-linearly, i.e., O(log N), where N is the number of machines in your cluster. We evaluated up to 100
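To give some intuition for the O(log N) figure, here is a rough sketch in Python. It is not Spark's implementation and not the algorithm from the paper; it assumes a deliberately idealized model in which every machine that already holds the data sends one full copy to one new machine per round, so the number of holders doubles each round. The function name `broadcast_rounds` is made up for this illustration.

```python
def broadcast_rounds(num_machines: int) -> int:
    """Count rounds until all machines hold the data.

    Idealized assumption (not Spark's actual protocol): in each round,
    every machine that already has the data sends a full copy to exactly
    one machine that does not, so the holder count doubles per round.
    """
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    holders = 1  # the source starts with the only copy
    rounds = 0
    while holders < num_machines:
        holders = min(num_machines, holders * 2)
        rounds += 1
    return rounds


for n in (1, 2, 10, 100, 1000):
    print(f"{n} machines -> {broadcast_rounds(n)} rounds")
```

Under this assumption, going from 100 to 1000 machines adds only three rounds (7 versus 10), which is the kind of growth O(log N) describes. A real BitTorrent-like mechanism splits the data into blocks and exchanges them concurrently, so actual timings will not follow this model exactly.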

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Those results look very good for the larger workloads (100 MB and 1 GB). Were you also able to run experiments for smaller amounts of data? For instance, broadcasting a single variable to the entire cluster? In the paper you state that HDFS-based mechanisms performed well only for small amounts of

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks, Mosharaf, for the quick response! Can you maybe give me some pointers to an explanation of this strategy, or elaborate a bit more on it? Which parts are involved, and in which way? Where are the time penalties, and how scalable is this implementation? Thanks again, Tom On 11 March 2015 at