Hi Tom,
That's an outdated document from 4-5 years ago.
Spark currently uses a BitTorrent-like mechanism that's been tuned for
datacenter environments.
Mosharaf
-----Original Message-----
From: Tom thubregt...@gmail.com
Sent: 3/11/2015 4:58 PM
To: user@spark.apache.org
The current broadcast algorithm in Spark approximates the one described in
Section 5 of this paper:
http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf.
It is expected to scale sub-linearly; i.e., O(log N), where N is the number
of machines in your cluster.
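
As a rough illustration of why a peer-to-peer broadcast can finish in
O(log N) rounds, here is a minimal Python sketch. It assumes an idealized
model that is not Spark's actual implementation: every machine that already
holds the data sends a full copy to exactly one new machine per round, so
the number of holders doubles each round. The function name
broadcast_rounds is hypothetical and used only for this example.

```python
import math


def broadcast_rounds(num_machines: int) -> int:
    """Rounds until all machines hold the data in an idealized doubling model.

    Assumption (not Spark's real protocol): each holder forwards a full copy
    to one machine that does not have it yet, once per round.
    """
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    holders = 1  # only the source holds the data at the start
    rounds = 0
    while holders < num_machines:
        holders = min(num_machines, holders * 2)  # every holder serves one peer
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        # The round count matches ceil(log2(N)) in this model.
        print(n, broadcast_rounds(n), math.ceil(math.log2(n)))
```

In this model, going from 100 to 1000 machines adds only three rounds
(7 versus 10), which is the sense in which the cost grows sub-linearly in N.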
We evaluated up to 100
Those results look very good for the larger workloads (100MB and 1GB). Were
you also able to run experiments for smaller amounts of data? For instance,
broadcasting a single variable to the entire cluster? In the paper you
state that HDFS-based mechanisms performed well only for small amounts of
Thanks Mosharaf, for the quick response! Can you maybe give me some
pointers to an explanation of this strategy? Or elaborate a bit more on it?
Which parts are involved in which way? Where are the time penalties and how
scalable is this implementation?
Thanks again,
Tom
On 11 March 2015 at