spark git commit: SPARK-3642. Document the nuances of shared variables.

srowen Wed, 11 Mar 2015 06:22:24 -0700

Repository: spark
Updated Branches:
  refs/heads/master 548643a9e -> 2d87a415f



SPARK-3642. Document the nuances of shared variables.

Author: Sandy Ryza <[email protected]>

Closes #2490 from sryza/sandy-spark-3642 and squashes the following commits:

aae3340 [Sandy Ryza] SPARK-3642. Document the nuances of broadcast variables


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2d87a415
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2d87a415
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2d87a415

Branch: refs/heads/master
Commit: 2d87a415f20c85487537d6791a73827ff537f2c0
Parents: 548643a
Author: Sandy Ryza <[email protected]>
Authored: Wed Mar 11 13:22:05 2015 +0000
Committer: Sean Owen <[email protected]>
Committed: Wed Mar 11 13:22:05 2015 +0000

----------------------------------------------------------------------
 docs/programming-guide.md | 6 ++++++
 1 file changed, 6 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/2d87a415/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index c011a84..eda3a95 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1207,6 +1207,12 @@ than shipping a copy of it with tasks. They can be used, for example, to give ev
 large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables
 using efficient broadcast algorithms to reduce communication cost.
 
+Spark actions are executed through a set of stages, separated by distributed "shuffle" operations.
+Spark automatically broadcasts the common data needed by tasks within each stage. The data
+broadcasted this way is cached in serialized form and deserialized before running each task. This
+means that explicitly creating broadcast variables is only useful when tasks across multiple stages
+need the same data or when caching the data in deserialized form is important.
+
 Broadcast variables are created from a variable `v` by calling `SparkContext.broadcast(v)`. The
 broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `value`
 method. The code below shows this:
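
(Reader's note, not part of the diff.) The paragraph added by this commit contrasts data that is cached in serialized form and deserialized before each task with data that is cached in deserialized form. The sketch below is a toy model of that cost difference in plain Python. It does not use Spark, and `Executor`, `run_task_implicit`, `run_task_explicit`, and `count_deserializations` are hypothetical names invented for illustration. It assumes, following the paragraph's wording, that an explicit broadcast variable lets an executor reuse one deserialized copy across tasks.

```python
import pickle


class Executor:
    """Toy stand-in for an executor. Hypothetical, not a Spark class."""

    def __init__(self, serialized):
        self.serialized = serialized   # bytes cached on the executor
        self.deserialized = None       # used only by the explicit model
        self.deserialize_calls = 0     # how many times we paid for pickle.loads

    def _load(self):
        self.deserialize_calls += 1
        return pickle.loads(self.serialized)

    def run_task_implicit(self, key):
        # Models automatically broadcast task data:
        # the cached bytes are deserialized again before every task.
        table = self._load()
        return table[key]

    def run_task_explicit(self, key):
        # Models an explicit broadcast variable (assumption):
        # the deserialized value is cached once and reused by later tasks.
        if self.deserialized is None:
            self.deserialized = self._load()
        return self.deserialized[key]


def count_deserializations(explicit, num_tasks):
    """Run num_tasks lookups on one executor and report the deserialization count."""
    lookup = {i: i * i for i in range(1000)}
    executor = Executor(pickle.dumps(lookup))
    run = executor.run_task_explicit if explicit else executor.run_task_implicit
    results = [run(k) for k in range(num_tasks)]
    return results, executor.deserialize_calls


if __name__ == "__main__":
    for explicit in (False, True):
        _, calls = count_deserializations(explicit, num_tasks=5)
        label = "explicit" if explicit else "implicit"
        print(label, "deserializations:", calls)
```

In this model the implicit path deserializes once per task while the explicit path deserializes once per executor, which is the case the paragraph describes as "when caching the data in deserialized form is important". Real Spark behavior depends on the version and broadcast implementation in use.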


