GitHub user willb opened a pull request:
https://github.com/apache/spark/pull/1322
spark-729: predictable closure capture
SPARK-729 concerns when free variables in closure arguments to
transformations are captured. Currently, it is possible for closures to get the
environment in which they are serialized (not the environment in which they are
created). This PR causes free variables in closure arguments to RDD
transformations to be captured at closure creation time by modifying
`ClosureCleaner` to serialize and deserialize its argument.
This PR is based on #189 (which is closed) but has fixes to work with some
changes in 1.0. In particular, it ensures that the cloned `Broadcast` objects
produced by closure capture are registered with `ContextCleaner` so that
broadcast variables won't become invalid simply because variable capture
(implemented this way) causes strong references to the original broadcast
variables to go away.
(See #189 for additional discussion and background.)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/willb/spark spark-729
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1322.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1322
----
commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca
Author: William Benton <[email protected]>
Date: 2014-07-04T16:26:59Z
Added reference counting for Broadcasts.
commit b10f613b6debe78ec1a1d53dd7f28168f8adb359
Author: William Benton <[email protected]>
Date: 2014-07-07T17:52:26Z
Added ContextCleaner.withCurrentCleaner
This method allows code that needs access to the currently-active
ContextCleaner to access it via a DynamicVariable.
commit 14e074a59616732d422454c1ce881bc9b29060cd
Author: William Benton <[email protected]>
Date: 2014-03-18T14:55:57Z
Added tests for variable capture in closures
The two tests added to ClosureCleanerSuite ensure that variable values
are captured at RDD definition time, not at job-execution time.
commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d
Author: William Benton <[email protected]>
Date: 2014-03-20T15:48:17Z
Predictable closure environment capture
The environments of serializable closures are now captured as
part of closure cleaning. Since we already proactively check most
closures for serializability, ClosureCleaner.clean now returns
the result of deserializing the serialized version of the cleaned
closure.
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
commit 39960620eb9c37f92ade2c35e2d5402cad6dd686
Author: William Benton <[email protected]>
Date: 2014-03-26T04:45:45Z
Skip proactive closure capture for runJob
There are two possible cases for runJob calls: either they are called
by RDD action methods from inside Spark or they are called from client
code. There's no need to proactively check the closure argument to
runJob for serializability or force variable capture in either case:
1. if they are called by RDD actions, their closure arguments consist
of mapping an already-serializable closure (with an already-frozen
environment) to each element in the RDD;
2. in both cases, the closure is about to execute and thus the benefit
of proactively checking for serializability (or ensuring immediate
variable capture) is nonexistent.
(Note that ensuring capture via serializability on closure arguments to
runJob also causes pyspark accumulators to fail to update.)
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
commit f4ed7535a1a1af0a518d4b7c9585073026a745c1
Author: William Benton <[email protected]>
Date: 2014-03-26T16:31:56Z
Split closure-serializability failure tests
This splits the test identifying expected failures due to
closure serializability into three cases.
commit b507dd85b0ea1595ed216ffd8226964ba671676c
Author: William Benton <[email protected]>
Date: 2014-04-04T21:39:55Z
Fixed style issues in tests
Conflicts:
core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala
commit 5284569120cff51b3d2253e03b435047258798a0
Author: William Benton <[email protected]>
Date: 2014-04-04T22:15:50Z
Stylistic changes and cleanups
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
commit d6d49304f543b5435b3062558f80ea61d2e1a757
Author: William Benton <[email protected]>
Date: 2014-05-02T14:23:49Z
Removed proactive closure serialization from DStream
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0
Author: William Benton <[email protected]>
Date: 2014-07-07T20:47:41Z
Support tracking clones of broadcast variables.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---