GitHub user willb opened a pull request:

    https://github.com/apache/spark/pull/1322

    SPARK-729: predictable closure capture

    SPARK-729 concerns the point at which free variables in closure arguments to 
transformations are captured. Currently, it is possible for a closure to observe 
the environment in which it is serialized rather than the environment in which it 
was created.  This PR causes free variables in closure arguments to RDD 
transformations to be captured at closure-creation time by modifying 
`ClosureCleaner` to serialize and immediately deserialize its argument.
    
    This PR is based on #189 (which is closed) but includes fixes to work with 
changes in 1.0.  In particular, it ensures that the cloned `Broadcast` objects 
produced by closure capture are registered with `ContextCleaner`, so that 
broadcast variables do not become invalid simply because variable capture 
(implemented this way) drops the strong references to the original broadcast 
variables.
    
    (See #189 for additional discussion and background.)
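    The capture-timing problem and the fix can be illustrated with a minimal, 
self-contained Python sketch (Spark's closures are Scala functions; the 
`AddOffset` class and `env` dict here are hypothetical stand-ins used only to 
show why serializing at job time versus creation time gives different results):

```python
import pickle

# Hypothetical stand-in for a closure over a mutable environment.
class AddOffset:
    def __init__(self, env):
        self.env = env          # shared, mutable "environment"
    def __call__(self, x):
        return x + self.env["offset"]

env = {"offset": 1}
f = AddOffset(env)

# Late capture (the bug): the closure is serialized only when the job
# runs, so it sees whatever the environment holds at that later time.
env["offset"] = 100
late = pickle.loads(pickle.dumps(f))
print(late(0))    # 100, not the 1 that was current at creation time

# Eager capture (the fix): roundtrip through serialization at creation
# time, freezing the environment before it can change.
env["offset"] = 1
frozen = pickle.loads(pickle.dumps(AddOffset(env)))
env["offset"] = 100
print(frozen(0))  # still 1
```

    The serialize-then-deserialize roundtrip produces a deep copy of the 
closure's environment, which is exactly what makes the captured values 
predictable regardless of later mutation.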

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-729

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1322.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1322
    
----
commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca
Author: William Benton <[email protected]>
Date:   2014-07-04T16:26:59Z

    Added reference counting for Broadcasts.

commit b10f613b6debe78ec1a1d53dd7f28168f8adb359
Author: William Benton <[email protected]>
Date:   2014-07-07T17:52:26Z

    Added ContextCleaner.withCurrentCleaner
    
    This method allows code that needs access to the currently-active
    ContextCleaner to access it via a DynamicVariable.

commit 14e074a59616732d422454c1ce881bc9b29060cd
Author: William Benton <[email protected]>
Date:   2014-03-18T14:55:57Z

    Added tests for variable capture in closures
    
    The two tests added to ClosureCleanerSuite ensure that variable values
    are captured at RDD definition time, not at job-execution time.

commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d
Author: William Benton <[email protected]>
Date:   2014-03-20T15:48:17Z

    Predictable closure environment capture
    
    The environments of serializable closures are now captured as
    part of closure cleaning. Since we already proactively check most
    closures for serializability, ClosureCleaner.clean now returns
    the result of deserializing the serialized version of the cleaned
    closure.
    
    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala
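    The behavior this commit describes, that cleaning returns the deserialized 
form of the serialized closure, can be sketched as follows (a simplified Python 
analogue; Spark's real `ClosureCleaner` also nulls out unused outer references 
before this step, which is omitted here):

```python
import pickle

def clean(closure):
    # Serializing doubles as the proactive serializability check;
    # deserializing yields a copy whose free variables are frozen
    # at this point in time.
    payload = pickle.dumps(closure)
    return pickle.loads(payload)

# Hypothetical closure-like object for demonstration.
class Multiply:
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, x):
        return x * self.factor

m = Multiply(3)
cleaned = clean(m)
m.factor = 99          # later mutation no longer reaches the cleaned copy
print(cleaned(2))      # 6
```

    Returning the roundtripped copy (rather than the original) is what turns 
the existing serializability check into an environment snapshot.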

commit 39960620eb9c37f92ade2c35e2d5402cad6dd686
Author: William Benton <[email protected]>
Date:   2014-03-26T04:45:45Z

    Skip proactive closure capture for runJob
    
    There are two possible cases for runJob calls: either they are called
    by RDD action methods from inside Spark or they are called from client
    code. There's no need to proactively check the closure argument to
    runJob for serializability or force variable capture in either case:
    
    1. if they are called by RDD actions, their closure arguments consist
    of mapping an already-serializable closure (with an already-frozen
    environment) to each element in the RDD;
    
    2. in both cases, the closure is about to execute and thus the benefit
    of proactively checking for serializability (or ensuring immediate
    variable capture) is nonexistent.
    
    (Note that ensuring capture via serializability on closure arguments to
    runJob also causes pyspark accumulators to fail to update.)
    
    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala

commit f4ed7535a1a1af0a518d4b7c9585073026a745c1
Author: William Benton <[email protected]>
Date:   2014-03-26T16:31:56Z

    Split closure-serializability failure tests
    
    This splits the test identifying expected failures due to
    closure serializability into three cases.

commit b507dd85b0ea1595ed216ffd8226964ba671676c
Author: William Benton <[email protected]>
Date:   2014-04-04T21:39:55Z

    Fixed style issues in tests
    
    Conflicts:
        core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala

commit 5284569120cff51b3d2253e03b435047258798a0
Author: William Benton <[email protected]>
Date:   2014-04-04T22:15:50Z

    Stylistic changes and cleanups
    
    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala

commit d6d49304f543b5435b3062558f80ea61d2e1a757
Author: William Benton <[email protected]>
Date:   2014-05-02T14:23:49Z

    Removed proactive closure serialization from DStream
    
    Conflicts:
        streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0
Author: William Benton <[email protected]>
Date:   2014-07-07T20:47:41Z

    Support tracking clones of broadcast variables.

----

