Friends, 

For context (so to speak), I did some work in the 0.9 timeframe to fix 
SPARK-897 (provide immediate feedback when closures aren't serializable) and 
SPARK-729 (make sure that free variables in closures are captured when the RDD 
transformations are declared).

I currently have a branch addressing SPARK-897 that builds and passes tests 
against 0.9, 1.0, and master (last I checked) 
(https://github.com/apache/spark/pull/143).  My branch addressing SPARK-729 
builds on my SPARK-897 branch, and passed the test suite in 0.9[1].  However, 
some things that changed or were added in 1.0 wound up depending on the old 
behavior.  I've been working on other things lately, but I'd like to get these 
issues fixed after 1.0 goes final, so I was hoping to get a bit of discussion 
on the best way forward with an issue I haven't solved yet:

ContextCleaner uses weak references to track broadcast variables.  Because weak 
references obviously don't track cloned objects (or those that have been 
serialized and deserialized), capturing free variables in closures in the 
obvious way (i.e. by replacing the closure with a copy that has been serialized 
and deserialized) results in an undesirable situation:  we might have, e.g., 
live HTTP broadcast variable objects referring to filesystem resources that 
could be cleaned at any time because the objects that they were cloned from 
have become only weakly reachable.
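A tiny sketch of the hazard, using plain Java serialization in place of Spark's closure serializer (the class and variable names here are invented for illustration, not taken from the Spark codebase):

```java
import java.io.*;
import java.lang.ref.WeakReference;

public class WeakCloneDemo {
    // Simple serializable stand-in for a broadcast variable's value.
    static class Payload implements Serializable {
        final int value;
        Payload(int value) { this.value = value; }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Payload original = new Payload(42);
        WeakReference<Payload> weak = new WeakReference<>(original);

        // A serialization round trip yields a distinct object that the
        // weak reference knows nothing about.
        Payload clone = (Payload) deserialize(serialize(original));

        original = null;  // drop the only strong reference to the original
        System.gc();      // a hint only; collection is not guaranteed

        // The clone remains fully usable even if the weak reference has
        // been cleared -- so cleanup keyed on the weak reference could
        // free resources the clone still depends on.
        System.out.println("clone value: " + clone.value);
        System.out.println("weak ref cleared: " + (weak.get() == null));
    }
}
```

In Spark terms, the clone is the copy of the closure (and its captured broadcast objects) produced by eager serialization, and the weakly reachable original is what ContextCleaner is actually watching.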

To be clear, this isn't a problem now; it's only a problem for the way I'm 
proposing to fix SPARK-729.  With that said, I'm wondering whether it would 
make more sense to fix this problem by adding a layer of indirection, so that 
we reference count the external and persisting resources themselves rather 
than the objects that putatively own them, or to take a more sophisticated 
(but potentially more fragile) approach to ensuring variable capture.
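To make the indirection idea concrete, here's a purely hypothetical sketch (all names invented, and the wiring to serialization is only suggested in comments): every copy of the owning object, including deserialized clones, would share a handle that reference counts the underlying resource, so cleanup fires only when the last copy lets go rather than when the original becomes weakly reachable.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountDemo {
    // Invented name: a shared, reference-counted handle on an external
    // resource (e.g. an HTTP broadcast's backing files).
    static class ResourceHandle {
        private final AtomicInteger refs = new AtomicInteger(1);
        private final Runnable cleanup;

        ResourceHandle(Runnable cleanup) { this.cleanup = cleanup; }

        // A deserialized clone would call retain() (e.g. from a
        // readResolve hook) so it holds its own reference.
        ResourceHandle retain() { refs.incrementAndGet(); return this; }

        void release() {
            if (refs.decrementAndGet() == 0) {
                cleanup.run();  // free the external resource exactly once
            }
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        ResourceHandle original = new ResourceHandle(() -> log.append("cleaned"));
        ResourceHandle clone = original.retain();  // stands in for a deserialized copy

        original.release();
        System.out.println("after original released: [" + log + "]");

        clone.release();
        System.out.println("after clone released: [" + log + "]");
    }
}
```

The appeal is that ContextCleaner's weak-reference machinery could stay as-is for triggering releases on the original, while clones keep the resource alive through their own counted references.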



thanks,
wb


[1] Serializing closures also created or uncovered a PySpark issue in 0.9 (and 
presumably in later versions as well) that requires further investigation, but 
my patch did include a workaround; here are the details: 
https://issues.apache.org/jira/browse/SPARK-1454
