[ 
https://issues.apache.org/jira/browse/SPARK-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6075:
------------------------------
    Description: 
It looks like some of the AccumulatorSuite tests have started failing 
nondeterministically on Jenkins.  The errors seem to be due to lost / missing 
accumulator updates, e.g.

{code}
Set(843, 356, 437, [...], 181, 618, 131) did not contain element 901
{code}

This could somehow be related to SPARK-3885 / 
https://github.com/apache/spark/pull/4021, a patch to garbage-collect 
accumulators, which was only merged into master.

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/lastCompletedBuild/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/org.apache.spark/AccumulatorSuite/add_value_to_collection_accumulators/

I think I've figured it out: consider the lifecycle of an accumulator in a 
task, say ShuffleMapTask: on the executor, each task deserializes its own copy 
of the RDD inside of its runTask method, so the strong reference to the RDD 
disappears at the end of runTask. In Executor.run(), we call 
Accumulators.values after runTask has exited, so there's a small window in 
which the tasks's RDD can be GC'd, causing accumulators to be GC'd as well 
because there are no longer any strong references to them.

The fix is to keep strong references in localAccums, since we clear this at the 
end of each task anyways. I'm glad that I was able to figure out precisely why 
this was necessary and sorry that I missed this during review; I'll submit a 
fix shortly. In terms of preventative measures, it might be a good idea to 
write up the lifetime / lifecycle of objects' strong references whenever we're 
using WeakReferences, since the process of explicitly writing that out would 
prevent these sorts of mistakes in the future.

  was:
It looks like some of the AccumulatorSuite tests have started failing 
nondeterministically on Jenkins.  The errors seem to be due to lost / missing 
accumulator updates, e.g.

{code}
Set(843, 356, 437, [...], 181, 618, 131) did not contain element 901
{code}

This could somehow be related to SPARK-3885 / 
https://github.com/apache/spark/pull/4021, a patch to garbage-collect 
accumulators, which was only merged into master.

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/lastCompletedBuild/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/org.apache.spark/AccumulatorSuite/add_value_to_collection_accumulators/


> After SPARK-3885, some tasks' accumulator updates may be lost
> -------------------------------------------------------------
>
>                 Key: SPARK-6075
>                 URL: https://issues.apache.org/jira/browse/SPARK-6075
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Tests
>    Affects Versions: 1.4.0
>            Reporter: Josh Rosen
>            Priority: Critical
>
> It looks like some of the AccumulatorSuite tests have started failing 
> nondeterministically on Jenkins.  The errors seem to be due to lost / missing 
> accumulator updates, e.g.
> {code}
> Set(843, 356, 437, [...], 181, 618, 131) did not contain element 901
> {code}
> This could somehow be related to SPARK-3885 / 
> https://github.com/apache/spark/pull/4021, a patch to garbage-collect 
> accumulators, which was only merged into master.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/lastCompletedBuild/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/org.apache.spark/AccumulatorSuite/add_value_to_collection_accumulators/
> I think I've figured it out: consider the lifecycle of an accumulator in a 
> task, say ShuffleMapTask: on the executor, each task deserializes its own 
> copy of the RDD inside of its runTask method, so the strong reference to the 
> RDD disappears at the end of runTask. In Executor.run(), we call 
> Accumulators.values after runTask has exited, so there's a small window in 
> which the tasks's RDD can be GC'd, causing accumulators to be GC'd as well 
> because there are no longer any strong references to them.
> The fix is to keep strong references in localAccums, since we clear this at 
> the end of each task anyways. I'm glad that I was able to figure out 
> precisely why this was necessary and sorry that I missed this during review; 
> I'll submit a fix shortly. In terms of preventative measures, it might be a 
> good idea to write up the lifetime / lifecycle of objects' strong references 
> whenever we're using WeakReferences, since the process of explicitly writing 
> that out would prevent these sorts of mistakes in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to