[ https://issues.apache.org/jira/browse/SPARK-28320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882939#comment-16882939 ]

Martin Studer commented on SPARK-28320:
---------------------------------------

At this point I don't have a self-contained reproducible example. What I 
observed, though, is that the problem seems to disappear when setting 
{{spark.cleaner.referenceTracking.cleanCheckpoints=false}} - I had this set to 
{{true}} before. I'll have to see whether I can isolate a small reproducible 
example.
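For reference, the workaround amounts to disabling checkpoint cleanup in the context cleaner. As a config fragment (property name as in the Spark 2.3 configuration; whether you pass it via spark-defaults.conf, --conf, or SparkConf depends on the deployment):

```
# spark-defaults.conf (or --conf on spark-submit)
spark.cleaner.referenceTracking.cleanCheckpoints  false
```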

> Spark job eventually fails after several "attempted to access non-existent accumulator" in DAGScheduler
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28320
>                 URL: https://issues.apache.org/jira/browse/SPARK-28320
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Martin Studer
>            Priority: Major
>
> I'm running into an issue where a Spark 2.3.0 (Hortonworks HDP 2.6.5) job 
> eventually fails with
> {noformat}
> ERROR ApplicationMaster: User application exited with status 1
> INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
> INFO SparkContext: Invoking stop() from shutdown hook
> {noformat}
> after receiving several exceptions of the form
> {noformat}
> ERROR DAGScheduler: Failed to update accumulators for task 0
> org.apache.spark.SparkException: attempted to access non-existent accumulator 39052
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1130)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> {noformat}
> In addition to "attempted to access non-existent accumulator" I have also 
> noticed some (but far fewer) instances of "Attempted to access garbage 
> collected accumulator":
> {noformat}
> ERROR DAGScheduler: Failed to update accumulators for task 0
> java.lang.IllegalStateException: Attempted to access garbage collected accumulator 38352
>         at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>         at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>         at scala.Option.map(Option.scala:146)
>         at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1127)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {noformat}
> To provide some more context: This happens in a recursive algorithm 
> implemented in pyspark where I leverage data frame checkpointing to truncate 
> the lineage graph. Checkpointing is done asynchronously by invoking the count 
> action on a different thread when recursing (using Python thread pools).
> While "attempted to access garbage collected accumulator" seems to be an 
> unexpected (illegal state) exception, it's unclear to me whether "attempted 
> to access non-existent accumulator" is an expected exception in some 
> circumstances, specifically related to checkpointing.
> The issue looks somewhat related to 
> https://issues.apache.org/jira/browse/SPARK-22371 but that issue does not 
> mention "attempted to access non-existent accumulator".
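For context, the recurse-while-checkpointing-on-another-thread pattern the issue describes could look roughly like the sketch below. This is plain Python (pyspark is not assumed available, so a stub stands in for the real {{df.checkpoint()}} plus {{df.count()}} action); the names {{recurse}} and {{checkpoint_and_count}} are hypothetical, not from the reporter's job:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub for the real Spark call: in the actual job this would be
# df.checkpoint() followed by df.count() to materialize the checkpoint.
def checkpoint_and_count(df):
    return len(df)

pool = ThreadPoolExecutor(max_workers=2)

def recurse(df, depth):
    """Recursive step that kicks off checkpointing on another thread."""
    if depth == 0:
        return df
    # Materialize the checkpoint asynchronously while recursion continues.
    fut = pool.submit(checkpoint_and_count, df)
    result = recurse([x * 2 for x in df], depth - 1)
    fut.result()  # join the async action before unwinding
    return result

print(recurse([1, 2, 3], 3))  # [8, 16, 24]
```

The point of the sketch is only the concurrency shape: accumulator updates from the background action race against driver-side cleanup, which is where the reported errors would surface.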



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
