[
https://issues.apache.org/jira/browse/SPARK-28320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882613#comment-16882613
]
Hyukjin Kwon commented on SPARK-28320:
--------------------------------------
Is it possible to provide a reproducer? Seems difficult to verify without
knowing how to reproduce.
> Spark job eventually fails after several "attempted to access non-existent
> accumulator" in DAGScheduler
> -------------------------------------------------------------------------------------------------------
>
> Key: SPARK-28320
> URL: https://issues.apache.org/jira/browse/SPARK-28320
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Martin Studer
> Priority: Major
>
> I'm running into an issue where a Spark 2.3.0 (Hortonworks HDP 2.6.5) job
> eventually fails with
> {noformat}
> ERROR ApplicationMaster: User application exited with status 1
> INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
> INFO SparkContext: Invoking stop() from shutdown hook
> {noformat}
> after receiving several exceptions of the form
> {noformat}
> ERROR DAGScheduler: Failed to update accumulators for task 0
> org.apache.spark.SparkException: attempted to access non-existent accumulator 39052
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1130)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> {noformat}
> In addition to "attempted to access non-existent accumulator" I have also
> noticed some (but far fewer) instances of "Attempted to access garbage
> collected accumulator":
> {noformat}
> ERROR DAGScheduler: Failed to update accumulators for task 0
> java.lang.IllegalStateException: Attempted to access garbage collected accumulator 38352
> at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
> at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1127)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1124)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1124)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1207)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1817)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {noformat}
> To provide some more context: This happens in a recursive algorithm
> implemented in pyspark where I leverage data frame checkpointing to truncate
> the lineage graph. Checkpointing is done asynchronously by invoking the count
> action on a different thread when recursing (using Python thread pools).
> While "attempted to access garbage collected accumulator" seems to be an
> unexpected (illegal state) exception, it's unclear to me whether "attempted
> to access non-existent accumulator" is an expected exception in some
> circumstances, specifically related to checkpointing.
> The issue looks somewhat related to
> https://issues.apache.org/jira/browse/SPARK-22371 but that issue does not
> mention "attempted to access non-existent accumulator".
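
For readers trying to picture the setup described in the report, the pattern (lazy DataFrame checkpointing inside a recursion, with the count action triggered on a pool thread) might look roughly like the sketch below. This is not the reporter's code and is not a confirmed reproducer; whether it triggers the accumulator errors is unknown. The names `recurse`, `run`, `step`, and `pending` are hypothetical, the use of `concurrent.futures.ThreadPoolExecutor` is an assumption (the report only says "Python thread pools"), and `df` is assumed to be a pyspark DataFrame on a SparkContext whose checkpoint directory has already been set with `setCheckpointDir`.

```python
from concurrent.futures import ThreadPoolExecutor


def recurse(df, step, depth, pool, pending):
    """Apply `step` `depth` times, checkpointing lazily at each level.

    `df` is assumed to expose `checkpoint(eager=...)` and `count()`,
    as a pyspark DataFrame does. `step` is a hypothetical
    transformation supplied by the caller.
    """
    if depth == 0:
        return df
    # Lazy checkpoint: nothing is materialized until an action runs on `cp`.
    cp = step(df).checkpoint(eager=False)
    # Trigger the action on a pool thread so the recursion does not block.
    pending.append(pool.submit(cp.count))
    return recurse(cp, step, depth - 1, pool, pending)


def run(df, step, depth, workers=4):
    """Drive the recursion and wait for all background counts."""
    pending = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        result = recurse(df, step, depth, pool, pending)
        # Collect results so any failure in a background count surfaces here.
        counts = [f.result() for f in pending]
    return result, counts
```

With a real SparkSession this would be called as something like `run(spark.range(1000), my_step, depth=10)`, where `my_step` is whatever transformation the algorithm applies per level. A reproducer attached to the issue would need the actual transformation and data sizes, which the report does not give.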