Imran Rashid created SPARK-20128:
------------------------------------
Summary: MetricsSystem not always killed in SparkContext.stop()
Key: SPARK-20128
URL: https://issues.apache.org/jira/browse/SPARK-20128
Project: Spark
Issue Type: Test
Components: Spark Core, Tests
Affects Versions: 2.2.0
Reporter: Imran Rashid
One Jenkins run failed because the MetricsSystem was never stopped after a
failed test, which caused that test to hang and the test run to time out:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176
{noformat}
17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR
DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting
down SparkContext
java.lang.ArrayIndexOutOfBoundsException: -1
at
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
at
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
at scala.Option.flatMap(Option.scala:171)
at
org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped
17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster:
BlockManagerMaster stopped
17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully
stopped SparkContext
17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR
ScheduledReporter: RuntimeException thrown from ConsoleReporter#report.
Exception was suppressed.
java.lang.NullPointerException
at
org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
at
org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
at
com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
...
{noformat}
Unfortunately I didn't save the entire test logs, but what happens is that the
initial ArrayIndexOutOfBoundsException is a real bug, which causes the
SparkContext to stop and the test to fail. However, the MetricsSystem somehow
stays alive, and since it's not running on a daemon thread, it just hangs, and
every 20 minutes we get that NPE from within the metrics system as it tries to
report.
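To illustrate why a live reporter keeps the process from exiting, here is a minimal JVM sketch. This is plain Java, not Spark code; codahale's ScheduledReporter drives its periodic report() calls from a ScheduledExecutorService in essentially this way, and `Executors` creates non-daemon threads by default:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReporterDemo {
    public static void main(String[] args) {
        // A ConsoleReporter-style periodic task on a single-thread scheduler.
        // Executors creates NON-daemon threads by default, so the JVM (or a
        // test runner) cannot exit while this scheduler is alive.
        ScheduledExecutorService reporter =
            Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(
            () -> System.out.println("report tick"),
            20, 20, TimeUnit.MINUTES);

        // Without this shutdown() the process hangs forever, ticking every
        // 20 minutes -- analogous to MetricsSystem.stop() being skipped.
        reporter.shutdown();
        System.out.println("exited cleanly");
    }
}
```

Comment out the `shutdown()` call and the program never exits, which matches the hung-build symptom above: the reporter thread wakes up on its 20-minute period and fails against torn-down state.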
I am totally perplexed at how this can happen; it looks like the metrics system
should always be stopped by the time we see
{noformat}
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully
stopped SparkContext
{noformat}
I don't think I've ever seen this in real Spark use, but whatever the cause, it
doesn't look like something that is limited to tests.
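Whatever the root cause turns out to be, one defensive option is to make shutdown best-effort per subsystem, so an exception while stopping one component cannot skip the MetricsSystem. This is only a sketch of that pattern, not Spark's actual SparkContext.stop() code; the `Stoppable` interface and `stopAll` helper here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class StopDemo {
    // Hypothetical stand-in for a subsystem with a stop() method.
    interface Stoppable { void stop(); }

    // Stop each subsystem in its own try/catch so that a failure in an
    // earlier one (e.g. the scheduler throwing during shutdown) cannot
    // prevent later ones (e.g. the metrics system) from being stopped.
    static List<String> stopAll(List<Stoppable> subsystems) {
        List<String> errors = new ArrayList<>();
        for (Stoppable s : subsystems) {
            try {
                s.stop();
            } catch (RuntimeException e) {
                errors.add(e.getMessage()); // record and keep going
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<String> errors = stopAll(List.of(
            () -> { throw new RuntimeException("dagScheduler stop failed"); },
            () -> System.out.println("metricsSystem stopped")));
        System.out.println("errors: " + errors);
    }
}
```

With this shape, the failed stop is logged and every remaining subsystem is still torn down, so no non-daemon reporter thread is left behind to hang the JVM.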
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]