wangshuo128 commented on a change in pull request #26924: [SPARK-30285][CORE]
Fix deadlock between LiveListenerBus#stop and
AsyncEventQueue#removeListenerOnError
URL: https://github.com/apache/spark/pull/26924#discussion_r361402976
##########
File path:
core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
##########
@@ -529,6 +529,46 @@ class SparkListenerSuite extends SparkFunSuite with
LocalSparkContext with Match
}
}
+  Seq(true, false).foreach { throwInterruptedException =>
+    val suffix = if (throwInterruptedException) "throw interrupt" else "set Thread interrupted"
+    test(s"SPARK-30285: Fix deadlock in AsyncEventQueue.removeListenerOnError: $suffix") {
+      val conf = new SparkConf(false)
+        .set(LISTENER_BUS_EVENT_QUEUE_CAPACITY, 5)
+      val bus = new LiveListenerBus(conf)
+      val counter1 = new BasicJobCounter()
+      val counter2 = new BasicJobCounter()
+      val interruptingListener = new DelayInterruptingJobCounter(throwInterruptedException, 3)
+      bus.addToSharedQueue(counter1)
+      bus.addToSharedQueue(interruptingListener)
+      bus.addToEventLogQueue(counter2)
+      assert(bus.activeQueues() === Set(SHARED_QUEUE, EVENT_LOG_QUEUE))
+      assert(bus.findListenersByClass[BasicJobCounter]().size === 2)
+      assert(bus.findListenersByClass[DelayInterruptingJobCounter]().size === 1)
+
+      bus.start(mockSparkContext, mockMetricsSystem)
+
+      (0 until 5).foreach { jobId =>
+        bus.post(SparkListenerJobEnd(jobId, jobCompletionTime, JobSucceeded))
+      }
+
+      // Call bus.stop in a separate thread, otherwise we will block here until bus is stopped
+      val stoppingThread = new Thread(() => {
+        bus.stop()
+      })
+      stoppingThread.start()
+      // Notify interrupting listener starts to work
+      interruptingListener.sleep = false
Review comment:
Unfortunately, checking the `stopped` status can't guarantee this. The bus
may well have already set `stopped` to true but not yet acquired the
synchronized lock.
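To make that window concrete, here is a minimal sketch. It is a toy stand-in, not Spark's code: `ToyBus`, `flagSet`, `proceed` and `RaceWindowDemo` are hypothetical names, and a `ReentrantLock` replaces `synchronized` only so that the observer can ask whether the lock is held. The latches freeze the stopping thread between the compare-and-set and the lock acquisition, which is the state an unlucky schedule could produce in the real bus.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.locks.ReentrantLock

// Hypothetical stand-in for the bus; not Spark's LiveListenerBus.
class ToyBus {
  val stopped = new AtomicBoolean(false)
  // A ReentrantLock is used instead of `synchronized` so the demo can inspect it.
  val lock = new ReentrantLock()
  // Latches that let the demo hold the stopping thread inside the race window.
  val flagSet = new CountDownLatch(1)
  val proceed = new CountDownLatch(1)

  def stop(): Unit = {
    if (!stopped.compareAndSet(false, true)) return
    flagSet.countDown() // the flag is now visible to other threads
    proceed.await()     // simulates being descheduled before taking the lock
    lock.lock()
    try { /* queues would be stopped here */ } finally lock.unlock()
  }
}

object RaceWindowDemo {
  // Returns (stopped flag, lock held) as seen by an observer inside the window.
  def observe(): (Boolean, Boolean) = {
    val bus = new ToyBus
    val stopper = new Thread(() => bus.stop())
    stopper.start()
    bus.flagSet.await() // wait until the flag has been set
    val seen = (bus.stopped.get(), bus.lock.isLocked)
    bus.proceed.countDown() // let the stopping thread finish
    stopper.join()
    seen
  }

  def main(args: Array[String]): Unit = {
    println(observe()) // prints "(true,false)"
  }
}
```

The observer sees `stopped == true` while the lock is still free, so a test that waits on the flag alone can release the listener too early.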
To avoid the race, we could pass a callback into `bus.stop` and notify the
interrupting listener from inside the callback.
```scala
def stop(callback: Option[() => Unit]): Unit = {
  if (!started.get()) {
    throw new IllegalStateException(s"Attempted to stop bus that has not yet started!")
  }
  if (!stopped.compareAndSet(false, true)) {
    return
  }
  synchronized {
    callback.foreach { c => c() }
    queues.asScala.foreach(_.stop())
    queues.clear()
  }
}
...
// in the test
bus.stop(Some(() => interruptingListener.sleep = false))
```
But that approach is awkward:
1. It changes the `bus.stop` API.
2. The `synchronized` block in `bus.stop` has already been removed. It's not
desirable to make such a change just to test an old bug.