vanzin commented on a change in pull request #26924: [SPARK-30285][CORE] Fix 
deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError
URL: https://github.com/apache/spark/pull/26924#discussion_r362060992
 
 

 ##########
 File path: 
core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
 ##########
 @@ -529,6 +529,46 @@ class SparkListenerSuite extends SparkFunSuite with LocalSparkContext with Match
     }
   }
 
+  Seq(true, false).foreach { throwInterruptedException =>
+    val suffix = if (throwInterruptedException) "throw interrupt" else "set Thread interrupted"
+    test(s"SPARK-30285: Fix deadlock in AsyncEventQueue.removeListenerOnError: $suffix") {
+      val conf = new SparkConf(false)
+        .set(LISTENER_BUS_EVENT_QUEUE_CAPACITY, 5)
+      val bus = new LiveListenerBus(conf)
+      val counter1 = new BasicJobCounter()
+      val counter2 = new BasicJobCounter()
+      val interruptingListener = new DelayInterruptingJobCounter(throwInterruptedException, 3)
+      bus.addToSharedQueue(counter1)
+      bus.addToSharedQueue(interruptingListener)
+      bus.addToEventLogQueue(counter2)
+      assert(bus.activeQueues() === Set(SHARED_QUEUE, EVENT_LOG_QUEUE))
+      assert(bus.findListenersByClass[BasicJobCounter]().size === 2)
+      assert(bus.findListenersByClass[DelayInterruptingJobCounter]().size === 1)
+
+      bus.start(mockSparkContext, mockMetricsSystem)
+
+      (0 until 5).foreach { jobId =>
+        bus.post(SparkListenerJobEnd(jobId, jobCompletionTime, JobSucceeded))
+      }
+
+      // Call bus.stop in a separate thread, otherwise we will block here until bus is stopped
+      val stoppingThread = new Thread(() => {
+        bus.stop()
+      })
+      stoppingThread.start()
+      // Notify interrupting listener starts to work
+      interruptingListener.sleep = false
 
 Review comment:
   You guys are trying to fabricate a test that won't test what the actual
code does when a real app is running. That's the problem.
   
   To do that you'd need the `stop()` code in the listener bus to wait while
holding a lock as the queues are being drained, and one of those queues needs
to run into the error that causes it to remove a bad listener. That's hard to
do without inserting callbacks that don't exist into the code, and adding those
callbacks would only be enabling the test, which is why that's questionable.
   
   So you basically need this in the new `stop()`:
   
   ```
   def stop(): Unit = {
     // do some stop stuff here
     testStartCallback()  // test-only hook
     // clear the queues here
     testEndCallback()    // test-only hook
   }
   
   The two callbacks are needed because otherwise there is no guarantee that 
what the queues do will happen before `stop()` does its thing.
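
   For what it's worth, here is a minimal sketch of how a test could use two
such hooks. Everything in it is hypothetical: `ToyBus` is a stand-in, not
`LiveListenerBus`, and the callbacks are passed to the constructor only to keep
the example self-contained. The start callback parks `stop()` until the test
has done the queue-side work, and the end callback tells the test that the
queues were cleared.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical stand-in for the listener bus; NOT Spark's LiveListenerBus.
class ToyBus(testStartCallback: () => Unit, testEndCallback: () => Unit) {
  @volatile var stopped = false
  @volatile var queuesCleared = false

  def stop(): Unit = {
    stopped = true        // "do some stop stuff here"
    testStartCallback()   // test-only hook: stop() is now in progress
    queuesCleared = true  // "clear the queues here"
    testEndCallback()     // test-only hook: the queues were cleared
  }
}

object CallbackSketch {
  // Returns true if the queue-side step ran while stop() was in progress
  // but before the queues were cleared.
  def run(): Boolean = {
    val stopEntered = new CountDownLatch(1)
    val queueActed = new CountDownLatch(1)
    val stopFinished = new CountDownLatch(1)

    val bus = new ToyBus(
      testStartCallback = () => {
        stopEntered.countDown()
        // Park stop() until the queue side has done its work.
        queueActed.await(5, TimeUnit.SECONDS)
        ()
      },
      testEndCallback = () => stopFinished.countDown()
    )

    val stoppingThread = new Thread(() => bus.stop())
    stoppingThread.start()

    // The test thread plays the role of the queue here.
    val sawStop = stopEntered.await(5, TimeUnit.SECONDS)
    val clearedTooEarly = bus.queuesCleared
    queueActed.countDown()

    val finished = stopFinished.await(5, TimeUnit.SECONDS)
    stoppingThread.join(5000)
    sawStop && bus.stopped && !clearedTooEarly && finished && bus.queuesCleared
  }

  def main(args: Array[String]): Unit = println(run())
}
```

   Without the first pair of latches the test thread could run either before
or after the clearing step, which is exactly the ordering problem.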
   
   But really, I don't see what that test would actually be testing now that
there is no synchronized block anymore.
   
   Anything you do here without these callbacks will be racy, and thus may not 
hit the original issue. Also, without the synchronized block, there's nothing 
to cause a deadlock in the first place, so that's why I said the test isn't 
that great to begin with.
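
   To illustrate why the synchronized block mattered, here's a toy sketch of
the pattern (an assumed shape, not the actual Spark code): one thread waits for
a worker while holding a lock, and the worker needs that same lock to finish.
The join is bounded so the example terminates; an unbounded `join()` would
hang.

```scala
object LockJoinSketch {
  // Returns true if the worker finished while the caller was waiting for it.
  def stopAndWait(holdLockWhileWaiting: Boolean, timeoutMs: Long): Boolean = {
    val lock = new Object

    // Stands in for a queue thread that needs the bus lock to remove a listener.
    val worker = new Thread(() => lock.synchronized { () })

    if (holdLockWhileWaiting) {
      lock.synchronized {
        worker.start()
        worker.join(timeoutMs) // the worker is blocked on `lock` the whole time
        !worker.isAlive        // checked while still holding the lock
      }
    } else {
      worker.start()
      worker.join(timeoutMs)   // nothing blocks the worker, so it finishes
      !worker.isAlive
    }
  }

  def main(args: Array[String]): Unit = {
    println(stopAndWait(holdLockWhileWaiting = true, timeoutMs = 300))
    println(stopAndWait(holdLockWhileWaiting = false, timeoutMs = 2000))
  }
}
```

   Once the waiting side no longer holds the lock, the worker always gets
through, so there's nothing left for a deadlock test to catch.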
   
   So I'd avoid trying to create a fancy test that isn't really testing the
issue and just adds unneeded hooks to the main code. The current test is ok
and as close as you'll get without the above callbacks, so either go with that
or just remove the test.

