skonto commented on a change in pull request #24796: [SPARK-27900][CORE] Add uncaught exception handler to the driver
URL: https://github.com/apache/spark/pull/24796#discussion_r290532870
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala
 ##########
 @@ -204,6 +204,11 @@ private [util] class SparkShutdownHookManager {
     hooks.synchronized { hooks.remove(ref) }
   }
 
+  def clear(): Unit = {
 
 Review comment:
   @srowen here it is: https://gist.github.com/skonto/74181e434a727901d4f3323461c1050b
   I commented out the clear call. One other (independent) thing I noticed is that the main thread is also stuck here:
   https://github.com/apache/spark/blob/bfb3ffe9b33a403a1f3b6f5407d34a477ce62c85/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L736
   Blocking forever there could be a problem if something goes wrong.
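   (As an illustration of why an unbounded wait is risky, here is a hypothetical sketch, not the actual DAGScheduler code: if whoever is supposed to complete the future is gone, the caller blocks forever with no way to recover.)
   ```scala
   import scala.concurrent.{Await, Promise}
   import scala.concurrent.duration.Duration

   // Hypothetical illustration, not the actual DAGScheduler code: an unbounded
   // wait on a future that nobody ever completes blocks the caller forever.
   object UnboundedWaitDemo {
     def main(args: Array[String]): Unit = {
       val completion = Promise[Unit]() // never completed, like a job whose event loop died
       Await.ready(completion.future, Duration.Inf) // hangs indefinitely
     }
   }
   ```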
   Now if you check the output:
   
   ```
   "Thread-1" #10 prio=5 os_prio=0 tid=0x000055d323902000 nid=0x7c in 
Object.wait() [0x00007fdccd08a000]
      java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x00000000ebe00e50> (a 
org.apache.spark.util.EventLoop$$anon$1)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81)
        at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2100)
   ```
   This will never finish; it waits at the join for the `dag-scheduler-event-loop` thread:
   ```
   "dag-scheduler-event-loop" #45 daemon prio=5 os_prio=0 
tid=0x000055d323a25000 nid=0x48 in Object.wait() [0x00007fdccd6d2000]
      java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x00000000eb4f3b58> (a 
org.apache.hadoop.util.ShutdownHookManager$1)
        at java.lang.Thread.join(Thread.java:1326)
        at 
java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:107)
        at 
java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x00000000eb3848b8> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:971)
        at 
org.apache.spark.util.SparkUncaughtExceptionHandler.sysExit(SparkUncaughtExceptionHandler.scala:35)
        at 
org.apache.spark.util.SparkUncaughtExceptionHandler.uncaughtException(SparkUncaughtExceptionHandler.scala:53)
        at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057)
        at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052)
        at java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
   ```
   which waits for the shutdown hook to finish, a shutdown that was triggered by the OOM this same thread raised, so it's a deadlock. If you check the log, the OOM comes from the thread that tries to submit 1M tasks ;). In short: `dag-scheduler-event-loop` -> shutdownHook -> calls join from the other thread and waits for `dag-scheduler-event-loop` (deadlock).
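   To make the shape of the deadlock concrete, here is a minimal, self-contained sketch (hypothetical demo code, not the actual Spark classes; the threads here just play the roles of `dag-scheduler-event-loop`, `SparkUncaughtExceptionHandler`, and the shutdown hook):
   ```scala
   // Hypothetical sketch of the deadlock shape described above, not Spark code.
   object ShutdownJoinDeadlockDemo {
     def main(args: Array[String]): Unit = {
       // Plays the role of dag-scheduler-event-loop: it dies with an error
       // (simulated OOM; no real allocation failure is triggered here).
       val worker = new Thread(() => throw new OutOfMemoryError("simulated"))

       // Plays the role of SparkUncaughtExceptionHandler.uncaughtException:
       // it calls System.exit, which runs the JVM shutdown hooks *from the
       // worker thread* and blocks there until the hooks finish.
       worker.setUncaughtExceptionHandler((_, _) => System.exit(1))

       // Plays the role of Thread-1 / EventLoop.stop(): the hook joins the worker.
       // But the worker is blocked inside System.exit waiting for this very hook,
       // so neither side can make progress -> deadlock, the JVM hangs.
       Runtime.getRuntime.addShutdownHook(new Thread(() => worker.join()))

       worker.start()
       worker.join() // the main thread hangs too, mirroring the stuck driver
     }
   }
   ```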

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
