ankurdave commented on pull request #34245:
URL: https://github.com/apache/spark/pull/34245#issuecomment-940401210


   Hmm, thanks for pointing this out. 
[`TaskContextImpl.markTask{Completed,Failed}`](https://github.com/apache/spark/blob/20051eb69904de6afc27fe5adb18bcc760c78701/core/src/main/scala/org/apache/spark/TaskContextImpl.scala#L121)
 actually does hold the TaskContext lock while invoking the listeners. As a 
result, I think the following sequence of events can produce a deadlock:
   
   1. The main thread acquires the lock on TaskContextImpl and begins invoking 
the task completion listeners.
   2. The writer thread attempts to acquire the lock on TaskContextImpl and 
blocks.
   3. The main thread interrupts the writer thread.
   4. The writer thread aborts its lock wait and 
[handles](https://github.com/apache/spark/blob/20051eb69904de6afc27fe5adb18bcc760c78701/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L430)
 the InterruptedException. The exception handler calls 
`TaskContextImpl#isCompleted()`, which again tries to acquire the lock on 
TaskContextImpl. Since the main thread is still holding that lock while it 
waits for the writer thread to finish, neither thread can make progress: a 
deadlock.
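   
   The problematic pattern can be sketched with a minimal model (the class and method names below are illustrative, not Spark's actual code): a listener runs while the context lock is held, so any other thread that needs the same lock blocks until the listener returns.
   
   ```java
   import java.util.concurrent.CountDownLatch;
   
   // Hypothetical stand-in for TaskContextImpl's locking behavior.
   class TaskCtx {
       private boolean completed = false;
   
       synchronized boolean isCompleted() { return completed; }
   
       // Problematic: the listener is invoked while this lock is still held,
       // so another thread calling isCompleted() blocks until it returns.
       synchronized void markCompleted(Runnable listener) {
           completed = true;
           listener.run();
       }
   }
   
   public class DeadlockSketch {
       public static void main(String[] args) throws InterruptedException {
           TaskCtx ctx = new TaskCtx();
           CountDownLatch inListener = new CountDownLatch(1);
   
           // Plays the role of the task's main thread invoking listeners.
           Thread holder = new Thread(() -> ctx.markCompleted(() -> {
               inListener.countDown();
               try { Thread.sleep(1000); } catch (InterruptedException e) { }
           }));
           holder.start();
           inListener.await();  // the lock is now held inside the listener
   
           // Plays the role of the writer thread's exception handler
           // calling isCompleted() in step 4 above.
           Thread writer = new Thread(() -> ctx.isCompleted());
           writer.start();
           writer.join(200);
           System.out.println("writer blocked: " + writer.isAlive());
           holder.join();
           writer.join();
       }
   }
   ```
   
   In this sketch the writer merely blocks until the listener returns; it becomes a true deadlock when, as in step 4, the listener never returns because it is waiting for the writer.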
   
   We can fix this by releasing the TaskContext lock before invoking the 
listeners. I'll update the PR with that change and try to write a test to repro 
the deadlock.
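   
   Roughly, the fix looks like this (again with hypothetical names): mutate the state and snapshot the listener list while holding the lock, then invoke the listeners after releasing it, so re-entrant calls like `isCompleted()` from other threads no longer block.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical stand-in showing the fixed locking pattern.
   class TaskCtxFixed {
       private boolean completed = false;
       private final List<Runnable> listeners = new ArrayList<>();
   
       synchronized void addListener(Runnable l) { listeners.add(l); }
       synchronized boolean isCompleted() { return completed; }
   
       void markCompleted() {
           List<Runnable> toRun;
           synchronized (this) {
               completed = true;
               toRun = new ArrayList<>(listeners);
           }
           // Lock released: threads the listeners interact with can now
           // call isCompleted() without blocking.
           for (Runnable l : toRun) l.run();
       }
   }
   
   public class FixSketch {
       public static void main(String[] args) {
           TaskCtxFixed ctx = new TaskCtxFixed();
           // A listener that makes another thread query the context,
           // mirroring the writer thread's exception handler.
           ctx.addListener(() -> {
               Thread t = new Thread(() ->
                   System.out.println("completed: " + ctx.isCompleted()));
               t.start();
               try { t.join(); } catch (InterruptedException e) { }
           });
           ctx.markCompleted();
       }
   }
   ```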
   
   cc @viirya @zsxwing 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


