[ https://issues.apache.org/jira/browse/TEZ-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576706#comment-17576706 ]
katty he commented on TEZ-4416:
-------------------------------

This situation can happen frequently. I had the same problem, but when I kill the application and start a new one, it succeeds, so I am curious about the conditions under which it happens.

> Dead lock triggered by ShuffleScheduler
> ---------------------------------------
>
>                 Key: TEZ-4416
>                 URL: https://issues.apache.org/jira/browse/TEZ-4416
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: Omega-Ariston
>            Priority: Major
>         Attachments: container.jstack, screenshot.PNG
>
>
> How this bug was found:
> I was executing a SQL query with Hive on Tez on a cluster that has low disk
> capacity. An exception was thrown during the execution (which is quite
> reasonable). Yet the task didn't stop normally, but kept hanging for a
> very long while. Therefore, I printed out the jstack and did some
> investigation. Here's what I found.
> (The .jstack file and the screenshot of the jstack segment are attached below.)
>
> How this deadlock is triggered:
> # Copying files to local disk fails, which triggers copyFailed(), a
> synchronized method on the ShuffleScheduler instance, from
> FetcherOrderedGrouped.copyFromHost().
> # The call from step 1 eventually reaches ShuffleScheduler.close(), which
> tries to stop the Referee thread by calling referee.interrupt()
> and referee.join().
> # Meanwhile, the Referee is waiting in its run() method for the
> ShuffleScheduler instance lock, which is held by the thread from step 1.
> Hence a deadlock happens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
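The three steps in the quoted description can be reproduced in isolation. The following is a minimal sketch, not Tez code: the Scheduler class, the refereeLoop() method and the join timeout are hypothetical stand-ins, introduced only so that the demo terminates instead of hanging. It assumes, as the report states, that copyFailed() is synchronized on the scheduler and that close() calls interrupt() followed by join() on the referee thread.

```java
import java.util.concurrent.CountDownLatch;

public class ShuffleDeadlockSketch {

    // Hypothetical stand-in for ShuffleScheduler; NOT the real Tez class.
    static class Scheduler {
        private final CountDownLatch lockHeld = new CountDownLatch(1);
        final Thread referee = new Thread(this::refereeLoop, "Referee");

        // Step 3 of the report: the referee needs the scheduler's monitor.
        private void refereeLoop() {
            try {
                lockHeld.await(); // wait until copyFailed() owns the monitor
            } catch (InterruptedException e) {
                return;
            }
            synchronized (this) {
                // The real referee would do its bookkeeping here.
            }
        }

        // Step 1: a synchronized method takes the scheduler's monitor.
        synchronized boolean copyFailed(long joinTimeoutMs) throws InterruptedException {
            lockHeld.countDown();
            return close(joinTimeoutMs);
        }

        // Step 2: close() runs while the caller still holds the monitor.
        private boolean close(long joinTimeoutMs) throws InterruptedException {
            // Make the demo deterministic: wait until the referee is
            // actually blocked on monitor entry.
            while (referee.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);
            }
            // interrupt() does not wake a thread blocked on monitor entry.
            referee.interrupt();
            // The report describes join() without a timeout, which would
            // wait forever; the timeout here exists only for the demo.
            referee.join(joinTimeoutMs);
            return referee.isAlive();
        }
    }

    // Returns true if the referee was still alive after interrupt() + join().
    static boolean reproduce() {
        try {
            Scheduler scheduler = new Scheduler();
            scheduler.referee.start();
            boolean stuck = scheduler.copyFailed(200);
            // The monitor is released now, so the referee can finish.
            scheduler.referee.join();
            return stuck;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("referee still alive after interrupt+join: " + reproduce());
    }
}
```

In this sketch the referee only exits once copyFailed() returns and releases the monitor. With an untimed join() inside close(), that release would never happen, which matches the hang described in the report. One possible way out, offered only as an assumption and not as the project's actual fix, is to join the referee after leaving the synchronized region.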