[ https://issues.apache.org/jira/browse/TEZ-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiechuan Chen updated TEZ-4416:
-------------------------------
    Description: 
How this bug was found:

I was executing a SQL query with Hive on Tez on a cluster that had low disk capacity. An exception was thrown during the execution (which is quite reasonable), yet the task did not stop normally; it kept hanging for a very long time. Therefore, I took a jstack thread dump and did some investigation. Here's what I found.

(The .jstack file and a screenshot of the relevant jstack segment are attached below.)

How this deadlock is triggered:
 # Copying files on the local disk fails, which triggers copyFailed() from FetcherOrderedGrouped.copyFromHost(); copyFailed() is a synchronized method on the ShuffleScheduler instance.
 # The call from step 1 eventually reaches ShuffleScheduler.close(), which tries to stop the Referee thread by calling referee.interrupt() and referee.join() while still holding the lock.
 # Meanwhile, the Referee thread is waiting for the ShuffleScheduler instance lock in its run() method, and that lock is held by the thread from step 1. referee.join() therefore never returns, and the task hangs in a deadlock (a minimal sketch of this pattern is shown below).
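
For illustration, here is a minimal, self-contained Java sketch of the locking pattern described above. The names (Scheduler, Referee, copyFailed(), close(), the penalties queue) mirror this report but are illustrative stand-ins rather than the actual Tez classes, and the demo deliberately forces the unlucky timing (the spin-wait in close()) so the hang reproduces deterministically.

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: Scheduler stands in for ShuffleScheduler, Referee for its penalty thread.
public class ShuffleSchedulerHangSketch {

    static class Scheduler {
        // Stand-in for the penalty queue the real Referee thread drains.
        private final BlockingQueue<Object> penalties = new LinkedBlockingQueue<>();
        private final Thread referee;

        Scheduler() {
            referee = new Thread(() -> {
                try {
                    while (true) {
                        penalties.take();               // interruptible wait for work
                        synchronized (Scheduler.this) { // parks here while copyFailed() holds the lock;
                                                        // interrupt() cannot cancel monitor acquisition
                            // penalty bookkeeping on the scheduler would happen here
                        }
                    }
                } catch (InterruptedException e) {
                    // normal shutdown path -- never reached once parked on the monitor above
                }
            }, "Referee");
            referee.start();
        }

        // Step 1: the fetcher's failure path enters a synchronized method on the scheduler.
        synchronized void copyFailed() {
            // Hand the Referee some work while we hold the monitor, so it moves past take()
            // and parks on this monitor. In Tez this is simply an unlucky timing window.
            penalties.add(new Object());
            close();                                    // step 2
        }

        // Step 2: close() stops the Referee while still holding the scheduler's monitor.
        synchronized void close() {
            while (referee.getState() != Thread.State.BLOCKED) {
                Thread.onSpinWait();                    // demo-only: wait until the Referee is parked
            }
            referee.interrupt();                        // no effect on a thread blocked on a monitor
            try {
                referee.join();                         // step 3: never returns -- the Referee needs our lock
            } catch (InterruptedException ignored) {
            }
            System.out.println("close() returned");     // never printed
        }
    }

    public static void main(String[] args) {
        new Scheduler().copyFailed();                   // hangs here, like the fetcher thread in the jstack
    }
}
{code}

When run, the main thread hangs at referee.join() inside close() while the Referee thread stays BLOCKED trying to enter the scheduler's monitor, mirroring the two threads shown in the attached jstack.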

  was:
How this bug was found:

I was executing a SQL query with Hive on Tez on a cluster that had low disk capacity. An exception was thrown during the execution (which is quite reasonable), yet the task did not stop normally; it kept hanging for a very long time. Therefore, I took a jstack thread dump and did some investigation. Here's what I found.

(The .jstack file and a screenshot of the relevant jstack segment are attached below.)

How this deadlock is triggered:
 # Copying files on HDFS fails, which triggers copyFailed() from FetcherOrderedGrouped.copyFromHost(); copyFailed() is a synchronized method on the ShuffleScheduler instance.
 # The call from step 1 eventually reaches ShuffleScheduler.close(), which tries to stop the Referee thread by calling referee.interrupt() and referee.join() while still holding the lock.
 # Meanwhile, the Referee thread is waiting for the ShuffleScheduler instance lock in its run() method, and that lock is held by the thread from step 1. referee.join() therefore never returns, and the task hangs in a deadlock.


> Dead lock triggered by ShuffleScheduler
> ---------------------------------------
>
>                 Key: TEZ-4416
>                 URL: https://issues.apache.org/jira/browse/TEZ-4416
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: Jiechuan Chen
>            Priority: Major
>         Attachments: container.jstack, screenshot.PNG



