[jira] [Updated] (HDFS-862) Potential NN deadlock in processDistributedUpgradeCommand

Andrey Klochkov (JIRA) Mon, 10 Sep 2012 14:44:09 -0700

     [ https://issues.apache.org/jira/browse/HDFS-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrey Klochkov updated HDFS-862:
---------------------------------

    Attachment: org.apache.hadoop.hdfs.server.common.TestDistributedUpgrade-output.txt

Confirming that this happens in practice, at least in tests: 
TestDistributedUpgrade is flaky because of this deadlock. We're capturing 
thread dumps of tests that fail due to timeouts (HADOOP-8755), and the attached 
thread dump is from a TestDistributedUpgrade failure. Thread #110 is blocked by 
#107 (or #109), and #107 (or #109) is in turn blocked by #110. The first one 
acquired the monitor on the UpgradeManagerNamenode instance and the second one 
acquired fsLock, so each is waiting for the lock the other holds. The test 
fails to start the cluster because the NN can't process DN heartbeats. 
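
For anyone who wants to see the same kind of cycle outside the test, here is a 
minimal sketch. It is not HDFS code: the two locks are hypothetical stand-ins 
(a plain Object for the UpgradeManagerNamenode monitor, a 
ReentrantReadWriteLock for fsLock, which is an assumption about its type), and 
the thread names are invented. It uses the JDK's 
ThreadMXBean.findDeadlockedThreads() to confirm that the two threads form a 
cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DeadlockSketch {
    // Start a daemon thread so the deadlocked threads never keep the JVM alive.
    static Thread startDaemon(String name, Runnable body) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Signal that this thread holds its first lock, then wait for the other one.
    static void arriveAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the JVM reports both threads as deadlocked. */
    static boolean provokeAndDetect() {
        // Hypothetical stand-ins for the UpgradeManagerNamenode monitor and fsLock.
        final Object upgradeManager = new Object();
        final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Takes the upgrade manager monitor first, then wants fsLock.
        Thread a = startDaemon("upgrade-command", () -> {
            synchronized (upgradeManager) {
                arriveAndWait(bothHoldFirstLock);
                fsLock.writeLock().lock(); // blocks forever: held by the other thread
                fsLock.writeLock().unlock();
            }
        });
        // Takes fsLock first, then wants the upgrade manager monitor.
        Thread b = startDaemon("block-report", () -> {
            fsLock.writeLock().lock();
            try {
                arriveAndWait(bothHoldFirstLock);
                synchronized (upgradeManager) { // blocks forever: held by the other thread
                }
            } finally {
                fsLock.writeLock().unlock();
            }
        });

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = mx.findDeadlockedThreads(); // null when no cycle is found
            if (ids != null
                    && Arrays.stream(ids).anyMatch(id -> id == a.getId())
                    && Arrays.stream(ids).anyMatch(id -> id == b.getId())) {
                return true;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + provokeAndDetect());
    }
}
```

The latch only makes the interleaving deterministic; in the real test the same 
cycle depends on timing, which would explain why the failure is intermittent.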


                
> Potential NN deadlock in processDistributedUpgradeCommand
> ---------------------------------------------------------
>
>                 Key: HDFS-862
>                 URL: https://issues.apache.org/jira/browse/HDFS-862
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>         Attachments: cycle.png, 
> org.apache.hadoop.hdfs.server.common.TestDistributedUpgrade-output.txt
>
>
> Haven't seen this in practice, but the lock order is inconsistent. 
> processReport locks FSNamesystem, then calls UpgradeManager.startUpgrade, 
> getUpgradeState, and getUpgradeStatus (each of which locks the 
> UpgradeManager). FSNamesystem.processDistributedUpgradeCommand calls 
> upgradeManager.processUpgradeCommand, which is synchronized on UpgradeManager 
> and can in turn call FSNamesystem.leaveSafeMode, which synchronizes on 
> FSNamesystem.
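
One common way to remove this kind of inconsistency is to make every path take 
the locks in the same order. The sketch below only illustrates that idea and is 
not the HDFS code or a proposed patch: the two Objects are hypothetical 
stand-ins for the FSNamesystem and UpgradeManager monitors, the method names 
are borrowed from the description for readability, and the bodies are 
placeholders.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LockOrderSketch {
    // Hypothetical stand-ins for the FSNamesystem and UpgradeManager monitors.
    static final Object namesystem = new Object();
    static final Object upgradeManager = new Object();
    // Both counters are only modified while holding both monitors.
    static int reports = 0;
    static int commands = 0;

    // Path 1 already locks namesystem first, then the upgrade manager.
    static void processReport() {
        synchronized (namesystem) {
            synchronized (upgradeManager) {
                reports++;
            }
        }
    }

    // Path 2, reordered for this sketch: take namesystem before the upgrade
    // manager so that both paths agree on the order.
    static void processDistributedUpgradeCommand() {
        synchronized (namesystem) {
            synchronized (upgradeManager) {
                commands++;
                // A nested call such as leaveSafeMode can re-enter the
                // namesystem monitor safely because Java monitors are reentrant.
                synchronized (namesystem) {
                }
            }
        }
    }

    /** Runs both paths concurrently; returns true if they finish in time. */
    static boolean runConcurrently(int iterations) {
        reports = 0;
        commands = 0;
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> {
            for (int i = 0; i < iterations; i++) processReport();
        });
        pool.execute(() -> {
            for (int i = 0; i < iterations; i++) processDistributedUpgradeCommand();
        });
        pool.shutdown();
        try {
            return pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("finished without deadlock: " + runConcurrently(10000));
    }
}
```

With a single agreed order, neither thread can hold the second lock while 
waiting for the first, so the cycle described above cannot form. Whether this 
ordering is practical in the NN depends on what else the command path does 
while holding the namesystem lock.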

