[
https://issues.apache.org/jira/browse/HDFS-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrey Klochkov updated HDFS-862:
---------------------------------
Attachment:
org.apache.hadoop.hdfs.server.common.TestDistributedUpgrade-output.txt
Confirming that this happens in practice, at least in tests.
TestDistributedUpgrade is flaky for this reason. We're capturing thread dumps
of tests that fail due to timeouts (HADOOP-8755), and here's the thread dump
of a TestDistributedUpgrade failure (see attachment). Thread #110 is blocked by
#107 (or #109), and in turn #107 (109?) is blocked by #110. The first one
acquired a monitor on the UpgradeManagerNamenode instance, and the second one
got the fsLock, so each is waiting for the lock the other holds. The test fails
to start the cluster because DN heartbeats can't be processed by the NN.
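
For illustration only, here is a minimal sketch of the same lock-order
inversion outside of HDFS. It is not the NameNode code: the two objects are
hypothetical stand-ins that use plain monitors (the real fsLock may be a
different kind of lock), and the class, method, and thread names are made up.
One thread takes the "FSNamesystem" lock and then the "UpgradeManager" lock,
the other takes them in the opposite order, and the JVM is then asked which
threads are deadlocked, which is the same cycle a thread dump shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class LockOrderSketch {
    // Hypothetical stand-ins: plain monitors, not the real HDFS objects.
    static final Object fsNamesystem = new Object();
    static final Object upgradeManager = new Object();

    // Takes 'first', waits until the peer thread also holds its first lock,
    // then tries to take 'second'.
    static void lockInOrder(Object first, Object second, CountDownLatch bothHoldFirst) {
        synchronized (first) {
            bothHoldFirst.countDown();
            try {
                bothHoldFirst.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Unreachable when the peer takes the locks in the opposite order.
            }
        }
    }

    // Starts the two threads and returns how many threads the JVM reports as
    // deadlocked on monitors (0 if none is reported within about 5 seconds).
    static int countDeadlockedThreads() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        Thread reportPath = new Thread(
            () -> lockInOrder(fsNamesystem, upgradeManager, bothHoldFirst),
            "report-path");
        Thread upgradePath = new Thread(
            () -> lockInOrder(upgradeManager, fsNamesystem, bothHoldFirst),
            "upgrade-command-path");
        // Daemon threads, so the stuck pair does not keep the JVM alive.
        reportPath.setDaemon(true);
        upgradePath.setDaemon(true);
        reportPath.start();
        upgradePath.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 100; attempt++) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

In this sketch the cycle goes away if both paths take the two locks in the
same order. Whether that, or moving the UpgradeManager calls out from under
the namesystem lock, is the right change for the NN is a separate question.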
> Potential NN deadlock in processDistributedUpgradeCommand
> ---------------------------------------------------------
>
> Key: HDFS-862
> URL: https://issues.apache.org/jira/browse/HDFS-862
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.22.0
> Reporter: Todd Lipcon
> Attachments: cycle.png,
> org.apache.hadoop.hdfs.server.common.TestDistributedUpgrade-output.txt
>
>
> Haven't seen this in practice, but the lock order is inconsistent.
> processReport locks FSNamesystem, then calls UpgradeManager.startUpgrade,
> getUpgradeState, and getUpgradeStatus (each of which locks the
> UpgradeManager). FSNamesystem.processDistributedUpgradeCommand calls
> upgradeManager.processUpgradeCommand, which is synchronized on UpgradeManager
> and can call FSNamesystem.leaveSafeMode, which synchronizes on FSNamesystem.