[ 
https://issues.apache.org/jira/browse/HDFS-15641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218116#comment-17218116
 ] 

Hongbing Wang commented on HDFS-15641:
--------------------------------------

Thanks [~hexiaoqiao] for your attention. There may be a bit of confusion here. 
*lifelineSender.start()* does not refer to starting the thread directly. 
LifelineSender defines its own start() method, as follows:
{code:java}
// BPServiceActor$LifelineSender#start
public void start() {
  lifelineThread = new Thread(this,
      formatThreadName("lifeline", lifelineNnAddr)); // deadlock occurs inside formatThreadName
  lifelineThread.setDaemon(true);
  //...
  lifelineThread.start(); // the thread actually starts here
}
// formatThreadName
private String formatThreadName(
    final String action,
    final InetSocketAddress addr) {
  String bpId = bpos.getBlockPoolId(true);
  //...
}
// getBlockPoolId
String getBlockPoolId(boolean quiet) {
  // avoid lock contention unless the registration hasn't completed.
  String id = bpId;
  if (id != null) {
    return id;
  }
  DataNodeFaultInjector.get().delayWhenOfferServiceHoldLock();
  readLock(); // deadlock occurs here
  //...
}{code}
To be precise, the deadlock involves `refreshThread` and `bpThread`. 
It is related to the call chain above: *start -> formatThreadName -> 
getBlockPoolId -> readLock and readUnlock*. So the fix ensures that _readLock 
and readUnlock_ have completely executed before `bpThread` is started.
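
As a rough illustration of that ordering (a simplified sketch, not the actual patch), the class below builds the lifeline thread name, which takes and releases the read lock, before it starts the thread that takes the write lock. `ActorStartOrderSketch`, `awaitFinished`, `lifelineThreadName`, and the placeholder IDs are hypothetical names for this example and are not Hadoop APIs.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the actor; none of these names are Hadoop APIs. */
final class ActorStartOrderSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private volatile String bpId; // null until the simulated registration completes
  private Thread bpThread;
  private Thread lifelineThread;

  /** Same shape as getBlockPoolId: lock-free fast path, otherwise the read lock. */
  String getBlockPoolId() {
    String id = bpId;
    if (id != null) {
      return id;
    }
    lock.readLock().lock();
    try {
      return bpId == null ? "<unregistered>" : bpId; // placeholder value, an assumption
    } finally {
      lock.readLock().unlock();
    }
  }

  String formatThreadName(String action) {
    return action + " for " + getBlockPoolId();
  }

  void start() {
    // Step 1: name the lifeline thread; readLock/readUnlock finish here.
    lifelineThread = new Thread(() -> { }, formatThreadName("lifeline"));
    lifelineThread.setDaemon(true);
    // Step 2: only now start the thread that will take the write lock.
    bpThread = new Thread(() -> {
      lock.writeLock().lock();
      try {
        bpId = "BP-1"; // simulated registration result
      } finally {
        lock.writeLock().unlock();
      }
    }, "bpThread");
    bpThread.setDaemon(true);
    bpThread.start();
    lifelineThread.start();
  }

  /** Waits for both threads; returns true if neither is still running. */
  boolean awaitFinished(long timeoutMillis) {
    try {
      bpThread.join(timeoutMillis);
      lifelineThread.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !bpThread.isAlive() && !lifelineThread.isAlive();
  }

  String lifelineThreadName() {
    return lifelineThread.getName();
  }

  public static void main(String[] args) {
    ActorStartOrderSketch s = new ActorStartOrderSketch();
    s.start();
    System.out.println(s.lifelineThreadName() + " finished=" + s.awaitFinished(5000));
  }
}
```

Because the name is computed before `bpThread` exists, the caller of `start()` never waits on the read lock while another thread holds the write lock.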

The test I provided reproduces the deadlock before the fix and passes 
after the fix.

Thanks [~hexiaoqiao] again.

> DataNode could meet deadlock if invoke refreshNameNode
> ------------------------------------------------------
>
>                 Key: HDFS-15641
>                 URL: https://issues.apache.org/jira/browse/HDFS-15641
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Hongbing Wang
>            Assignee: Hongbing Wang
>            Priority: Critical
>         Attachments: HDFS-15641.001.patch, HDFS-15641.002.patch, 
> deadlock.png, jstack.log
>
>
> DataNode could deadlock when `hdfs dfsadmin -refreshNamenodes 
> hostname:50020` is invoked to register a new namespace in a federation env.
> The jstack output is in jstack.log.
> The specific process is shown in the figure deadlock.png.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
