[ 
https://issues.apache.org/jira/browse/HDFS-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866400#comment-17866400
 ] 

ASF GitHub Bot commented on HDFS-17580:
---------------------------------------

hfutatzhanghb commented on PR #6943:
URL: https://github.com/apache/hadoop/pull/6943#issuecomment-2230839467

   > What happen here?
   
   @Hexiaoqiao Sir, we hit a corner case last week where a datanode hung because of one abnormal NVMe SSD disk.
   
   One thread got stuck in the stack below because of the NVMe SSD exception:
   
   ```java
   "DataXceiver for client DFSClient_NONMAPREDUCE_1772448723_85 at /x.x.x.x:62528 [Receiving block BP-1169917699-x.x.x.x-1678688680604:blk_18858524775_17785113843]" #46490764 daemon prio=5 os_prio=0 tid=0x00007f79602ad800 nid=0xb692 runnable [0x00007f79239c0000]
      java.lang.Thread.State: RUNNABLE
           at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
           at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
           at java.io.File.exists(File.java:819)
           at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.exists(FileIoProvider.java:805)
           at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:62)
           at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:389)
           at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:946)
           at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbw(FsVolumeImpl.java:1228)
           at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1500)
           at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:221)
           at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1372)
           at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
           at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:176)
           at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:110)
           at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:314)
           at java.lang.Thread.run(Thread.java:748)
   ```
   
   This caused the throughput of block reads and writes on this datanode to drop to zero:
   
   
![image](https://github.com/user-attachments/assets/60b601aa-0b48-4bbf-86e8-85d435c0f06d)
   
   After diving into the code, we found that if one thread gets stuck while holding the dataset lock (even a BP read lock), it may cause other threads to wait forever to acquire the lock in AQS.
   
   We can refer to the `java.util.concurrent.locks.ReentrantReadWriteLock.FairSync#readerShouldBlock` method.
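   
   The effect can be reproduced outside of HDFS with a plain `ReentrantReadWriteLock` in fair mode. The sketch below is only an illustration (the class name, thread names, and sleeps are made up, not DataNode code): one reader never releases the read lock, a writer queues behind it, and a later reader then parks in the AQS queue even though only a read lock is held.
   
   ```java
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.locks.ReentrantReadWriteLock;
   
   public class FairReadLockDemo {
       public static void main(String[] args) throws Exception {
           // fair = true corresponds to dfs.datanode.lock.fair = true
           ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
   
           // Reader that never releases the read lock, simulating a DataXceiver
           // stuck in File.exists() on a broken NVMe SSD while holding the lock.
           Thread stuckReader = new Thread(() -> {
               lock.readLock().lock();
               try {
                   Thread.sleep(Long.MAX_VALUE);
               } catch (InterruptedException ignored) {
               } finally {
                   lock.readLock().unlock();
               }
           }, "stuck-reader");
           stuckReader.setDaemon(true);
           stuckReader.start();
           Thread.sleep(200);
   
           // A writer arrives and parks in the AQS queue behind the stuck reader.
           Thread writer = new Thread(() -> {
               lock.writeLock().lock();
               lock.writeLock().unlock();
           }, "queued-writer");
           writer.setDaemon(true);
           writer.start();
           Thread.sleep(200);
   
           // A new reader now cannot get the read lock: in fair mode,
           // FairSync#readerShouldBlock sees the queued writer ahead of it,
           // so the timed tryLock waits and then times out.
           boolean acquired = lock.readLock().tryLock(1, TimeUnit.SECONDS);
           System.out.println("new reader acquired read lock: " + acquired); // false
       }
   }
   ```
   
   The timed `tryLock` is used on purpose: the untimed `tryLock()` barges regardless of the fairness setting and would hide the queueing behaviour.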
   




> Change the default value of dfs.datanode.lock.fair to false due to potential 
> hang
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-17580
>                 URL: https://issues.apache.org/jira/browse/HDFS-17580
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.4.0
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
