hfutatzhanghb commented on PR #6943:
URL: https://github.com/apache/hadoop/pull/6943#issuecomment-2230839467

   > What happen here?
   
   @Hexiaoqiao Sir, we hit a corner case last week where a DataNode hung because of one abnormal NVMe SSD disk.
   
   One thread got stuck in the stack below because of the NVMe SSD exception.
   
   ```java
   "DataXceiver for client DFSClient_NONMAPREDUCE_1772448723_85 at 
/x.x.x.x:62528 [Receiving block 
BP-1169917699-x.x.x.x-1678688680604:blk_18858524775_17785113843]" #46490764 
daemon prio=5 os_prio=0 tid=0x00007f79602ad800 nid=0xb692
    runnable [0x00007f79239c0000]
      java.lang.Thread.State: RUNNABLE
           at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
           at 
java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
           at java.io.File.exists(File.java:819)
           at 
org.apache.hadoop.hdfs.server.datanode.FileIoProvider.exists(FileIoProvider.java:805)
           at 
org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:62)
           at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:389)
           at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:946)
           at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbw(FsVolumeImpl.java:1228)
           at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1500)
           at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:221)
           at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1372)
           at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
           at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:176)
           at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:110)
           at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:314)
           at java.lang.Thread.run(Thread.java:748)
   ```
   
   This caused the read/write block throughput towards this DataNode to drop to zero:
   
   
![image](https://github.com/user-attachments/assets/60b601aa-0b48-4bbf-86e8-85d435c0f06d)
   
   After diving into the code, we found that if one thread gets stuck while holding the dataset lock (even a BP read lock), it may cause other threads to wait forever in the AQS queue to acquire the lock.
   
   We can refer to the `java.util.concurrent.locks.ReentrantReadWriteLock.FairSync#readerShouldBlock` method.
   

