hfutatzhanghb commented on PR #6943:
URL: https://github.com/apache/hadoop/pull/6943#issuecomment-2230839467
> What happen here?
@Hexiaoqiao Sir, last week we hit a corner case where a DataNode hung because of
one abnormal NVMe SSD disk.
One thread got stuck in the stack below because of the NVMe SSD exception.
```java
"DataXceiver for client DFSClient_NONMAPREDUCE_1772448723_85 at /x.x.x.x:62528 [Receiving block BP-1169917699-x.x.x.x-1678688680604:blk_18858524775_17785113843]" #46490764 daemon prio=5 os_prio=0 tid=0x00007f79602ad800 nid=0xb692 runnable [0x00007f79239c0000]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
        at java.io.File.exists(File.java:819)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.exists(FileIoProvider.java:805)
        at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:62)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:389)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:946)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbw(FsVolumeImpl.java:1228)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1500)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:221)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1372)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:176)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:110)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:314)
        at java.lang.Thread.run(Thread.java:748)
```
This caused the throughput of block reads/writes on this DataNode to drop to
zero.

After diving into the code, we found that if one thread gets stuck while holding
the dataset lock (even just the BP read lock), it can cause all other threads to
wait forever in the AQS queue when trying to acquire the lock.
See the
`java.util.concurrent.locks.ReentrantReadWriteLock.FairSync#readerShouldBlock`
method for the relevant fairness check.
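A minimal sketch of the starvation mechanism (not the actual DataNode lock code): with a fair `ReentrantReadWriteLock`, once a writer is queued behind a stuck reader, `readerShouldBlock()` makes every subsequent read-lock attempt queue behind that writer too, even though only a read lock is currently held. The class and thread names below are illustrative.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairRwLockStarvationDemo {
    public static void main(String[] args) throws Exception {
        // Fair mode, matching the scenario described above.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
        CountDownLatch readerHolding = new CountDownLatch(1);

        // Simulates the DataXceiver stuck in a disk syscall
        // while holding the BP read lock.
        Thread stuckReader = new Thread(() -> {
            lock.readLock().lock();
            readerHolding.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
            }
            lock.readLock().unlock();
        });
        stuckReader.setDaemon(true);
        stuckReader.start();
        readerHolding.await();

        // A writer enqueues in the AQS behind the stuck reader.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });
        writer.setDaemon(true);
        writer.start();
        Thread.sleep(500); // give the writer time to park in the queue

        // A second reader now blocks too: FairSync#readerShouldBlock()
        // returns true because a thread is queued ahead of it, so the
        // timed tryLock (which honors fairness) times out.
        boolean acquired = lock.readLock().tryLock(1, TimeUnit.SECONDS);
        System.out.println("second reader acquired read lock: " + acquired);
    }
}
```

Running this prints that the second reader could not acquire the read lock, mirroring how every new DataXceiver thread piled up behind the one stuck on the bad disk.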
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]