[
https://issues.apache.org/jira/browse/HDFS-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866400#comment-17866400
]
ASF GitHub Bot commented on HDFS-17580:
---------------------------------------
hfutatzhanghb commented on PR #6943:
URL: https://github.com/apache/hadoop/pull/6943#issuecomment-2230839467
> What happen here?
@Hexiaoqiao Sir, we hit a corner case last week where a DataNode hung because of one
abnormal NVMe SSD disk.
One thread got stuck in the stack below because of the NVMe SSD exception.
```java
"DataXceiver for client DFSClient_NONMAPREDUCE_1772448723_85 at /x.x.x.x:62528 [Receiving block BP-1169917699-x.x.x.x-1678688680604:blk_18858524775_17785113843]" #46490764 daemon prio=5 os_prio=0 tid=0x00007f79602ad800 nid=0xb692 runnable [0x00007f79239c0000]
   java.lang.Thread.State: RUNNABLE
    at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
    at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
    at java.io.File.exists(File.java:819)
    at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.exists(FileIoProvider.java:805)
    at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:62)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:389)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:946)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbw(FsVolumeImpl.java:1228)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1500)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:221)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1372)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:176)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:110)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:314)
    at java.lang.Thread.run(Thread.java:748)
```
This caused the read/write block throughput of this DataNode to drop to zero.

After diving into the code, we found that if one thread gets stuck while holding the
dataset lock (even a BP read lock), it can leave other threads waiting in the AQS queue
forever.
See the `java.util.concurrent.locks.ReentrantReadWriteLock.FairSync#readerShouldBlock`
method: with a fair lock, a new reader must queue behind any already-queued thread
(`hasQueuedPredecessors()`) instead of sharing the read lock.
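As a standalone sketch of that pattern (plain JDK `ReentrantReadWriteLock`, not the
DataNode's instrumented lock; thread names and timings here are invented for
illustration): once a stuck reader holds the read lock and a writer has parked behind
it, the fair sync parks every later reader as well.
```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Standalone sketch of the hang pattern; this is not DataNode code. */
public class FairReadLockStallSketch {

  public static void main(String[] args) throws Exception {
    // true mirrors the current dfs.datanode.lock.fair default (FairSync),
    // where readerShouldBlock() == hasQueuedPredecessors().
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // 1. A "DataXceiver" acquires the read lock and then hangs, standing in
    //    for File.exists() never returning on the bad NVMe SSD.
    Thread stuckReader = new Thread(() -> {
      lock.readLock().lock();
      try {
        TimeUnit.HOURS.sleep(1);
      } catch (InterruptedException ignored) {
        // interrupted at the end of the demo
      } finally {
        lock.readLock().unlock();
      }
    }, "stuck-reader");
    stuckReader.start();
    TimeUnit.MILLISECONDS.sleep(200);

    // 2. Any writer now parks in the AQS queue behind the stuck reader.
    Thread writer = new Thread(() -> {
      lock.writeLock().lock();
      lock.writeLock().unlock();
    }, "queued-writer");
    writer.start();
    TimeUnit.MILLISECONDS.sleep(200);

    // 3. A brand-new reader could share the read lock, but the fair sync sees
    //    a queued predecessor and parks it too, so it never gets the lock
    //    while the stuck reader is alive. tryLock with a timeout makes the
    //    effect visible without hanging the demo.
    boolean acquired = lock.readLock().tryLock(2, TimeUnit.SECONDS);
    System.out.println("new reader acquired read lock: " + acquired); // false

    stuckReader.interrupt(); // let the demo terminate
    writer.join();
  }
}
```
For contrast, the non-fair sync's `readerShouldBlock` is
`apparentlyFirstQueuedIsExclusive()`, so readers are only forced to queue when the
thread at the head of the wait queue is a writer, not behind any queued predecessor.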
> Change the default value of dfs.datanode.lock.fair to false due to potential
> hang
> ---------------------------------------------------------------------------------
>
> Key: HDFS-17580
> URL: https://issues.apache.org/jira/browse/HDFS-17580
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.4.0
> Reporter: farmmamba
> Assignee: farmmamba
> Priority: Major
> Labels: pull-request-available
>
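A side note for anyone hitting this before the default changes: the property named in
the title, `dfs.datanode.lock.fair`, can already be set to `false` explicitly. Below is
a minimal sketch with the Hadoop `Configuration` API, assuming only the key and the
current default implied by this issue; on a real cluster the value would go into
hdfs-site.xml and the DataNodes would be restarted.
```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: the property name comes from the issue title; the programmatic
// form is shown just to name the key and its current default.
public class LockFairnessOverride {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The issue proposes flipping the shipped default; until then a cluster
    // can opt out of the fair sync explicitly.
    conf.setBoolean("dfs.datanode.lock.fair", false);
    System.out.println(conf.getBoolean("dfs.datanode.lock.fair", true)); // prints false
  }
}
```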
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]