[
https://issues.apache.org/jira/browse/HDFS-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983260#comment-16983260
]
Masatake Iwasaki commented on HDFS-15018:
-----------------------------------------
DiskChecker appears to be blocked in FileDescriptor.sync in the attached stack trace.
{noformat}
"DataNode DiskChecker thread 118" #131898 daemon prio=5 os_prio=0
tid=0x00007f9e7d04e000 nid=0xe27e runnable [0x00007f9e5fea5000]
java.lang.Thread.State: RUNNABLE
at java.io.FileDescriptor.sync(Native Method)
at
org.apache.hadoop.util.DiskChecker.diskIoCheckWithoutNativeIo(DiskChecker.java:249)
at org.apache.hadoop.util.DiskChecker.doDiskIo(DiskChecker.java:220)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:82)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:339)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.check(FsVolumeImpl.java:852)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.check(FsVolumeImpl.java:84)
at
org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$1.call(ThrottledAsyncChecker.java:142)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
Since DiskChecker was improved in HADOOP-15450, updating to a version that has
the fix might help.
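For context, the check that is pinned above is essentially a write-and-sync probe
against the volume directory. A minimal sketch of that pattern (illustrative only;
the file name, write size, and cleanup are assumptions, not the actual DiskChecker source):
{code}
// Rough sketch of the write-and-sync probe visible in the stack trace
// (diskIoCheckWithoutNativeIo -> FileDescriptor.sync). Illustrative only;
// the real DiskChecker differs in details such as file naming and retries.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class DiskProbeSketch {
  static void probe(File volumeDir) throws IOException {
    File probeFile = new File(volumeDir, ".disk-probe");   // hypothetical file name
    try (FileOutputStream fos = new FileOutputStream(probeFile)) {
      fos.write(new byte[512]);   // small write to the volume
      fos.getFD().sync();         // can block indefinitely if the device hangs
    } finally {
      probeFile.delete();
    }
  }
}
{code}
On a hanging device the sync() call can stall for a long time, which matches the
RUNNABLE thread stuck in FileDescriptor.sync above.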
Volume failure with {{dfs.datanode.failed.volumes.tolerated=0}} is covered in
TestDataNodeVolumeFailureToleration#testConfigureMinValidVolumes.
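A minimal sketch of the scenario that test exercises (helper methods here are
hypothetical, not the actual TestDataNodeVolumeFailureToleration code):
{code}
// Illustrative sketch only: configure zero tolerated failures, fail a volume,
// and expect the DataNode to shut itself down. Helper methods are hypothetical.
import org.apache.hadoop.conf.Configuration;

public class VolumeToleranceSketch {
  public void expectShutdownOnSingleVolumeFailure() throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.datanode.failed.volumes.tolerated", 0); // any failure is fatal

    startDataNode(conf);       // hypothetical helper
    failOneDataVolume();       // hypothetical fault injection (e.g. remove permissions)
    assertDataNodeStopped();   // expected outcome for this configuration
  }

  private void startDataNode(Configuration conf) { /* ... */ }
  private void failOneDataVolume() { /* ... */ }
  private void assertDataNodeStopped() { /* ... */ }
}
{code}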
> DataNode doesn't shutdown although the number of failed disks reaches
> dfs.datanode.failed.volumes.tolerated
> -----------------------------------------------------------------------------------------------------------
>
> Key: HDFS-15018
> URL: https://issues.apache.org/jira/browse/HDFS-15018
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.7.3
> Environment: HDP-2.6.5
> Reporter: Toshihiro Suzuki
> Priority: Major
> Attachments: thread_dumps.txt
>
>
> In our case, we set dfs.datanode.failed.volumes.tolerated=0, but a DataNode
> didn't shut down when a disk in the DataNode host failed for some reason.
> The following log messages in the DataNode log indicate that the DataNode
> detected the disk failure but did not shut down:
> {code}
> 2019-09-17T13:15:43.262-0400 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskErrorAsync callback got 1 failed volumes: [/data2/hdfs/current]
> 2019-09-17T13:15:43.262-0400 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Removing scanner for volume /data2/hdfs (StorageID DS-329dec9d-a476-4334-9570-651a7e4d1f44)
> 2019-09-17T13:15:43.263-0400 INFO org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/data2/hdfs, DS-329dec9d-a476-4334-9570-651a7e4d1f44) exiting.
> {code}
> Looking at the HDFS code, it appears that when the DataNode detects a disk
> failure, it waits until the volume reference of the failed disk is released:
> https://github.com/hortonworks/hadoop/blob/HDP-2.6.5.0-292-tag/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java#L246
> I suspect that the volume reference is not released after the failure is
> detected, but I'm not sure why.
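> The wait is essentially a reference-count drain: the failed volume is only
> removed, and the failure fully handled, once every outstanding reference to it
> has been released. A rough sketch of that pattern (class, field, and timeout
> are hypothetical, not the actual FsVolumeList code):
> {code}
> // Illustrative sketch of the "wait until the volume reference is released"
> // pattern; the names and the timeout are assumptions.
> import java.util.concurrent.atomic.AtomicInteger;
>
> class VolumeRemovalWaitSketch {
>   private final Object checkDirsLock = new Object();
>   private final AtomicInteger volumeRefCount = new AtomicInteger();
>
>   void waitVolumeRemoved() throws InterruptedException {
>     synchronized (checkDirsLock) {
>       // A reference that is never released keeps this loop alive forever,
>       // so the volume-failure handling never completes.
>       while (volumeRefCount.get() > 0) {
>         checkDirsLock.wait(5000);   // TIMED_WAITING, matching the thread dump
>       }
>     }
>   }
> }
> {code}
> If a reference to the failed volume is never released, a loop like this never
> exits, handleVolumeFailures never returns, and the DataNode never reaches the
> point where it would shut itself down.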
> We also took thread dumps while the issue was happening. The following thread
> appears to be waiting for the volume reference of the disk to be released:
> {code}
> "pool-4-thread-1" #174 daemon prio=5 os_prio=0 tid=0x00007f9e7c7bf800
> nid=0x8325 in Object.wait() [0x00007f9e629cb000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:262)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.handleVolumeFailures(FsVolumeList.java:246)
> - locked <0x0000000670559278> (a java.lang.Object)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.handleVolumeFailures(FsDatasetImpl.java:2178)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.handleVolumeFailures(DataNode.java:3410)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode.access$100(DataNode.java:248)
> at
> org.apache.hadoop.hdfs.server.datanode.DataNode$4.call(DataNode.java:2013)
> at
> org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.invokeCallback(DatasetVolumeChecker.java:394)
> at
> org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.cleanup(DatasetVolumeChecker.java:387)
> at
> org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.onFailure(DatasetVolumeChecker.java:370)
> at com.google.common.util.concurrent.Futures$6.run(Futures.java:977)
> at
> com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
> at
> org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.executeListener(AbstractFuture.java:991)
> at
> org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.complete(AbstractFuture.java:885)
> at
> org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.setException(AbstractFuture.java:739)
> at
> org.apache.hadoop.hdfs.server.datanode.checker.TimeoutFuture$Fire.run(TimeoutFuture.java:137)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> We found a similar issue, HDFS-13339, but we didn't see any deadlock in the
> thread dumps.
> Attaching the full thread dumps of the problematic DataNode.