[ https://issues.apache.org/jira/browse/HDFS-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Takanobu Asanuma resolved HDFS-15018.
-------------------------------------
    Resolution: Duplicate

> DataNode doesn't shutdown although the number of failed disks reaches dfs.datanode.failed.volumes.tolerated
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15018
>                 URL: https://issues.apache.org/jira/browse/HDFS-15018
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.3
>         Environment: HDP-2.6.5
>            Reporter: Toshihiro Suzuki
>            Priority: Major
>         Attachments: thread_dumps.txt
>
> In our case, we set dfs.datanode.failed.volumes.tolerated=0, but a DataNode didn't shut down when a disk in the DataNode host failed for some reason.
> The following log messages appeared in the DataNode log, indicating that the DataNode detected the disk failure but didn't shut down:
> {code}
> 2019-09-17T13:15:43.262-0400 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskErrorAsync callback got 1 failed volumes: [/data2/hdfs/current]
> 2019-09-17T13:15:43.262-0400 INFO org.apache.hadoop.hdfs.server.datanode.BlockScanner: Removing scanner for volume /data2/hdfs (StorageID DS-329dec9d-a476-4334-9570-651a7e4d1f44)
> 2019-09-17T13:15:43.263-0400 INFO org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/data2/hdfs, DS-329dec9d-a476-4334-9570-651a7e4d1f44) exiting.
> {code}
> Looking at the HDFS code, it appears that when the DataNode detects a disk failure, it waits until the volume reference of the failed disk is released:
> https://github.com/hortonworks/hadoop/blob/HDP-2.6.5.0-292-tag/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java#L246
> I suspect that the volume reference is not released after the failure is detected, but I'm not sure of the reason.
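> For reference, the waiting logic at the line linked above is essentially a timed monitor-wait loop. Here is a minimal sketch of that pattern (simplified, with illustrative names such as checkVolumesRemoved(); not the exact Hadoop source):
> {code}
> // Simplified sketch of the FsVolumeList.waitVolumeRemoved() pattern.
> // checkVolumesRemoved() stands in for "have the references of all
> // failed volumes been released?" -- illustrative, not the real signature.
> void waitVolumeRemoved(int sleepMillis, Object monitor) {
>   synchronized (monitor) {
>     while (!checkVolumesRemoved()) {
>       try {
>         // Sleep until notified (or the timeout elapses), then re-check.
>         // If some holder never releases its volume reference, this loop
>         // never exits and the DataNode never proceeds to shut down,
>         // matching the TIMED_WAITING thread in the dump below.
>         monitor.wait(sleepMillis);
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         return;
>       }
>     }
>   }
> }
> {code}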
> We took thread dumps while the issue was happening. It looks like the following thread is waiting for the volume reference of the disk to be released:
> {code}
> "pool-4-thread-1" #174 daemon prio=5 os_prio=0 tid=0x00007f9e7c7bf800 nid=0x8325 in Object.wait() [0x00007f9e629cb000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.waitVolumeRemoved(FsVolumeList.java:262)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.handleVolumeFailures(FsVolumeList.java:246)
>         - locked <0x0000000670559278> (a java.lang.Object)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.handleVolumeFailures(FsDatasetImpl.java:2178)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.handleVolumeFailures(DataNode.java:3410)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$100(DataNode.java:248)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$4.call(DataNode.java:2013)
>         at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.invokeCallback(DatasetVolumeChecker.java:394)
>         at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.cleanup(DatasetVolumeChecker.java:387)
>         at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker$ResultHandler.onFailure(DatasetVolumeChecker.java:370)
>         at com.google.common.util.concurrent.Futures$6.run(Futures.java:977)
>         at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
>         at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.executeListener(AbstractFuture.java:991)
>         at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.complete(AbstractFuture.java:885)
>         at org.apache.hadoop.hdfs.server.datanode.checker.AbstractFuture.setException(AbstractFuture.java:739)
>         at org.apache.hadoop.hdfs.server.datanode.checker.TimeoutFuture$Fire.run(TimeoutFuture.java:137)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> We found a similar issue, HDFS-13339, but we didn't see any deadlock in our thread dump.
> Attaching the full thread dumps of the problematic DataNode.
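> For context on the suspected leak: volume references in the DataNode are reference-counted handles that holders are expected to release (FsVolumeReference is Closeable), typically via try-with-resources. A single code path that obtains a reference and skips the close would keep the wait loop above spinning forever. A self-contained toy model of that contract (hypothetical classes, not the Hadoop source):
> {code}
> import java.io.Closeable;
> import java.io.IOException;
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Toy model of a reference-counted volume; illustrative only.
> class ToyVolume {
>   private final AtomicInteger refCount = new AtomicInteger();
>
>   Closeable obtainReference() {
>     refCount.incrementAndGet();
>     return () -> {
>       if (refCount.decrementAndGet() == 0) {
>         synchronized (this) {
>           notifyAll(); // wake any thread waiting for removal
>         }
>       }
>     };
>   }
>
>   boolean removed() {
>     return refCount.get() == 0;
>   }
>
>   // try-with-resources guarantees release even on error paths; a path
>   // that obtains a reference without this pattern is the kind of leak
>   // suspected in this issue.
>   void readBlock() throws IOException {
>     try (Closeable ref = obtainReference()) {
>       // ... do I/O against the volume ...
>     }
>   }
> }
> {code}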