[
https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167154#comment-14167154
]
Ming Ma commented on HDFS-7208:
-------------------------------
Thanks, Daryn. We can do #3, but I want to lay out the approaches in the
following way, in part to clarify the design of heterogeneous storage.
[~arpitagarwal] and others might have more input here. Note that
dfs.datanode.failed.volumes.tolerated > 0 is assumed throughout this discussion.
1. Have the DN eventually deliver the failed-storage notification. Prior to
heterogeneous storage, the NN detected missing replicas on a failed storage via
BR. So if we use BR to report failed storage, we are on par in terms of
time-to-detect metrics. However, we have to make sure the DN eventually
delivers the failed-storage notification in all cases. Hot swap is one
scenario. Here is another: a) a storage fails; b) the DN restarts prior to the
next BR; c) the DN can't send a BR for that storage after restart because it
excluded the failed storage during startup. To address this, we can persist
the storage IDs that the DN still needs to send BRs for, perhaps on other
healthy storages (see the first sketch after this list).
2. Have the DN deliver the failed-storage notification in a timely fashion.
The DN already provides StorageReports via HB, so the NN could detect a failed
storage much faster; this would greatly improve the time-to-detect metrics.
But it requires the HB handling to take the FSNS write lock. We can make the
processing async so the HB path doesn't need the FSNS write lock (see the
second sketch after this list). That can be done in a separate jira.
3. Time out on the DN storage notification. Similar to how the NN uses HBs to
manage DNs, we can keep a per-storage heartbeat, with some maximum timeout for
the notification of any given storage (see the third sketch after this list).
But if the design of heterogeneous storage is meant to allow a DN to use
different BR intervals for different storages, the BR interval for a given
storage could potentially be much larger.
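For #1, here is a minimal sketch of what persisting the pending failed-storage
IDs could look like. This is not actual HDFS code; the class name, file name
and layout are made up for illustration only.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: record failed storage IDs on a healthy volume so the DN
// can still report them to the NN after a restart, even though the failed
// volume itself is no longer loaded.
public class PendingFailedStorageLog {
  private final Path logFile;  // e.g. <healthyVolume>/current/failed_storages

  public PendingFailedStorageLog(Path healthyVolumeDir) {
    this.logFile = healthyVolumeDir.resolve("failed_storages");
  }

  /** Record a failed storage ID so its report survives a DN restart. */
  public synchronized void add(String storageId) throws IOException {
    Set<String> ids = load();
    if (ids.add(storageId)) {
      Files.write(logFile, ids, StandardCharsets.UTF_8);
    }
  }

  /** Storage IDs the DN must still report as failed after (re)start. */
  public synchronized Set<String> load() throws IOException {
    if (!Files.exists(logFile)) {
      return new HashSet<>();
    }
    return new HashSet<>(Files.readAllLines(logFile, StandardCharsets.UTF_8));
  }

  /** Clear an entry once the NN has acknowledged the failed storage. */
  public synchronized void remove(String storageId) throws IOException {
    Set<String> ids = load();
    if (ids.remove(storageId)) {
      Files.write(logFile, ids, StandardCharsets.UTF_8);
    }
  }
}
{code}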
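For #2, a rough sketch of the async idea: the HB handler only enqueues the
failed-storage report, and a background thread does the heavy work (which in
the real NN would briefly take the FSNS write lock to remove the replicas and
let replication get scheduled). All class and method names below are
illustrative, not the real HDFS ones.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical processor: heartbeat handling stays lock-free and fast; the
// actual BlocksMap cleanup happens later on a dedicated thread.
public class FailedStorageProcessor implements Runnable {
  public static final class FailedStorage {
    final String datanodeUuid;
    final String storageId;
    FailedStorage(String datanodeUuid, String storageId) {
      this.datanodeUuid = datanodeUuid;
      this.storageId = storageId;
    }
  }

  private final BlockingQueue<FailedStorage> queue = new LinkedBlockingQueue<>();

  /** Called from HB handling; does not take the namesystem write lock. */
  public void enqueue(String datanodeUuid, String storageId) {
    queue.offer(new FailedStorage(datanodeUuid, storageId));
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        processFailedStorage(queue.take());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  /** Placeholder for: take the FSNS write lock, remove the storage's replicas
   *  from the BlocksMap, and let the replication monitor schedule new copies. */
  private void processFailedStorage(FailedStorage fs) {
    System.out.println("Would clean up storage " + fs.storageId
        + " on DN " + fs.datanodeUuid);
  }
}
{code}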
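For #3, the per-storage timeout could look roughly like the following. The
timeout value and the bookkeeping are illustrative only; what actually happens
on expiry (mark the storage failed, remove its replicas, schedule replication)
would be the NN's existing machinery.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical monitor: track the last time each storage was reported healthy
// (via HB StorageReport or BR) and flag storages that exceed the timeout.
public class StorageTimeoutMonitor {
  private final long timeoutMs;
  private final Map<String, Long> lastReportTime = new ConcurrentHashMap<>();

  public StorageTimeoutMonitor(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Update on every HB/BR that includes this storage as healthy. */
  public void recordReport(String storageId) {
    lastReportTime.put(storageId, System.currentTimeMillis());
  }

  /** Periodic scan; storages not heard from within the timeout are stale. */
  public void checkStorages() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastReportTime.entrySet()) {
      if (now - e.getValue() > timeoutMs) {
        // Here the NN would treat the storage as failed so that replication
        // for its blocks gets scheduled.
        System.out.println("Storage " + e.getKey() + " timed out");
      }
    }
  }

  public static void main(String[] args) {
    StorageTimeoutMonitor m =
        new StorageTimeoutMonitor(TimeUnit.MINUTES.toMillis(10));
    m.recordReport("DS-example-1234");
    m.checkStorages();
  }
}
{code}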
> NN doesn't schedule replication when a DN storage fails
> -------------------------------------------------------
>
> Key: HDFS-7208
> URL: https://issues.apache.org/jira/browse/HDFS-7208
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
>
> We found the following problem. When a storage device on a DN fails, the NN
> continues to believe the replicas on that storage are valid and doesn't
> schedule replication.
> A DN has 12 storage disks, so there is one blockReport for each storage. When
> a disk fails, the # of blockReports from that DN is reduced from 12 to 11.
> Given dfs.datanode.failed.volumes.tolerated is configured to be > 0, the NN
> still considers that DN healthy.
> 1. A disk fails. All blocks of that disk are removed from the DN dataset.
>
> {noformat}
> 2014-10-04 02:11:12,626 WARN
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing
> replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume
> /data/disk6/dfs/current
> {noformat}
> 2. The NN receives DatanodeProtocol.DISK_ERROR. But that isn't enough to have
> the NN remove the DN and the replicas from the BlocksMap. In addition, the
> blockReport doesn't provide the diff, given that it is done per storage.
> {noformat}
> 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode:
> Disk error on DatanodeRegistration(xx.xx.xx.xxx,
> datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075,
> ipcPort=50020,
> storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939):
> DataNode failed volumes:/data/disk6/dfs/current
> {noformat}
> 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)