[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails

Tsz Wo Nicholas Sze (JIRA) Wed, 15 Oct 2014 14:05:46 -0700

    [ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172940#comment-14172940 ]


Tsz Wo Nicholas Sze commented on HDFS-7208:
-------------------------------------------

> The latest patch addresses all your comments, except for the allAlive one. 
> The reason is the patch handles deadnode separately from the failedStorage.

We need to change allAlive.  Otherwise, the while loop won't work correctly 
when there are only failed storages and no dead datanode.  Of course, we also 
need to update the if-condition for the dead datanode.  Here is my suggestion:
{code}
    while (!allAlive) {
      ...
      allAlive = dead == null && failedStorage == null;
      if (dead != null) {
        ...
      }
      ...
    }
{code}
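
To illustrate the suggestion, here is a minimal, self-contained sketch. It is a 
simplified model, not the actual NN code: the class HeartbeatCheckSketch, the 
Node type, and the logged action strings are hypothetical names made up for 
this example, and it assumes each pass handles at most one dead datanode and 
one failed storage.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Simplified, hypothetical model of the check loop; not the real HDFS classes. */
public class HeartbeatCheckSketch {

  /** Hypothetical stand-in for a datanode and its failed storages. */
  static final class Node {
    final String name;
    final boolean dead;
    final Deque<String> failedStorages = new ArrayDeque<>();

    Node(String name, boolean dead, String... failed) {
      this.name = name;
      this.dead = dead;
      this.failedStorages.addAll(Arrays.asList(failed));
    }
  }

  /** Runs the loop and returns the actions taken, in order. */
  static List<String> heartbeatCheck(List<Node> nodes) {
    List<Node> live = new ArrayList<>(nodes); // do not mutate the caller's list
    List<String> actions = new ArrayList<>();
    boolean allAlive = false;
    while (!allAlive) {
      // Find at most one dead node and one failed storage per pass.
      Node dead = null;
      Node failedOwner = null;
      String failedStorage = null;
      for (Node n : live) {
        if (dead == null && n.dead) {
          dead = n;
        }
        if (failedStorage == null && !n.dead && !n.failedStorages.isEmpty()) {
          failedOwner = n;
          failedStorage = n.failedStorages.peek();
        }
      }
      // Keep looping while either kind of failure is still pending.
      allAlive = dead == null && failedStorage == null;
      if (dead != null) {
        actions.add("remove dead node " + dead.name);
        live.remove(dead);
      }
      if (failedStorage != null) {
        actions.add("remove blocks on " + failedOwner.name + ":" + failedStorage);
        failedOwner.failedStorages.poll();
      }
    }
    return actions;
  }

  public static void main(String[] args) {
    List<Node> nodes = Arrays.asList(
        new Node("dn1", false, "disk6", "disk7"),
        new Node("dn2", true));
    System.out.println(heartbeatCheck(nodes));
  }
}
```

In this sketch, if allAlive were computed as dead == null only, a run with no 
dead datanode would stop after the first pass and leave dn1's second failed 
storage unprocessed.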

We should also call namesystem.checkSafeMode() in removeBlocksAssociatedTo(..).

> NN doesn't schedule replication when a DN storage fails
> -------------------------------------------------------
>
>                 Key: HDFS-7208
>                 URL: https://issues.apache.org/jira/browse/HDFS-7208
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7208-2.patch, HDFS-7208.patch
>
>
> We found the following problem. When a storage device on a DN fails, the NN 
> continues to believe the replicas of the blocks on that storage are valid and 
> doesn't schedule replication.
> A DN has 12 storage disks, so there is one blockReport for each storage. When 
> a disk fails, the number of blockReports from that DN is reduced from 12 to 
> 11. Given that dfs.datanode.failed.volumes.tolerated is configured to be > 0, 
> the NN still considers that DN healthy.
> 1. A disk failed. All blocks of that disk are removed from DN dataset.
>  
> {noformat}
> 2014-10-04 02:11:12,626 WARN 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing 
> replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume 
> /data/disk6/dfs/current
> {noformat}
> 2. The NN receives DatanodeProtocol.DISK_ERROR, but that isn't enough to have 
> the NN remove the DN and the replicas from the BlocksMap. In addition, the 
> blockReport doesn't provide the diff, given that it is done per storage.
> {noformat}
> 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Disk error on DatanodeRegistration(xx.xx.xx.xxx, 
> datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, 
> ipcPort=50020, 
> storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939):
>  DataNode failed volumes:/data/disk6/dfs/current
> {noformat}
> 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
