[ 
https://issues.apache.org/jira/browse/HDFS-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513408#comment-15513408
 ] 

Kihwal Lee commented on HDFS-10887:
-----------------------------------

Perhaps a less expensive diagnostic would be to look at registration and block 
reports.
1) If an include/exclude file is being used, we can tell whether all nodes have 
registered and heartbeated.  This gives the list of dead nodes that should have 
been in service. It does not necessarily mean blocks are missing or 
under-replicated, but it is still a useful data point.  Also, once the nodes 
heartbeat, we know the number of storages on each node.
2) For the nodes that have heartbeated, we will be able to tell whether a block 
report was received for all storage volumes.

It is far less expensive to get the list of nodes/storages for which no block 
report has been received. Combined with the dead-node list, admins will have a 
good idea of which node/storage to look at.  Dumping the blocksmap can be 
useful, but it only tells you what is missing. It doesn't tell you the 
potential source of the problem.  If I were operating a cluster, I would first 
use the hypothetical tool I described above.  If that did not resolve the 
issue, I would force the namenode out of safe mode. That causes replication 
queue initialization and will show the missing blocks.
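
To make the two checks above concrete, here is a minimal sketch of the 
hypothetical tool. It is not existing HDFS code: NodeStatus, diagnose, and the 
sample host and storage names are made up for illustration, and it assumes the 
include list, the heartbeat state, and the per-storage block report state have 
already been extracted from the namenode somehow.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Hypothetical per-datanode view (an assumption, not an HDFS class)."""
    # Storage IDs learned from the node's heartbeats.
    storages: set = field(default_factory=set)
    # Storage IDs for which a block report has been received.
    reported: set = field(default_factory=set)


def diagnose(include_hosts, heartbeated):
    """Return (dead_nodes, storages_without_block_report).

    include_hosts: hosts listed in the include file.
    heartbeated:   dict mapping host -> NodeStatus for nodes that have
                   registered and heartbeated.
    """
    # Check 1: nodes that should be in service but have not heartbeated.
    dead = sorted(h for h in include_hosts if h not in heartbeated)
    # Check 2: storages on live nodes that have not sent a block report.
    missing = sorted(
        (host, storage_id)
        for host, status in heartbeated.items()
        for storage_id in status.storages - status.reported
    )
    return dead, missing


# Made-up sample data: dn3 never heartbeated, dn2 is missing one report.
include_hosts = ["dn1", "dn2", "dn3"]
heartbeated = {
    "dn1": NodeStatus({"s1", "s2"}, {"s1", "s2"}),
    "dn2": NodeStatus({"s1", "s2"}, {"s1"}),
}
dead, missing = diagnose(include_hosts, heartbeated)
print("dead nodes:", dead)
print("storages with no block report:", missing)
```

Both lists are cheap to compute because they only touch per-node and 
per-storage state, not the blocksmap itself.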

> Provide admin/debug tool to dump block map
> ------------------------------------------
>
>                 Key: HDFS-10887
>                 URL: https://issues.apache.org/jira/browse/HDFS-10887
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs, namenode
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10887.001.patch
>
>
> From time to time, when NN restarts, we see
> {code}
> "The reported blocks X needs additional Y blocks to reach the threshold 
> 0.9990 of total blocks Z. Safe mode will be turned off automatically.
> {code}
> We'd wonder which blocks still need block reports, which DNs they could 
> possibly be located on, and what happened to these DNs.
> This jira is to propose a new admin or debug tool to dump the block map info 
> for the blocks that have fewer than minRepl replicas.



