[ 
https://issues.apache.org/jira/browse/HDFS-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513845#comment-15513845
 ] 

Yongjun Zhang commented on HDFS-10887:
--------------------------------------

Hi [~kihwal],

Thanks a lot for your input; the info is very helpful!

Some questions:
{quote}
1) If include/exclude file is being used, we can tell whether all nodes have 
registered and heartbeated. This gives list of dead nodes, which should have 
been in service.
{quote}
I agree this is useful data. By default the NN waits 10.5 minutes (2 x the 
5-minute heartbeat recheck interval + 10 x the 3-second heartbeat interval) 
before declaring a DN dead. Before that point, if we want to know which DNs are 
lagging, what I was thinking was: once we know which blocks have fewer than 
minRepl replicas, we can search all DNs' block files for these blocks, to see 
which DNs hold them and whether anything abnormal is going on there.
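
To make the DN-side search concrete, here is a rough sketch of what I have in mind, not a finished tool. It assumes the script runs locally on each DN against one of its data directories, and relies only on the on-disk naming of replica files as blk_<blockId> (with a companion blk_<blockId>_<genStamp>.meta file). The function name and the example path are hypothetical.

```python
import os
import re

# Replica data files are named blk_<blockId>; the checksum file next to them
# is blk_<blockId>_<genStamp>.meta and is deliberately not matched here.
# Legacy (randomly generated) block IDs can be negative.
BLOCK_FILE = re.compile(r"^blk_(-?\d+)$")


def find_block_files(data_dir, wanted_ids):
    """Return {block_id: [paths]} for the wanted blocks found under data_dir."""
    wanted = set(wanted_ids)
    found = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            match = BLOCK_FILE.match(name)
            if match is None:
                continue
            block_id = int(match.group(1))
            if block_id in wanted:
                found.setdefault(block_id, []).append(os.path.join(root, name))
    return found
```

For example, find_block_files("/data/1/dfs/dn", [1073741825]) (hypothetical path and block ID) would list where that replica sits on the local DN, if anywhere. Blocks missing from the result on every DN are the ones worth a closer look.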

{quote}
2) For the nodes that have heartbeated, we will be able to tell whether a block 
report was received for all storage volumes.
{quote}
May I know what you usually look at to see whether a full block report has been 
received from a DN, and how to tell whether an incremental report has been 
received from a DN?

{quote}
 I would force the namenode out of safe mode. That causes replication queue 
initialization and will show missing blocks.
{quote}
This is helpful. One concern with forcing the NN out of safemode too early is 
that, if a client starts reading blocks that are missing, it will get a missing 
block error instead of a safemode exception, and the two may be handled 
differently on the client side. Right?

Thanks.





> Provide admin/debug tool to dump block map
> ------------------------------------------
>
>                 Key: HDFS-10887
>                 URL: https://issues.apache.org/jira/browse/HDFS-10887
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs, namenode
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10887.001.patch
>
>
> From time to time, when NN restarts, we see
> {code}
> The reported blocks X needs additional Y blocks to reach the threshold 
> 0.9990 of total blocks Z. Safe mode will be turned off automatically.
> {code}
> We'd wonder what these blocks that still need block reports are, on which DNs 
> they could be located, and what happened to those DNs.
> This jira is to propose a new admin or debug tool to dump the block map info 
> for the blocks that have fewer than minRepl replicas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
