[ https://issues.apache.org/jira/browse/HDFS-10887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523305#comment-15523305 ]

Kihwal Lee commented on HDFS-10887:
-----------------------------------

Missing blocks during start-up are caused by datanodes not sending full block 
reports (FBRs). They might be dead, having trouble registering, stuck in 
initialization, or hitting some other bug or error condition.  The key is to 
find those nodes quickly.  By simply dumping the blocksmap, you will find 
which blocks are affected, but where do you go from there?  How do you 
identify the datanodes that are not reporting?

bq. May I know how you usually look at to see if a full block is received from 
a DN, and how to see if an incremental report is received from a DN?
I visit the NN webUI to see how many blocks each datanode has. In the old 
days, when each datanode was a single storage, it was easy to tell. Even 
today, it is rare that a datanode sends FBRs for only a subset of its live 
storages, so the per-node block count is a good indicator.  But before 
checking block reports, I first look at dead nodes.  During the start-up safe 
mode, IBRs are irrelevant. (They do matter to a standby, but that's a whole 
different discussion. :)
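The per-node block count check above can be sketched the same way. Again a sketch, not a tool: the {{LiveNodes}} attribute and its {{numBlocks}} field come from the same NameNodeInfo bean, but the sample numbers and the flagging threshold are assumptions:

```python
import json

# LiveNodes from the NameNodeInfo MXBean, as a JSON-encoded string;
# hypothetical sample data standing in for a real /jmx fetch.
sample_live_nodes = json.dumps({
    "dn1.example.com": {"numBlocks": 1200345, "lastContact": 1},
    "dn2.example.com": {"numBlocks": 1198990, "lastContact": 2},
    "dn3.example.com": {"numBlocks": 57, "lastContact": 1},  # suspicious
})

def suspect_missing_fbr(live_nodes_json, ratio=0.1):
    """Flag live nodes reporting far fewer blocks than the cluster median,
    a hint that their full block report never arrived."""
    nodes = json.loads(live_nodes_json)
    counts = sorted(info["numBlocks"] for info in nodes.values())
    median = counts[len(counts) // 2]
    return [name for name, info in nodes.items()
            if info["numBlocks"] < median * ratio]

print(suspect_missing_fbr(sample_live_nodes))  # ['dn3.example.com']
```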

bq. One concern of forcing NN out of safemode too early is, if client starts 
reading blocks that are missing, client will get missing block error instead of 
safemode exception, which may be handled differently at client side. Right?
This is often the last resort when the problematic datanode cannot be 
identified quickly.  On a cluster with hundreds of millions of blocks, 
restoring service in time is more important than blocking everyone until 
several missing blocks are fixed.  But with HA, we now rarely have 
availability problems.  Our rolling upgrade script does a number of sanity 
checks before failing over, so issues are reported and fixed before they 
affect the service. The most common case is datanodes talking to only one 
namenode for various reasons. Many such bugs have been fixed and we don't see 
them often nowadays.
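One of those sanity checks, catching datanodes registered with only one of the two namenodes, can be sketched by diffing the {{LiveNodes}} sets from each NN's /jmx endpoint. A sketch under the assumption that both payloads have already been fetched; the hostnames are made up:

```python
import json

# LiveNodes JSON from each namenode's NameNodeInfo bean
# (sample payloads; in practice fetch both over HTTP).
nn1_live = json.dumps({"dn1.example.com": {}, "dn2.example.com": {}})
nn2_live = json.dumps({"dn1.example.com": {}})

def one_sided_datanodes(live_a_json, live_b_json):
    """Datanodes registered with exactly one of the two namenodes."""
    a = set(json.loads(live_a_json))
    b = set(json.loads(live_b_json))
    return {"only_nn1": sorted(a - b), "only_nn2": sorted(b - a)}

print(one_sided_datanodes(nn1_live, nn2_live))
# {'only_nn1': ['dn2.example.com'], 'only_nn2': []}
```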

> Provide admin/debug tool to dump block map
> ------------------------------------------
>
>                 Key: HDFS-10887
>                 URL: https://issues.apache.org/jira/browse/HDFS-10887
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs, namenode
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-10887.001.patch, HDFS-10887.002.patch
>
>
> From time to time, when NN restarts, we see
> {code}
> "The reported blocks X needs additional Y blocks to reach the threshold 
> 0.9990 of total blocks Z. Safe mode will be turned off automatically."
> {code}
> We'd wonder what these blocks that still need block reports are, on what 
> DNs they could possibly be located, and what happened to those DNs.
> This jira is to propose a new admin or debug tool to dump the block map 
> info for the blocks that have fewer than minRepl replicas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
