[ 
https://issues.apache.org/jira/browse/HADOOP-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613759#action_12613759
 ] 

Chris Douglas commented on HADOOP-3323:
---------------------------------------

After some discussion, it's become clear that this may be completed in two 
parts:

# A brief health check the namenode can perform itself
# A metrics-based solution tracking namenode throughput over time, capable of 
inferring more complex and nuanced signs of desperation

Work on (2) will fall out of a generalized metrics reporting and alerting 
mechanism to be completed in concert with HADOOP-3719. The particular set of 
metrics and implementation will remain in this JIRA. Specifically, the 
implementation will likely correlate the size of the replication queue 
(FSNamesystemMetrics::pendingReplicationBlocks) with Datanode metrics tracking 
replicated blocks (DataNodeMetrics::blocksReplicated) aggregated across the 
cluster. The intent would be to track replication throughput, presuming that 
slow replication at the datanodes, a slow-draining replication queue, and low 
storage capacity would accurately capture the conditions called out here.
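
To make the correlation above concrete, here is a rough sketch of one way the 
two metrics might be compared between samples. Every name in it 
(ReplicationTrend, Sample, isStruggling, minBlocksPerSecond) is hypothetical 
and not part of Hadoop, and it assumes blocksReplicated is collected as a 
cumulative count already summed across the datanodes. Low storage capacity is 
not modeled.

```java
import java.util.List;

/** Hypothetical sketch; none of these names exist in Hadoop. */
final class ReplicationTrend {

    /** One cluster-wide sample taken at a point in time. */
    static final class Sample {
        final long timeMillis;
        final long pendingReplicationBlocks; // queue size reported by the namenode
        final long blocksReplicatedTotal;    // assumed cumulative, summed over datanodes

        Sample(long timeMillis, long pendingReplicationBlocks, long blocksReplicatedTotal) {
            this.timeMillis = timeMillis;
            this.pendingReplicationBlocks = pendingReplicationBlocks;
            this.blocksReplicatedTotal = blocksReplicatedTotal;
        }
    }

    /** Adds up the per-datanode counters into one cluster-wide figure. */
    static long sumBlocksReplicated(List<Long> perDatanodeCounts) {
        long total = 0;
        for (long count : perDatanodeCounts) {
            total += count;
        }
        return total;
    }

    /** Blocks replicated per second between two samples. */
    static double replicationRate(Sample older, Sample newer) {
        double seconds = (newer.timeMillis - older.timeMillis) / 1000.0;
        if (seconds <= 0) {
            throw new IllegalArgumentException("samples must be in time order");
        }
        return (newer.blocksReplicatedTotal - older.blocksReplicatedTotal) / seconds;
    }

    /**
     * Flags the interval when the queue is non-empty, has not shrunk, and the
     * cluster replicated fewer than minBlocksPerSecond (an assumed threshold).
     */
    static boolean isStruggling(Sample older, Sample newer, double minBlocksPerSecond) {
        boolean queueNotDraining = newer.pendingReplicationBlocks > 0
                && newer.pendingReplicationBlocks >= older.pendingReplicationBlocks;
        return queueNotDraining && replicationRate(older, newer) < minBlocksPerSecond;
    }
}
```

A real implementation would presumably look at a window of samples rather 
than a single pair, so that one slow interval does not trigger an alert.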

In a separate JIRA, (1) will track a ping-like facility for querying the 
baseline health of the Namenode. In particular, it will verify that all 
expected threads are alive, perform inexpensive sanity checks on data 
structures, etc. Administrators who run this check periodically can hook it 
into whatever notification scheme their deployment uses.
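
As an illustration of the "expected threads are alive" part of such a check, 
a minimal sketch follows. The class and method names are made up for this 
example, and the thread names an actual namenode would need to verify are not 
specified here. It relies only on the standard Thread.getAllStackTraces() 
call to enumerate live threads in the JVM.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not an existing Hadoop class. */
final class ThreadLivenessCheck {

    /**
     * Returns the expected thread names for which no live thread
     * was found in this JVM. An empty list means the check passed.
     */
    static List<String> missingThreads(Collection<String> expectedNames) {
        // Collect the names of all threads that are currently alive.
        Set<String> live = new HashSet<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive()) {
                live.add(t.getName());
            }
        }
        // Anything expected but not seen is reported as missing.
        List<String> missing = new ArrayList<>();
        for (String name : expectedNames) {
            if (!live.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Matching on thread names is only one possible approach; the daemon could just 
as well keep references to its own Thread objects and test those directly.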

> Name node should notify administrator if when struggling with replication
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3323
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3323
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Robert Chansler
>
> Name node performance suffers if either the replication queue is too big, or 
> the available space at data nodes is too small. In either case, the administrator 
> should be notified.
> If the situation is really desperate, the name node perhaps should enter safe 
> mode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
