[
https://issues.apache.org/jira/browse/HADOOP-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613759#action_12613759
]
Chris Douglas commented on HADOOP-3323:
---------------------------------------
After some discussion, it's become clear that this may be completed in two
parts:
# A brief health check the namenode can perform itself
# A metrics-based solution tracking namenode throughput over time, capable of
inferring more complex and nuanced signs of distress
Work on (2) will fall out of a generalized metrics reporting and alerting
mechanism to be completed in concert with HADOOP-3719. The particular set of
metrics and implementation will remain in this JIRA. Specifically, the
implementation will likely correlate the size of the replication queue
(FSNamesystemMetrics::pendingReplicationBlocks) with datanode metrics tracking
replicated blocks (DataNodeMetrics::blocksReplicated) aggregated across the
cluster. The intent would be to track replication throughput, presuming that
slow replication at the datanodes, a slow-draining replication queue, and low
storage capacity would accurately capture the conditions called out here.
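As a rough illustration of the correlation described above, the following sketch
keeps a sliding window of samples and flags the case where the replication queue
is not draining while cluster-wide replication throughput is low. It is not
Hadoop code: the class ReplicationBacklogMonitor, its methods, and the threshold
parameter are hypothetical, and it assumes the caller has already summed
blocksReplicated across all datanodes into a running total. Storage capacity is
left out for brevity.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch, not part of Hadoop: correlates queue size with replication rate. */
class ReplicationBacklogMonitor {
    /** One observation of the two metrics at a point in time. */
    static final class Sample {
        final long timeMillis;
        final long pendingReplicationBlocks; // queue size reported by the namenode
        final long blocksReplicatedTotal;    // running total summed over all datanodes

        Sample(long timeMillis, long pendingReplicationBlocks, long blocksReplicatedTotal) {
            this.timeMillis = timeMillis;
            this.pendingReplicationBlocks = pendingReplicationBlocks;
            this.blocksReplicatedTotal = blocksReplicatedTotal;
        }
    }

    private final Deque<Sample> window = new ArrayDeque<Sample>();
    private final int maxSamples;
    private final double minBlocksPerSecond; // assumed alerting threshold

    ReplicationBacklogMonitor(int maxSamples, double minBlocksPerSecond) {
        if (maxSamples < 2) {
            throw new IllegalArgumentException("need at least two samples to compute a rate");
        }
        this.maxSamples = maxSamples;
        this.minBlocksPerSecond = minBlocksPerSecond;
    }

    /** Adds a sample and evicts the oldest one once the window is full. */
    void record(long timeMillis, long pendingReplicationBlocks, long blocksReplicatedTotal) {
        window.addLast(new Sample(timeMillis, pendingReplicationBlocks, blocksReplicatedTotal));
        if (window.size() > maxSamples) {
            window.removeFirst();
        }
    }

    /** Blocks replicated per second over the window, or NaN if it cannot be computed. */
    double replicationRate() {
        if (window.size() < 2) {
            return Double.NaN;
        }
        Sample first = window.peekFirst();
        Sample last = window.peekLast();
        long elapsedMillis = last.timeMillis - first.timeMillis;
        if (elapsedMillis <= 0) {
            return Double.NaN;
        }
        return (last.blocksReplicatedTotal - first.blocksReplicatedTotal) * 1000.0 / elapsedMillis;
    }

    /** True when the queue is non-empty, has not shrunk, and throughput is below the threshold. */
    boolean isStruggling() {
        double rate = replicationRate();
        if (Double.isNaN(rate)) {
            return false;
        }
        Sample first = window.peekFirst();
        Sample last = window.peekLast();
        boolean queueNotDraining = last.pendingReplicationBlocks > 0
            && last.pendingReplicationBlocks >= first.pendingReplicationBlocks;
        return queueNotDraining && rate < minBlocksPerSecond;
    }
}
```

A periodic task would call record() with the latest metric values and raise an
alert through whatever mechanism is available when isStruggling() returns true.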
In a separate JIRA, (1) will track a ping-like facility for querying the
baseline health of the namenode. In particular, it will verify that all
expected threads are alive, perform inexpensive sanity checks on data
structures, etc. Administrators periodically running this check can
configure/attach to the notification scheme used in their deployment.
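The thread-liveness part of such a check could look roughly like the sketch
below. It is only an illustration under assumptions: the class
ThreadLivenessCheck and the thread names passed to it are hypothetical, and how
the namenode would register its threads or expose the result to an
administrator's tooling is not specified here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not part of Hadoop: reports which expected threads have died. */
class ThreadLivenessCheck {
    // Insertion order is kept so the report lists threads in registration order.
    private final Map<String, Thread> expected = new LinkedHashMap<String, Thread>();

    /** Registers a thread that should be alive for as long as the server is running. */
    void expect(String name, Thread thread) {
        expected.put(name, thread);
    }

    /** Returns the names of registered threads that are not currently alive. */
    List<String> deadThreads() {
        List<String> dead = new ArrayList<String>();
        for (Map.Entry<String, Thread> entry : expected.entrySet()) {
            if (!entry.getValue().isAlive()) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }

    /** True when every registered thread is alive. */
    boolean isHealthy() {
        return deadThreads().isEmpty();
    }
}
```

A ping handler could return isHealthy() along with the list from deadThreads(),
leaving it to the administrator's periodic job to decide how to notify.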
> Name node should notify administrator when struggling with replication
> ----------------------------------------------------------------------
>
> Key: HADOOP-3323
> URL: https://issues.apache.org/jira/browse/HADOOP-3323
> Project: Hadoop Core
> Issue Type: Improvement
> Components: dfs
> Reporter: Robert Chansler
>
> Name node performance suffers if either the replication queue is too big or
> the available space at data nodes is too small. In either case, the administrator
> should be notified.
> If the situation is really desperate, the name node perhaps should enter safe
> mode.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.