Andrew Onischuk created AMBARI-5681:
---------------------------------------

             Summary: Add Nagios alert if HDFS last checkpoint time exceeds threshold
                 Key: AMBARI-5681
                 URL: https://issues.apache.org/jira/browse/AMBARI-5681
             Project: Ambari
          Issue Type: Bug
            Reporter: Andrew Onischuk
            Assignee: Andrew Onischuk
             Fix For: 1.6.0


Description: If the Secondary NameNode (SNN) fails to merge edit files for any
reason, Nagios does not alert on it.

PROBLEM: If the SNN fails to merge edit files for a long time, for whatever
reason, the failure goes undetected. The edit files can then grow very large,
which slows down NameNode performance and, in some cases, can lead to
corruption of the NameNode edit files.
BUSINESS IMPACT: If Nagios does not alert on SNN failures, this will
eventually cause long downtime for customers and a possibility of data loss.

STEPS TO REPRODUCE:

  * SNN fails to merge edit files for any reason
  * NameNode edit files grow in size
  * Edit files become corrupted

ACTUAL BEHAVIOR: Nagios does not fire a critical alarm
EXPECTED BEHAVIOR: Nagios should fire a critical alarm
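
A minimal sketch of the kind of check the expected behavior implies, not the
actual fix: it maps the age of the last checkpoint to a Nagios plugin status.
The function name, the 6-hour and 12-hour thresholds, and the assumption that
the last checkpoint time is already available as milliseconds since the epoch
are all hypothetical. How that value is obtained from the NameNode is left
out.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

MS_PER_HOUR = 3600 * 1000.0


def check_checkpoint_age(last_checkpoint_ms, now_ms,
                         warn_hours=6.0, crit_hours=12.0):
    """Return (status, message) for the age of the last HDFS checkpoint.

    Hypothetical helper: both timestamps are milliseconds since the epoch,
    and the default thresholds are illustrative, not taken from the issue.
    """
    # A missing or non-positive timestamp means the age cannot be computed.
    if last_checkpoint_ms is None or last_checkpoint_ms <= 0:
        return UNKNOWN, "UNKNOWN: last checkpoint time unavailable"

    age_hours = (now_ms - last_checkpoint_ms) / MS_PER_HOUR

    # Test the critical threshold first so it takes precedence over warning.
    if age_hours >= crit_hours:
        return CRITICAL, "CRITICAL: last checkpoint was %.1f hours ago" % age_hours
    if age_hours >= warn_hours:
        return WARNING, "WARNING: last checkpoint was %.1f hours ago" % age_hours
    return OK, "OK: last checkpoint was %.1f hours ago" % age_hours
```

A Nagios plugin built on this would print the message and exit with the
returned status code, so that a stalled SNN merge raises an alarm once the
critical threshold is passed.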

SUPPORT ANALYSIS: N/A

Note:

We need to get this fixed and alert our customers to add the Nagios alarm
ASAP.





--
This message was sent by Atlassian JIRA
(v6.2#6252)
