[
https://issues.apache.org/jira/browse/AMBARI-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Lysnichenko updated AMBARI-7791:
---------------------------------------
Attachment: AMBARI-7791_branch-1.7.0.patch
> HBase Master CPU utilization alert is not suppressed at MM
> ----------------------------------------------------------
>
> Key: AMBARI-7791
> URL: https://issues.apache.org/jira/browse/AMBARI-7791
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 1.7.0
> Reporter: Dmitry Lysnichenko
> Assignee: Dmitry Lysnichenko
> Fix For: 1.7.0
>
> Attachments: AMBARI-7791_branch-1.7.0.patch
>
>
> There appears to be a design flaw in how some alerts are suppressed. It
> causes a rare bug that probably also affects 1.6.1.
> h2. The short story
> When we put HBase Master (or the entire HBase service) into Maintenance Mode
> (MM) and then stop HBase Master, the alert "HBase Master CPU utilization"
> pops up and is not suppressed. The issue reproduces only when HBase Master
> is located on a different host than the Nagios server.
> h2. How suppressing alerts works
> When we put a service, host, or host component into MM, the server builds
> a complete map of the host components that are in MM and posts it to an
> agent. The agent writes this information to the file /var/nagios/ignore.dat
> in the form:
> {code}
> vm-3.vm GANGLIA GANGLIA_MONITOR
> vm-0.vm HBASE HBASE_MASTER
> vm-3.vm HDFS DATANODE
> vm-2.vm HBASE HBASE_REGIONSERVER
> vm-0.vm HBASE HBASE_REGIONSERVER
> vm-1.vm HBASE HBASE_REGIONSERVER
> vm-3.vm YARN NODEMANAGER
> vm-3.vm HBASE HBASE_REGIONSERVER
> {code}
> All Nagios alerts are wrapped in a shell script (check_wrapper.sh). When
> any alert is generated, the wrapper checks whether the hostname, service
> name, and component name for the alert are present in
> /var/nagios/ignore.dat. If they are, the alert is suppressed.
> h2. What exactly is broken
> In https://issues.apache.org/jira/browse/AMBARI-6358 we had a
> requirement to have only one 'HBase Master CPU utilization' check, even in
> HA mode. So this check is bound to the Nagios host (so that it runs only once
> even if the hbase master hostgroup has more than one host, as is done for
> the "* Percent Count" alerts). As a result, the HBase Master alert's origin
> data does not match any entry in /var/nagios/ignore.dat. That's why the
> alert is not suppressed.
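
The mismatch described above can be illustrated with a minimal sketch. It is not the actual check_wrapper.sh logic: it assumes the wrapper does an exact match on the (host, service, component) triple, and the names `parse_ignore`, `is_suppressed`, and the Nagios hostname `nagios.vm` are hypothetical, introduced only for illustration. The ignore entries are the ones listed in the description.

```python
from typing import Iterable, Set, Tuple

# One ignore.dat entry: (host, service, component)
Entry = Tuple[str, str, str]

# Contents of /var/nagios/ignore.dat as quoted in the issue description.
IGNORE_DAT = """\
vm-3.vm GANGLIA GANGLIA_MONITOR
vm-0.vm HBASE HBASE_MASTER
vm-3.vm HDFS DATANODE
vm-2.vm HBASE HBASE_REGIONSERVER
vm-0.vm HBASE HBASE_REGIONSERVER
vm-1.vm HBASE HBASE_REGIONSERVER
vm-3.vm YARN NODEMANAGER
vm-3.vm HBASE HBASE_REGIONSERVER
"""


def parse_ignore(lines: Iterable[str]) -> Set[Entry]:
    """Parse whitespace-separated 'host service component' lines into a set."""
    entries: Set[Entry] = set()
    for line in lines:
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            entries.add((parts[0], parts[1], parts[2]))
    return entries


def is_suppressed(ignored: Set[Entry], host: str, service: str, component: str) -> bool:
    """Assumed wrapper behaviour: suppress only on an exact triple match."""
    return (host, service, component) in ignored


ignored = parse_ignore(IGNORE_DAT.splitlines())

# An alert that originates from the HBase Master host matches an entry.
print(is_suppressed(ignored, "vm-0.vm", "HBASE", "HBASE_MASTER"))

# The CPU utilization check is bound to the Nagios host (hypothetical name
# "nagios.vm"), so its origin data matches no entry and it is not suppressed.
print(is_suppressed(ignored, "nagios.vm", "HBASE", "HBASE_MASTER"))
```

Under these assumptions the first lookup succeeds and the second fails, which corresponds to the case where HBase Master and the Nagios server are on different hosts.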