Hi all,
We're currently thinking about how to add SNMP alarms to Sprout, Homestead and
Ralf (but not Bono or Homer at this point), including what conditions are
alarmable and how we report them.
The spec below represents our current plan. It would be great to hear the
community's thoughts on this, including:
· what you're using to monitor SNMP alarms at the moment, so we can
test for compatibility with the popular solutions
· any Clearwater problems you've seen and struggled to diagnose, in
case it makes sense to add an alarm for them
· if you're using snmpd for other monitoring, any pitfalls or
difficulties we might want to be aware of
Best regards,
Rob
________________________________
Overview
Clearwater nodes use alarms to report errors over the SNMP interface when they
are in an abnormal state, and clear them when they return to normal. They only
report errors relating to that node - not errors relating to the whole
deployment, which is the role of an external monitoring service.
Two approaches are taken to avoid alarms getting out of sync:
* When the Sprout/Homestead/Ralf etc. process starts or restarts,
Clearwater nodes will clear all alarms they could raise, in case they crashed
or were stopped with some alarms active.
* An operator or orchestrator will be able to make Clearwater clear all
alarms and then re-raise active ones, by running a single command (sketched
below). This might be used after an SNMP manager has lost connectivity to the
deployment and wants to make sure it is back in sync.
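A minimal sketch of that re-sync flow, in Python, is below; the function and
parameter names (resync_alarms, condition_holds, clear_alarm, raise_alarm) are
illustrative assumptions, not part of the Clearwater code or this spec.

def resync_alarms(alarm_models, condition_holds, clear_alarm, raise_alarm):
    """Clear every alarm this node could raise, then re-raise the active ones.

    alarm_models    -- all alarm identifiers this node can raise
    condition_holds -- callable returning True if the alarm condition is active
    clear_alarm     -- callable that clears the alarm over SNMP
    raise_alarm     -- callable that raises the alarm over SNMP
    """
    for alarm in alarm_models:
        clear_alarm(alarm)
    for alarm in alarm_models:
        if condition_holds(alarm):
            raise_alarm(alarm)
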
Error indications come in two forms:
* For clearly-defined errors not based on thresholds, the Clearwater node
sends an SNMP notification to a configured external SNMP manager, referencing
an ITU alarm MIB from RFC 3877, which gives context such as severity to that
alarm.
* For errors based on a threshold set on a statistic (such as latency
targets or number of failed connections), the Clearwater node exposes that
statistic over SNMP. A downstream statistics aggregator (from the MANO layer)
monitors these statistics, compares them to its configured thresholds, and
raises alarms on that basis.
The details of errors in each of these two categories are defined in the
detailed specification below.
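As a rough illustration of the second, threshold-based category, the check a
downstream statistics aggregator might apply to a value it polls over SNMP is
sketched below in Python. The statistic, the 250ms target and the helper names
are illustrative assumptions; the real aggregator and its thresholds are
outside the scope of this spec.

def check_threshold(value, threshold, active, raise_alarm, clear_alarm):
    # Raise an alarm when the polled statistic crosses its configured
    # threshold; clear it when the statistic drops back below.
    if value > threshold and not active:
        raise_alarm(value, threshold)
        return True
    if value <= threshold and active:
        clear_alarm(value, threshold)
        return False
    return active

# e.g. latency over the last 5 seconds, against an assumed 250ms target
active = False
for latency_ms in (120, 310, 290, 180):   # values polled over SNMP
    active = check_threshold(latency_ms, 250, active,
                             lambda v, t: print("RAISE: %sms > %sms" % (v, t)),
                             lambda v, t: print("CLEAR: %sms <= %sms" % (v, t)))
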
Background
This section provides useful background on the IETF and ITU specifications
relevant to this feature.
ITU X.733 and X.736 are the ITU specifications describing a model for telephony
network alarms. Section 5 of RFC 3877 defines a convention for representing the
X.733 ITU alarms over SNMP, using the ituAlarmTable MIB to define alarms
fitting the ITU alarm model. These are extensions of the similar but more general
(i.e. not ITU-specific) alarm MIBs defined in section 4.
When an alarm condition (as defined below) occurs, an SNMPv2 NOTIFICATION of
type alarmActiveState is sent to the external SNMP manager (referencing the
particular alarm MIB from the table) to indicate this state change.
When an alarm condition clears, an SNMPv2 NOTIFICATION of type alarmClearState
is sent to the external SNMP manager (again referencing the particular alarm
MIB from the table) to indicate this state change.
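As a non-normative illustration, an equivalent notification can be generated
with net-snmp's snmpinform tool (which matches the informsink configuration
described at the end of this spec). This assumes net-snmp is installed, the
RFC 3877 ALARM-MIB files are available to its MIB parser, and that the manager
address and community string below are stand-ins for real values.

import subprocess

# Send an alarmActiveState inform to the SNMP manager; the empty argument
# tells snmpinform to use the current sysUpTime.
subprocess.run(
    ["snmpinform", "-v", "2c", "-c", "public",
     "192.0.2.10", "", "ALARM-MIB::alarmActiveState"],
    check=True)
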
ituAlarmEntry MIBs also allow the specification of details useful in telecom
networks, such as the severity of an alarm and its probable cause
(ituAlarmPerceivedSeverity, ituAlarmProbableCause). The SNMP interface for
these details is specified in RFC 3877, but their meaning is defined in ITU
X.733 and ITU X.736.
Although RFC 3877 includes a way to provide persistent alarms, through a MIB
table of active and cleared alarms which can be queried, Clearwater will not
use this, as it is not widely supported by SNMP managers or other SNMP agents.
Detailed specification
This section adds details to support and flesh out the Overview section above.
Extent of RFC 3877 support
Clearwater nodes provide the following over SNMP:
* Per-node ITU alarm notifications (section 3.3.2 of the RFC)
* Probable cause indications (section 3.3.3)
Clearwater nodes do not support runtime configuration of alarm models over a
read-write SNMP interface (section 3.3.7). If users wish to treat particular
alarms as having a non-default severity, we expect their SNMP manager to apply
that remapping. They also do not support per-user access to alarms based on
RFC 3411 security credentials (section 3.4) or, as noted above, persistent
alarm tables.
In terms of specific MIBs, we provide a table of alarmModelEntry MIBs and
ituAlarmEntry MIBs (which augments alarmModelEntry) - each instance of these
MIBs defines a type of alarm and its attributes, for example defining that a
Sprout-Homestead connectivity failure is a possible alarm with critical
severity.
* The ituAlarmEntry MIB is supported, with:
* ituAlarmPerceivedSeverity set to "major" or "critical" depending on
whether this alarm represents a partial service impact or a complete service
outage
* ituAlarmEventType, ituAlarmProbableCause and ituAlarmAdditionalText
set to fixed values for each event type, as defined below
* ituAlarmGenericModel is set to point to the matching alarmModelEntry
MIB
* The alarmModelEntry MIB is supported, with:
* alarmModelIndex is set to a unique integer for each alarm
* alarmModelState is set to 1 (for the cleared state) or 2 (for the
active state). Different severities are represented by
ituAlarmPerceivedSeverity, not by this MIB.
* alarmModelNotificationId is set to alarmActiveState
* alarmModelVarbindIndex, alarmModelVarbindValue and
alarmModelVarbindSubtree are set to 0 (as the use of alarmActiveState or
alarmClearState notifications indicates the state, rather than a variable in a
notification)
* alarmModelDescription is set to the same value as
ituAlarmAdditionalText
* alarmModelSpecificPointer is set to point to the ituAlarmEntry MIB
* alarmModelResourcePrefix is always 0.0, as Clearwater elements (e.g.
memcached) do not map to individual MIBs
* alarmModelRowStatus is not supported, as we do not provide a writable
SNMP interface
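To make the field settings above concrete, a single alarm definition might be
represented internally along the lines of the sketch below before being exposed
over SNMP. The alarm chosen (Sprout-Homestead connectivity failure), its index
and its text are illustrative; only the field names and fixed values come from
the lists above.

SPROUT_HOMESTEAD_COMM_FAILURE = {
    # alarmModelEntry fields
    "alarmModelIndex": 1000,                      # unique per alarm (value assumed)
    "alarmModelState": {"cleared": 1, "active": 2},
    "alarmModelNotificationId": "alarmActiveState",
    "alarmModelVarbindIndex": 0,
    "alarmModelVarbindValue": 0,
    "alarmModelVarbindSubtree": 0,
    "alarmModelDescription": "Sprout cannot contact any Homestead",  # text assumed
    # ituAlarmEntry fields (augmenting the alarmModelEntry above)
    "ituAlarmPerceivedSeverity": "critical",      # complete outage on this node
    "ituAlarmEventType": "operationalViolation",
    "ituAlarmProbableCause": 165,                 # underlayingResourceUnavailable (sic)
    "ituAlarmAdditionalText": "Sprout cannot contact any Homestead",
}
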
Alarms for individual nodes vs. alarms for the whole deployment
ITU X.733 says that severity levels "provide an indication of how it is
perceived that the capability of the managed object has been affected". That
is, alarms relate specifically to the capability of the managed object (the
individual Clearwater nodes). A critical or major alarm will therefore indicate
that service has been affected on the individual node, regardless of whether it
has been affected for the deployment as a whole (e.g. calls are still going
through the node but failing) or not (e.g. the node has failed so thoroughly
that calls/HTTP requests are being routed elsewhere).
We require the management and orchestration layer to include functionality capable
of determining the state of the deployment as a whole from the state of
individual nodes; no Clearwater node has enough knowledge about the whole
deployment to do this.
General conditions for alarms
Alarms representing the failure of a process are raised as soon as the failure
is detected, and clear once the process is restarted.
Alarms representing the failure of a connection (e.g. Sprout's connections to
Homestead) trigger as soon as the connection times out or an error response is
seen, and clear when the next successful response is seen over that connection.
This means that Clearwater nodes will alarm briefly due to management actions,
namely upgrade, elastic scaling, or stopping or restarting that Clearwater
node. They may also alarm briefly in situations not requiring operator
attention (e.g. crashes where the failed process is immediately restarted) so
that internal failures are visible to support engineers and can be easily
tracked.
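A minimal sketch of this raise/clear behaviour for a single connection alarm,
assuming hypothetical raise_alarm/clear_alarm callables that send the
alarmActiveState and alarmClearState notifications:

class ConnectionAlarm(object):
    # Raises on the first timeout or error response over a connection and
    # clears on the next successful response, as described above.

    def __init__(self, alarm_id, raise_alarm, clear_alarm):
        self._alarm_id = alarm_id
        self._raise = raise_alarm
        self._clear = clear_alarm
        self._active = False

    def on_response(self, success):
        # success=False for a timeout or error response, True otherwise
        if not success and not self._active:
            self._active = True
            self._raise(self._alarm_id)
        elif success and self._active:
            self._active = False
            self._clear(self._alarm_id)
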
Error conditions
Clearwater alarms may not reach the SNMP manager in cases where the failure is
so drastic that SNMPd cannot communicate, such as a complete power outage or
network outage. The cloud environment must provide monitoring tools (such as
regular SNMP polling) to detect failures of this kind.
When SNMPd fails and is restarted, it loses its record of statistics, and will
not make the corresponding SNMP OIDs available until it receives the next
statistics update from Clearwater. This is a relatively unlikely failure case,
but if it does
happen, it quickly recovers because the nodes re-publish their statistics every
five seconds based on a Last Value Cache.
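A sketch of that re-publication behaviour, assuming a hypothetical
publish(name, value) callable that pushes one statistic to snmpd:

import threading
import time

class LastValueCache(object):
    # Caches the most recent value of each statistic and re-publishes the
    # whole set every five seconds, so a restarted snmpd recovers its
    # statistics within one cycle.

    def __init__(self, publish, interval=5.0):
        self._values = {}
        self._publish = publish
        self._interval = interval
        self._lock = threading.Lock()

    def update(self, name, value):
        with self._lock:
            self._values[name] = value

    def republish_loop(self):
        while True:
            with self._lock:
                snapshot = dict(self._values)
            for name, value in snapshot.items():
                self._publish(name, value)
            time.sleep(self._interval)
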
If a Clearwater component fails and is automatically restarted, its alarms will
automatically clear. If the restart fixes the issue, this is the right
behaviour; if not, a new alarm will be raised when the failure happens again.
If a Clearwater node disappears completely (scaled down or otherwise destroyed
by the orchestrator) it will not be able to clear any outstanding alarms it
raised. We expect the EMS to allow removing a node's alarms if it has been
destroyed, either manually or automatically (i.e. driven by the orchestrator).
Specific alarm conditions for each node type
This section describes the alarms generated by Clearwater nodes themselves.
Operators are also likely to configure threshold alarms based on statistics for
those nodes (described below).
These alarm types are classified by their default ITU severity level (critical,
major or minor). As a rough guide, critical severity indicates that the node is
totally out of service, major represents a serious degradation of service (e.g.
some but not all calls failing, or a billing outage), and minor represents a
non-service-affecting fault. (See ITU X.733 8.1.2.3.)
All Clearwater alarms have ITU event type "operationalViolation" as per ITU
X.736, and additional text which reflects the description of the error below.
The "probable cause" field is set to "softwareError (163)" if a process fails,
and "underlayingResourceUnavailable (165)" (sic) if it fails to contact an
external resource.
· Sprout nodes
o Critical
§ Sprout process fails
§ Memcached process fails
§ No Homestead nodes can be contacted (including both network-level errors
such as timeouts and software-level errors such as overload)
§ No Memcacheds can be contacted
o Major
§ Chronos process fails
§ Memcached vBucket completely inaccessible
§ No Ralfs can be contacted
§ No ENUM servers can be contacted
§ Chronos completely failed to pop a timer
§ No Chronoses can be contacted
· Homestead nodes
o Critical
§ Homestead process fails
§ Cassandra process fails
§ Failure to contact the local Cassandra
§ Failure to contact HSS (either transport-level or application-level)
o Major
§ Any Cassandra node in the ring is failed
· Ralf nodes
o Critical
§ Ralf process fails
§ Memcached process fails
§ Chronos process fails
§ No Memcacheds can be contacted
§ No Chronoses can be contacted
§ Chronos completely failed to pop a timer
§ Failure to contact CDF (either transport-level or application-level)
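Purely as an illustration of how the conditions above combine with the fixed
ITU fields, a few of the Sprout entries might be tabulated as below. The
grouping and constant names are assumptions for the sketch; the lists above
are authoritative.

SOFTWARE_ERROR = 163                    # process failure
UNDERLAYING_RESOURCE_UNAVAILABLE = 165  # (sic) failure to contact a resource

SPROUT_ALARMS = [
    # (description,                          severity,   probable cause)
    ("Sprout process fails",                 "critical", SOFTWARE_ERROR),
    ("No Homestead nodes can be contacted",  "critical", UNDERLAYING_RESOURCE_UNAVAILABLE),
    ("Chronos process fails",                "major",    SOFTWARE_ERROR),
    ("No ENUM servers can be contacted",     "major",    UNDERLAYING_RESOURCE_UNAVAILABLE),
]
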
Statistics to be used for downstream threshold alarms
These statistics are to be used by downstream elements to generate threshold
alarms. The definition of any generated alarms (e.g. MIB fields) is left to
those downstream elements.
· Sprout nodes
o Latency over the last 5 seconds
o For each of Homestead, Bono, IBCF, MGCF, Application servers, Ralf,
Memcached, ENUM:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ DNS errors/timeouts received over the last 5 seconds (count per IP)
o Calls failed due to Homestead errors over the last 5 seconds
o Calls failed due to overload over the last 5 seconds
· Bono nodes
o Latency over the last 5 seconds
o For each of Sprout, Ralf:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ 5xx errors received over the last 5 seconds (count per IP)
o Calls failed due to overload over the last 5 seconds
· Homestead nodes
o Latency over the last 5 seconds
o For the HSS:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ Diameter errors received over the last 5 seconds (count per IP)
o Requests failed due to overload over the last 5 seconds
· Ralf nodes
o Latency over the last 5 seconds
o For the connection to the external billing system:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ Diameter errors received over the last 5 seconds (count per IP)
o Requests dropped due to overload over the last 5 seconds
All Clearwater nodes will also expose standard SNMP stats to allow downstream
aggregators to notice and alarm on the following conditions:
· High CPU usage
· High memory usage
· Slow disk IO
· Disk full
· High load average (i.e. general slow system)
· Signalling network outage
Configuration
There is only one configuration option related to this feature:
· A /etc/clearwater/config option, snmp_ip, defines the IP address or
hostname (or comma-separated list thereof) where SNMP notifications should be
sent. This translates into one or more snmpd.conf informsink directives.
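As an illustration, the translation could be as simple as the sketch below.
The community string and the exact generation mechanism are assumptions; only
the informsink directive and the comma-separated snmp_ip format come from the
description above.

def informsink_lines(snmp_ip, community="public"):
    # Turn the snmp_ip config value into snmpd.conf informsink directives.
    return ["informsink %s %s" % (host.strip(), community)
            for host in snmp_ip.split(",") if host.strip()]

# e.g. snmp_ip=192.0.2.10,snmp.example.com gives:
#   informsink 192.0.2.10 public
#   informsink snmp.example.com public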