Hi all,
We're currently thinking about how to add SNMP alarms to Sprout, Homestead and
Ralf (but not Bono or Homer at this point), including what conditions are
alarmable and how we report them.
The spec below represents our current plan. It would be great to hear the
community's thoughts on this, including:
· what you're using to monitor SNMP alarms at the moment, so we can
test for compatibility with the popular solutions
· any Clearwater problems you've seen and struggled to diagnose, in
case it makes sense to add an alarm for them
· if you're using snmpd for other monitoring, any pitfalls or
difficulties we might want to be aware of
Best regards,
Rob
________________________________
Overview
Clearwater nodes use alarms to report errors over the SNMP interface when they
are in an abnormal state, and clear them when they return to normal. They only
report errors relating to that node - not errors relating to the whole
deployment, which is the role of an external monitoring service.
Two approaches are taken to avoid alarms getting out of sync:
* When the Sprout/Homestead/Ralf etc. process starts or restarts,
Clearwater nodes will clear all alarms they could raise, in case they crashed
or were stopped with some alarms active.
* An operator or orchestrator will be able to make Clearwater clear all
alarms and then re-raise active ones, by running a single command (sketched
below). This might be used after an SNMP manager has lost connectivity to the
deployment and wants to make sure it is back in sync.
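A minimal sketch of that re-sync flow, in Python, is below; the function and
parameter names (resync_alarms, condition_holds, clear_alarm, raise_alarm) are
illustrative assumptions, not part of the Clearwater code or this spec.

def resync_alarms(alarm_models, condition_holds, clear_alarm, raise_alarm):
    """Clear every alarm this node could raise, then re-raise the active ones.

    alarm_models    -- all alarm identifiers this node can raise
    condition_holds -- callable returning True if the alarm condition is active
    clear_alarm     -- callable that clears the alarm over SNMP
    raise_alarm     -- callable that raises the alarm over SNMP
    """
    for alarm in alarm_models:
        clear_alarm(alarm)
    for alarm in alarm_models:
        if condition_holds(alarm):
            raise_alarm(alarm)
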
Error indications come in two forms:
* For clearly-defined errors not based on thresholds, the Clearwater node
sends an SNMP notification to a configured external SNMP manager, referencing
an ITU alarm MIB from RFC 3877, which gives context such as severity to that
alarm.
* For errors based on a threshold set on a statistic (such as latency
targets or number of failed connections), the Clearwater node exposes that
statistic over SNMP. A downstream statistics aggregator (from the MANO layer)
monitors these statistics, compares them to its configured thresholds, and
raises alarms on that basis.
The details of errors in each of these two categories are defined in the
detailed specification below.
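As a rough illustration of the second, threshold-based category, the check a
downstream statistics aggregator might apply to a value it polls over SNMP is
sketched below in Python. The statistic, the 250ms target and the helper names
are illustrative assumptions; the real aggregator and its thresholds are
outside the scope of this spec.

def check_threshold(value, threshold, active, raise_alarm, clear_alarm):
    # Raise an alarm when the polled statistic crosses its configured
    # threshold; clear it when the statistic drops back below.
    if value > threshold and not active:
        raise_alarm(value, threshold)
        return True
    if value <= threshold and active:
        clear_alarm(value, threshold)
        return False
    return active

# e.g. latency over the last 5 seconds, against an assumed 250ms target
active = False
for latency_ms in (120, 310, 290, 180):   # values polled over SNMP
    active = check_threshold(latency_ms, 250, active,
                             lambda v, t: print("RAISE: %sms > %sms" % (v, t)),
                             lambda v, t: print("CLEAR: %sms <= %sms" % (v, t)))
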
Background
This section provides useful background on the IETF and ITU specifications
relevant to this feature.
ITU X.733 and X.736 are the ITU specifications describing a model for telephony
network alarms. Section 5 of RFC 3877 defines a convention for representing the
X.733 ITU alarms over SNMP, using the ituAlarmTable MIB to define alarms
fitting the ITU alarm model. These are extensions of the similar but more general
(i.e. not ITU-specific) alarm MIBs defined in section 4.
When an alarm condition (as defined below) occurs, an SNMPv2 NOTIFICATION of
type alarmActiveState is sent to the external SNMP manager (referencing the
particular alarm MIB from the table) to indicate this state change.
When an alarm condition clears, an SNMPv2 NOTIFICATION of type alarmClearState
is sent to the external SNMP manager (again referencing the particular alarm
MIB from the table) to indicate this state change.
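As a non-normative illustration, an equivalent notification can be generated
with net-snmp's snmpinform tool (which matches the informsink configuration
described at the end of this spec). This assumes net-snmp is installed, the
RFC 3877 ALARM-MIB files are available to its MIB parser, and that the manager
address and community string below are stand-ins for real values.

import subprocess

# Send an alarmActiveState inform to the SNMP manager; the empty argument
# tells snmpinform to use the current sysUpTime.
subprocess.run(
    ["snmpinform", "-v", "2c", "-c", "public",
     "192.0.2.10", "", "ALARM-MIB::alarmActiveState"],
    check=True)
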
ituAlarmEntry MIBs also allow the specification of details useful in telecom
networks, such as the severity of an alarm and its probable cause
(ituAlarmPerceivedSeverity, ituAlarmProbableCause). The SNMP interface for
these details is specified in RFC 3877, but their meaning is defined in ITU
X.733 and ITU X.736.
Although RFC 3877 includes a way to provide persistent alarms, through a MIB
table of active and cleared alarms which can be queried, Clearwater will not
use this, as it is not widely supported by SNMP managers or other SNMP agents.
Detailed specification
This section adds details to support and flesh out the Overview section above.
Extent of RFC 3877 support
Clearwater nodes provide the following over SNMP:
* Per-node ITU alarm notifications (section 3.3.2 of the RFC)
* Probable cause indications (section 3.3.3)
Clearwater nodes do not support runtime configuration of alarm models over a
read-write SNMP interface (section 3.3.7). If users wish to treat particular
alarms as having a non-default severity, we expect their SNMP manager to apply
that remapping. They also do not support per-user access to alarms based on
RFC 3411 security credentials (section 3.4) or, as noted above, persistent
alarm tables.
In terms of specific MIBs, we provide a table of alarmModelEntry MIBs and
ituAlarmEntry MIBs (which augments alarmModelEntry) - each instance of these
MIBs defines a type of alarm and its attributes, for example defining that a
Sprout-Homestead connectivity failure is a possible alarm with critical
severity.
* The ituAlarmEntry MIB is supported, with:
* ituAlarmPerceivedSeverity set to "major" or "critical" depending on
whether this alarm represents a partial service impact or a complete service
outage
* ituAlarmEventType, ituAlarmProbableCause and ituAlarmAdditionalText
set to fixed values for each event type, as defined below
* ituAlarmGenericModel is set to point to the matching alarmModelEntry
MIB
* The alarmModelEntry MIB is supported, with:
* alarmModelIndex is set to a unique integer for each alarm
* alarmModelState is set to 1 (for the cleared state) or 2 (for the
active state). Different severities are represented by
ituAlarmPerceivedSeverity, not by this MIB.
* alarmModelNotificationId is set to alarmActiveState
* alarmModelVarbindIndex, alarmModelVarbindValue and
alarmModelVarbindSubtree are set to 0 (as the use of alarmActiveState or
alarmClearState notifications indicates the state, rather than a variable in a
notification)
* alarmModelDescription is set to the same value as
ituAlarmAdditionalText
* alarmModelSpecificPointer is set to point to the ituAlarmEntry MIB
* alarmModelResourcePrefix is always 0.0, as Clearwater elements (e.g.
memcached) do not map to individual MIBs
* alarmModelRowStatus is not supported, as we do not provide a writable
SNMP interface
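To make the field settings above concrete, a single alarm definition might be
represented internally along the lines of the sketch below before being exposed
over SNMP. The alarm chosen (Sprout-Homestead connectivity failure), its index
and its text are illustrative; only the field names and fixed values come from
the lists above.

SPROUT_HOMESTEAD_COMM_FAILURE = {
    # alarmModelEntry fields
    "alarmModelIndex": 1000,                      # unique per alarm (value assumed)
    "alarmModelState": {"cleared": 1, "active": 2},
    "alarmModelNotificationId": "alarmActiveState",
    "alarmModelVarbindIndex": 0,
    "alarmModelVarbindValue": 0,
    "alarmModelVarbindSubtree": 0,
    "alarmModelDescription": "Sprout cannot contact any Homestead",  # text assumed
    # ituAlarmEntry fields (augmenting the alarmModelEntry above)
    "ituAlarmPerceivedSeverity": "critical",      # complete outage on this node
    "ituAlarmEventType": "operationalViolation",
    "ituAlarmProbableCause": 165,                 # underlayingResourceUnavailable (sic)
    "ituAlarmAdditionalText": "Sprout cannot contact any Homestead",
}
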
Alarms for individual nodes vs. alarms for the whole deployment
ITU X.733 says that severity levels "provide an indication of how it is
perceived that the capability of the managed object has been affected". That
is, alarms relate specifically to the capability of the managed object (the
individual Clearwater nodes). A critical or major alarm will therefore indicate
that service has been affected on the individual node, regardless of whether it
has been affected for the deployment as a whole (e.g. calls are still going
through the node but failing) or not (e.g. the node has failed so thoroughly
that calls/HTTP requests are being routed elsewhere).
We require the management and orchestration layer to include functionality capable
of determining the state of the deployment as a whole from the state of
individual nodes; no Clearwater node has enough knowledge about the whole
deployment to do this.
General conditions for alarms
Alarms representing the failure of a process are raised as soon as the failure
is detected, and clear once the process is restarted.
Alarms representing the failure of a connection (e.g. Sprout's connections to
Homestead) trigger as soon as the connection times out or an error response is
seen, and clear when the next successful response is seen over that connection.
This means that Clearwater nodes will alarm briefly due to management actions,
namely upgrade, elastic scaling, or stopping or restarting that Clearwater
node. They may also alarm briefly in situations not requiring operator
attention (e.g. crashes where the failed process is immediately restarted) so
that internal failures are visible to support engineers and can be easily
tracked.
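A minimal sketch of this raise/clear behaviour for a single connection alarm,
assuming hypothetical raise_alarm/clear_alarm callables that send the
alarmActiveState and alarmClearState notifications:

class ConnectionAlarm(object):
    # Raises on the first timeout or error response over a connection and
    # clears on the next successful response, as described above.

    def __init__(self, alarm_id, raise_alarm, clear_alarm):
        self._alarm_id = alarm_id
        self._raise = raise_alarm
        self._clear = clear_alarm
        self._active = False

    def on_response(self, success):
        # success=False for a timeout or error response, True otherwise
        if not success and not self._active:
            self._active = True
            self._raise(self._alarm_id)
        elif success and self._active:
            self._active = False
            self._clear(self._alarm_id)
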
Error conditions
Clearwater alarms may not reach the SNMP manager in cases where the failure is
so drastic that SNMPd cannot communicate, such as a complete power outage or
network outage. The cloud environment must provide monitoring tools (such as
regular SNMP polling) to detect failures of this kind.
When SNMPd fails and is restarted, it loses its record of statistics, and will
not make the corresponding SNMP OIDs available until it receives the next
statistics update from Clearwater. This is a relatively unlikely failure case,
but if it does
happen, it quickly recovers because the nodes re-publish their statistics every
five seconds based on a Last Value Cache.
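A sketch of that re-publication behaviour, assuming a hypothetical
publish(name, value) callable that pushes one statistic to snmpd:

import threading
import time

class LastValueCache(object):
    # Caches the most recent value of each statistic and re-publishes the
    # whole set every five seconds, so a restarted snmpd recovers its
    # statistics within one cycle.

    def __init__(self, publish, interval=5.0):
        self._values = {}
        self._publish = publish
        self._interval = interval
        self._lock = threading.Lock()

    def update(self, name, value):
        with self._lock:
            self._values[name] = value

    def republish_loop(self):
        while True:
            with self._lock:
                snapshot = dict(self._values)
            for name, value in snapshot.items():
                self._publish(name, value)
            time.sleep(self._interval)
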
If a Clearwater component fails and is automatically restarted, its alarms will
automatically clear. If the restart fixes the issue, this is the right
behaviour; if not, a new alarm will be raised when the failure happens again.
If a Clearwater node disappears completely (scaled down or otherwise destroyed
by the orchestrator) it will not be able to clear any outstanding alarms it
raised. We expect the EMS to allow removing a node's alarms if it has been
destroyed, either manually or automatically (i.e. driven by the orchestrator).
Specific alarm conditions for each node type
This section describes the alarms generated by Clearwater nodes themselves.
Operators are also likely to configure threshold alarms based on statistics for
those nodes (described below).
These alarm types are classified by their default ITU severity level (critical,
major or minor). As a rough guide, critical severity indicates that the node is
totally out of service, major represents a serious degradation of service (e.g.
some but not all calls failing, or a billing outage), and minor represents a
non-service-affecting fault. (See ITU X.733 8.1.2.3.)
All Clearwater alarms have ITU event type "operationalViolation" as per ITU
X.736, and additional text which reflects the description of the error below.
The "probable cause" field is set to "softwareError (163)" if a process fails,
and "underlayingResourceUnavailable (165)" (sic) if it fails to contact an
external resource.
· Sprout nodes
o Critical
§ Sprout process fails
§ Memcached process fails
§ No Homestead nodes can be contacted (including both network-level errors
such as timeouts and software-level errors such as overload)
§ No Memcacheds can be contacted
o Major
§ Chronos process fails
§ Memcached vBucket completely inaccessible
§ No Ralfs can be contacted
§ No ENUM servers can be contacted
§ Chronos completely failed to pop a timer
§ No Chronoses can be contacted
· Homestead nodes
o Critical
§ Homestead process fails
§ Cassandra process fails
§ Failure to contact the local Cassandra
§ Failure to contact HSS (either transport-level or application-level)
o Major
§ Any Cassandra node in the ring is failed
· Ralf nodes
o Critical
§ Ralf process fails
§ Memcached process fails
§ Chronos process fails
§ No Memcacheds can be contacted
§ No Chronoses can be contacted
§ Chronos completely failed to pop a timer
§ Failure to contact CDF (either transport-level or application-level)
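Purely as an illustration of how the conditions above combine with the fixed
ITU fields, a few of the Sprout entries might be tabulated as below. The
grouping and constant names are assumptions for the sketch; the lists above
are authoritative.

SOFTWARE_ERROR = 163                    # process failure
UNDERLAYING_RESOURCE_UNAVAILABLE = 165  # (sic) failure to contact a resource

SPROUT_ALARMS = [
    # (description,                          severity,   probable cause)
    ("Sprout process fails",                 "critical", SOFTWARE_ERROR),
    ("No Homestead nodes can be contacted",  "critical", UNDERLAYING_RESOURCE_UNAVAILABLE),
    ("Chronos process fails",                "major",    SOFTWARE_ERROR),
    ("No ENUM servers can be contacted",     "major",    UNDERLAYING_RESOURCE_UNAVAILABLE),
]
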
Statistics to be used for downstream threshold alarms
These statistics are to be used by downstream elements to generate threshold
alarms. The definition of any generated alarms (e.g. MIB fields) is left to
those downstream elements.
· Sprout nodes
o Latency over the last 5 seconds
o For each of Homestead, Bono, IBCF, MGCF, Application servers, Ralf,
Memcached, ENUM:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ DNS errors/timeouts received over the last 5 seconds (count per IP)
o Calls failed due to Homestead errors over the last 5 seconds
o Calls failed due to overload over the last 5 seconds
· Bono nodes
o Latency over the last 5 seconds
o For each of Sprout, Ralf:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ 5xx errors received over the last 5 seconds (count per IP)
o Calls failed due to overload over the last 5 seconds
· Homestead nodes
o Latency over the last 5 seconds
o For the HSS:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ Diameter errors received over the last 5 seconds (count per IP)
o Requests failed due to overload over the last 5 seconds
· Ralf nodes
o Latency over the last 5 seconds
o For the connection to the external billing system:
§ Attempts to open connection over the last 5 seconds (count per IP)
§ Failure to open connection over the last 5 seconds (count per IP)
§ Requests over the last 5 seconds (count per IP)
§ Diameter errors received over the last 5 seconds (count per IP)
o Requests dropped due to overload over the last 5 seconds
All Clearwater nodes will also expose standard SNMP stats to allow downstream
aggregators to notice and alarm on the following conditions:
· High CPU usage
· High memory usage
· Slow disk IO
· Disk full
· High load average (i.e. general slow system)
· Signalling network outage
Configuration
There is only one configuration option related to this feature:
· A /etc/clearwater/config option, snmp_ip, defines the IP address or
hostname (or comma-separated list thereof) where SNMP notifications should be
sent. This translates into one or more snmpd.conf informsink directives.
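As an illustration, the translation could be as simple as the sketch below.
The community string and the exact generation mechanism are assumptions; only
the informsink directive and the comma-separated snmp_ip format come from the
description above.

def informsink_lines(snmp_ip, community="public"):
    # Turn the snmp_ip config value into snmpd.conf informsink directives.
    return ["informsink %s %s" % (host.strip(), community)
            for host in snmp_ip.split(",") if host.strip()]

# e.g. snmp_ip=192.0.2.10,snmp.example.com gives:
#   informsink 192.0.2.10 public
#   informsink snmp.example.com public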