Hi William,

How are you doing? Please find my responses inline below, marked with [NK].
The root cause appears to be frequent SU/component failures. If those can be stopped, then all of the other problems you mention are solved as well. Unfortunately, there is no way to stop the escalations as per the specs, and the end of the escalation ladder is a node reboot (or node switchover/failover), as happened in your case. So, as I understand the issue, you have two options:

1. Write a patch (proprietary and customized for your system, since OpenSAF 5.1.0 is not supported) that stops the escalations after a certain number of failures.
2. Integrate the OpenSAF cluster with a central dashboard so that all generated notifications and alarms are delivered to the dashboard at run time and monitored 24x7. Once the issue occurs, the admin can lock/lock-in the affected SUs from the dashboard. Please note that the lock/lock-in commands also generate a few notifications and alarms.

Thanks
-Nagendra, +91-9866424860
High Availability Solutions - Did you see our latest product portfolio?
www.hasolutions.in | [email protected]
Delaware, USA: +1 508-422-7725 | Hyderabad, India: +91 798-992-5293

-----Original Message-----
From: William R Elliott [mailto:[email protected]]
Sent: 07 June 2019 23:52
To: [email protected]
Cc: Lisa Ann Lentz-Liddell; David S Thompson
Subject: [users] Questions concerning notifications and service units restarting

Hi,

We are using OpenSAF 5.1.0. We have an environment running 6 machines: two controllers and four payloads. For several different reasons, such as incorrect configuration, some of our components will start correctly, error, and then exit. Since we have components that are configured with a recommended recovery of SA_AMF_COMPONENT_FAILOVER (service unit restart), this can cause service units to bounce constantly until whatever is causing the error is corrected. Unfortunately, this can happen in the middle of the night, and since these environments are not monitored 24x7, the result is many service units constantly bouncing without any correction. After a while of this constant bouncing of service units, OpenSAF appears to be overwhelmed by the notifications, which in turn causes instability in the cluster and can lead to the following:

1. Execution of admin commands times out.
2. One or both of the controllers reboot.

Questions:

1) Is there any way to throttle the notifications in this scenario?
[NK]: AMF sends the notifications in such scenarios as per the specs, so you can only implement filtering at the application level. Either way, AMF will still send the notifications to the NTF server.

2) Is there any way for OpenSAF to automatically lock for instantiation a service unit when it detects it bouncing so many times?
[NK]: Escalation is defined by the specs, and there is no provision for stopping it, so there is no way to automatically lock an SU in such cases. You can, however, write a process that monitors the notifications for a particular object within a specified time window and issues a lock/lock-in if multiple notifications are received for that object (see the sketch after the answer to question 3 below).

3) Is there any way to trigger a lock/lock-in of the SU with the component or the instantiation/cleanup script?
[NK]: You can run "amf-adm lock <object name>" from any script, but then I am not sure how you would count the restarts/failures within a specified period for that particular object.
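[NK]: For illustration only, below is a minimal sketch of the kind of watcher process described under question 2. This is not a supported OpenSAF feature. It assumes the failure events are visible in /var/log/messages on the active controller and that amf-adm is on the PATH; the log pattern, threshold and window are hypothetical placeholders that you would adapt to the notifications or log lines your system actually produces (for example, by feeding it the output of an NTF subscriber instead of syslog).

#!/usr/bin/env python
# Sketch of a watcher that locks an SU after repeated failures.
# Assumptions: failure events appear in /var/log/messages, amf-adm is on the
# PATH, and the pattern/threshold/window below are hypothetical placeholders.

import collections
import re
import subprocess
import time

LOG_FILE = "/var/log/messages"      # assumption: syslog target on the controller
# HYPOTHETICAL pattern: capture the SU DN from whatever your failure lines contain
PATTERN = re.compile(r"osafamfnd.*'(?P<su>safSu=[^']+)'")
THRESHOLD = 5                       # failures per SU before locking it
WINDOW = 300                        # sliding window in seconds

failures = collections.defaultdict(collections.deque)  # SU DN -> event timestamps

def lock_su(su_dn):
    # Lock and lock-in the SU so AMF stops trying to re-instantiate it.
    # Note: these admin operations generate their own notifications.
    subprocess.call(["amf-adm", "lock", su_dn])
    subprocess.call(["amf-adm", "lock-in", su_dn])

def follow(path):
    # Yield lines appended to the file, like 'tail -f'.
    with open(path) as f:
        f.seek(0, 2)                # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(LOG_FILE):
    match = PATTERN.search(line)
    if not match:
        continue
    su = match.group("su")
    now = time.time()
    events = failures[su]
    events.append(now)
    while events and now - events[0] > WINDOW:   # keep only events inside the window
        events.popleft()
    if len(events) >= THRESHOLD:
        lock_su(su)
        events.clear()

You would run something like this as a daemon on the controllers (or from your monitoring host). Keep the pattern specific enough that it does not match the notifications generated by the lock/lock-in commands themselves, otherwise the watcher would react to its own actions.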
Note: When the bouncing is occurring, it's not because the instantiation and cleanup scripts are failing. It's because the components themselves are encountering a fatal error that causes them to exit.

Here are some details from the OpenSAF /var/log/messages output during one of our controller node failures. OpenSAF began to report many failures to send state change notifications (because some of our service units were bouncing frequently):

May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: NO Node 'rbm-fe-s1-h1' left the cluster
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendAlarmNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
.........
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send to ntfa failed rc: 2
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED

This all then leads to an OpenSAF AMF director heartbeat timeout, which forces a restart:

May 28 10:47:58 rbm-fe-s2-h1 osafdtmd[21729]: NO Lost contact with 'rbm-fe-s1-h2'
May 28 10:47:58 rbm-fe-s2-h1 osafimmd[21801]: NO MDS event from svc_id 25 (change:4, dest:604795819815973)
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Global discard node received for nodeId:2260f pid:29733
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Implementer disconnected 14 <0, 2260f(down)> (MsgQueueService140815)
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:59 rbm-fe-s2-h1 osafamfnd[21949]: Rebooting OpenSAF NodeId = 148751 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 148751, SupervisionTime = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Creating lock file: /tmp/opensaf_bounce.lock
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Bouncing local node; timeout=
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: node ID = 148751
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user name = root
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user ID = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Start force shut down of osafdtmd, PID = 21729
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Waiting for 5 seconds to allow processes to shut down

Thanks

_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
