Hi,

We are using OpenSAF 5.1.0 in an environment of six machines: two controllers and four payloads. For several different reasons, such as incorrect configuration, some of our components will start correctly, hit an error and then exit. Since these components are configured with a recommended recovery of SA_AMF_COMPONENT_FAILOVER (which results in a service unit failover), this can cause service units to bounce constantly until whatever is causing the error is corrected. Unfortunately, this can happen in the middle of the night, and since these environments are not monitored 24/7 the result is many service units bouncing continuously without any correction. After a while of this constant bouncing, OpenSAF appears to be overwhelmed by the resulting notifications, which in turn destabilizes the cluster and can cause the following:

1. Execution of admin commands times out.
2. One or both of the controllers reboot.

Questions:

1) Is there any way to throttle the notifications in this scenario?
2) Is there any way for OpenSAF to automatically lock a service unit for instantiation when it detects the service unit bouncing this many times?
3) Is there any way to trigger a lock/lock-in of the SU from the component or from the instantiation/cleanup script?

Note: When the bouncing is occurring, it is not because the instantiation and cleanup scripts are failing; the components themselves encounter a fatal error that causes them to exit.
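To make question 3 concrete, the kind of thing we have in mind is a small guard that our cleanup script could invoke, roughly like the sketch below. Everything in it (the threshold, the window, the state-file path, the example SU DN) is a placeholder for whatever we would actually configure, and whether it is even safe to drive amf-adm from a CLC-CLI cleanup script like this is part of what we are asking:

#!/usr/bin/env python3
"""Rough sketch only, not production code.

Our CLC-CLI cleanup script would invoke this with the SU DN after each
cleanup. If the same SU is cleaned up more than MAX_BOUNCES times within
WINDOW_SECONDS, we lock and then lock-in the SU so it stops bouncing.
"""
import json
import os
import subprocess
import sys
import time

MAX_BOUNCES = 5                           # placeholder threshold
WINDOW_SECONDS = 600                      # placeholder window
STATE_DIR = "/var/run/su_bounce_guard"    # placeholder location


def record_bounce(su_dn):
    """Record a timestamp for this SU and return how many fall in the window."""
    os.makedirs(STATE_DIR, exist_ok=True)
    state_file = os.path.join(STATE_DIR, su_dn + ".json")
    now = time.time()
    try:
        with open(state_file) as f:
            stamps = json.load(f)
    except (IOError, ValueError):
        stamps = []
    stamps = [t for t in stamps if now - t < WINDOW_SECONDS]
    stamps.append(now)
    with open(state_file, "w") as f:
        json.dump(stamps, f)
    return len(stamps)


def lock_su(su_dn):
    """Lock and then lock-in the SU via amf-adm.

    We would probably have to run this detached rather than synchronously
    from the cleanup script, since the lock itself triggers termination and
    cleanup of the SU's components.
    """
    subprocess.call(["amf-adm", "lock", su_dn])
    subprocess.call(["amf-adm", "lock-in", su_dn])


if __name__ == "__main__":
    # e.g. "safSu=SU1,safSg=SG1,safApp=OurApp" (placeholder DN)
    su_dn = sys.argv[1]
    if record_bounce(su_dn) > MAX_BOUNCES:
        lock_su(su_dn)

We would rather not maintain something like this ourselves if AMF already provides an equivalent escalation that ends with the SU locked for instantiation.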
Here are some details from the OpenSAF /var/log/messages output during one of our controller node failures. osafamfd began to report many failures when sending state change notifications (because some of our service units were bouncing frequently):

May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: NO Node 'rbm-fe-s1-h1' left the cluster
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendAlarmNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
.........
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send to ntfa failed rc: 2
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED

All of this then leads to an AMF director heartbeat timeout, which forces a node restart:

May 28 10:47:58 rbm-fe-s2-h1 osafdtmd[21729]: NO Lost contact with 'rbm-fe-s1-h2'
May 28 10:47:58 rbm-fe-s2-h1 osafimmd[21801]: NO MDS event from svc_id 25 (change:4, dest:604795819815973)
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Global discard node received for nodeId:2260f pid:29733
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Implementer disconnected 14 <0, 2260f(down)> (MsgQueueService140815)
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:59 rbm-fe-s2-h1 osafamfnd[21949]: Rebooting OpenSAF NodeId = 148751 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 148751, SupervisionTime = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Creating lock file: /tmp/opensaf_bounce.lock
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Bouncing local node; timeout=
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: node ID = 148751
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user name = root
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user ID = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Start force shut down of osafdtmd, PID = 21729
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Waiting for 5 seconds to allow processes to shut down

Thanks