Hi William,

How are you doing? Please find my responses inline below, marked with [NK].
The root cause appears to be frequent SU/component failures. If those can be stopped, then all of the other problems you mention are solved as well. Unfortunately, there is no way to stop the escalations as per the specs, and the end of the escalation ladder is a node reboot (or node switchover/failover), as happened in your case. So, as I understand the issue, you have two options:

1. Write a patch (proprietary and customized for your system, since OpenSAF 5.1.0 is not supported) that stops the escalations after a certain number of failures.
2. Integrate the OpenSAF cluster with a central dashboard so that all generated notifications and alarms are delivered to the dashboard at run time and monitored 24x7. Once the issue occurs, the admin can lock/lock-in the affected SUs from the dashboard. Please note that the lock/lock-in commands also generate a few notifications and alarms.

Thanks
-Nagendra, +91-9866424860
High Availability Solutions - Did you see our latest product portfolio?
www.hasolutions.in | [email protected]
Delaware, USA: +1 508-422-7725 | Hyderabad, India: +91 798-992-5293

-----Original Message-----
From: William R Elliott [mailto:[email protected]]
Sent: 07 June 2019 23:52
To: [email protected]
Cc: Lisa Ann Lentz-Liddell; David S Thompson
Subject: [users] Questions concerning notifications and service units restarting

Hi,

We are using OpenSAF 5.1.0. We have an environment running 6 machines: two controllers and four payloads. For several different reasons, such as incorrect configuration, some of our components will start correctly, error, and then exit. Since we have components that are configured with a recommended recovery of SA_AMF_COMPONENT_FAILOVER (service unit restart), this can cause service units to bounce constantly until whatever is causing the error is corrected. Unfortunately, this can happen in the middle of the night, and since these environments are not monitored 24x7, the result is many service units constantly bouncing without any correction. After a while of this constant bouncing of service units, OpenSAF appears to be overwhelmed by the notifications, which in turn causes instability in the cluster and can lead to the following:

1. Execution of admin commands times out.
2. One or both of the controllers reboot.

Questions:

1) Is there any way to throttle the notifications in this scenario?
[NK]: AMF sends the notifications in such scenarios as per the specs, so you can only implement filtering at the application level. Either way, AMF will still send the notifications to the NTF server.

2) Is there any way for OpenSAF to automatically lock for instantiation a service unit when it detects it bouncing so many times?
[NK]: Escalation is defined by the specs, and there is no provision for stopping it, so there is no way to automatically lock an SU in such cases. You can, however, write a process that monitors the notifications for a particular object within a specified time window and issues a lock/lock-in if multiple notifications are received for that object (see the sketch after the answer to question 3 below).

3) Is there any way to trigger a lock/lock-in of the SU with the component or the instantiation/cleanup script?
[NK]: You can run "amf-adm lock <object name>" from any script, but then I am not sure how you would count the restarts/failures within a specified period for that particular object.
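[NK]: For illustration only, below is a minimal sketch of the kind of watcher process described under question 2. This is not a supported OpenSAF feature. It assumes the failure events are visible in /var/log/messages on the active controller and that amf-adm is on the PATH; the log pattern, threshold and window are hypothetical placeholders that you would adapt to the notifications or log lines your system actually produces (for example, by feeding it the output of an NTF subscriber instead of syslog).

#!/usr/bin/env python
# Sketch of a watcher that locks an SU after repeated failures.
# Assumptions: failure events appear in /var/log/messages, amf-adm is on the
# PATH, and the pattern/threshold/window below are hypothetical placeholders.

import collections
import re
import subprocess
import time

LOG_FILE = "/var/log/messages"      # assumption: syslog target on the controller
# HYPOTHETICAL pattern: capture the SU DN from whatever your failure lines contain
PATTERN = re.compile(r"osafamfnd.*'(?P<su>safSu=[^']+)'")
THRESHOLD = 5                       # failures per SU before locking it
WINDOW = 300                        # sliding window in seconds

failures = collections.defaultdict(collections.deque)  # SU DN -> event timestamps

def lock_su(su_dn):
    # Lock and lock-in the SU so AMF stops trying to re-instantiate it.
    # Note: these admin operations generate their own notifications.
    subprocess.call(["amf-adm", "lock", su_dn])
    subprocess.call(["amf-adm", "lock-in", su_dn])

def follow(path):
    # Yield lines appended to the file, like 'tail -f'.
    with open(path) as f:
        f.seek(0, 2)                # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(LOG_FILE):
    match = PATTERN.search(line)
    if not match:
        continue
    su = match.group("su")
    now = time.time()
    events = failures[su]
    events.append(now)
    while events and now - events[0] > WINDOW:   # keep only events inside the window
        events.popleft()
    if len(events) >= THRESHOLD:
        lock_su(su)
        events.clear()

You would run something like this as a daemon on the controllers (or from your monitoring host). Keep the pattern specific enough that it does not match the notifications generated by the lock/lock-in commands themselves, otherwise the watcher would react to its own actions.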
Note: When the bouncing is occurring, it's not because the instantiation and cleanup scripts are failing. It's because the components themselves are encountering a fatal error that causes them to exit.

Here are some details from the OpenSAF /var/log/messages output during one of our controller node failures. OpenSAF began to report many failures to send state change notifications (because some of our service units were bouncing frequently):

May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: NO Node 'rbm-fe-s1-h1' left the cluster
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendAlarmNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
.........
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send to ntfa failed rc: 2
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED

This all then leads to an OpenSAF AMF director heartbeat timeout, which forces a restart:

May 28 10:47:58 rbm-fe-s2-h1 osafdtmd[21729]: NO Lost contact with 'rbm-fe-s1-h2'
May 28 10:47:58 rbm-fe-s2-h1 osafimmd[21801]: NO MDS event from svc_id 25 (change:4, dest:604795819815973)
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Global discard node received for nodeId:2260f pid:29733
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Implementer disconnected 14 <0, 2260f(down)> (MsgQueueService140815)
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:59 rbm-fe-s2-h1 osafamfnd[21949]: Rebooting OpenSAF NodeId = 148751 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 148751, SupervisionTime = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Creating lock file: /tmp/opensaf_bounce.lock
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Bouncing local node; timeout=
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: node ID = 148751
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user name = root
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user ID = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Start force shut down of osafdtmd, PID = 21729
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Waiting for 5 seconds to allow processes to shut down

Thanks

_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
