Hi,

We are using OpenSAF 5.1.0 in an environment of six machines: two controllers and four payloads. For several different reasons, such as incorrect configuration, some of our components will start correctly, hit an error and then exit. Since these components are configured with a recommended recovery of SA_AMF_COMPONENT_FAILOVER (which results in a service unit failover), this can cause service units to bounce constantly until whatever is causing the error is corrected. Unfortunately, this can happen in the middle of the night, and since these environments are not monitored 24/7 the result is many service units bouncing continuously without any correction. After a while of this constant bouncing, OpenSAF appears to be overwhelmed by the resulting notifications, which in turn destabilizes the cluster and can cause the following:

1. Execution of admin commands times out.
2. One or both of the controllers reboot.

Questions:

1) Is there any way to throttle the notifications in this scenario?
2) Is there any way for OpenSAF to automatically lock a service unit for instantiation when it detects the service unit bouncing this many times?
3) Is there any way to trigger a lock/lock-in of the SU from the component or from the instantiation/cleanup script?

Note: When the bouncing is occurring, it is not because the instantiation and cleanup scripts are failing; the components themselves encounter a fatal error that causes them to exit.
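To make question 3 concrete, the kind of thing we have in mind is a small guard that our cleanup script could invoke, roughly like the sketch below. Everything in it (the threshold, the window, the state-file path, the example SU DN) is a placeholder for whatever we would actually configure, and whether it is even safe to drive amf-adm from a CLC-CLI cleanup script like this is part of what we are asking:

#!/usr/bin/env python3
"""Rough sketch only, not production code.

Our CLC-CLI cleanup script would invoke this with the SU DN after each
cleanup. If the same SU is cleaned up more than MAX_BOUNCES times within
WINDOW_SECONDS, we lock and then lock-in the SU so it stops bouncing.
"""
import json
import os
import subprocess
import sys
import time

MAX_BOUNCES = 5                           # placeholder threshold
WINDOW_SECONDS = 600                      # placeholder window
STATE_DIR = "/var/run/su_bounce_guard"    # placeholder location


def record_bounce(su_dn):
    """Record a timestamp for this SU and return how many fall in the window."""
    os.makedirs(STATE_DIR, exist_ok=True)
    state_file = os.path.join(STATE_DIR, su_dn + ".json")
    now = time.time()
    try:
        with open(state_file) as f:
            stamps = json.load(f)
    except (IOError, ValueError):
        stamps = []
    stamps = [t for t in stamps if now - t < WINDOW_SECONDS]
    stamps.append(now)
    with open(state_file, "w") as f:
        json.dump(stamps, f)
    return len(stamps)


def lock_su(su_dn):
    """Lock and then lock-in the SU via amf-adm.

    We would probably have to run this detached rather than synchronously
    from the cleanup script, since the lock itself triggers termination and
    cleanup of the SU's components.
    """
    subprocess.call(["amf-adm", "lock", su_dn])
    subprocess.call(["amf-adm", "lock-in", su_dn])


if __name__ == "__main__":
    # e.g. "safSu=SU1,safSg=SG1,safApp=OurApp" (placeholder DN)
    su_dn = sys.argv[1]
    if record_bounce(su_dn) > MAX_BOUNCES:
        lock_su(su_dn)

We would rather not maintain something like this ourselves if AMF already provides an equivalent escalation that ends with the SU locked for instantiation.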
Here are some details from the OpenSAF /var/log/messages output during one of our controller node failures. osafamfd began to report many failures when sending state change notifications (because some of our service units were bouncing frequently):

May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: NO Node 'rbm-fe-s1-h1' left the cluster
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendAlarmNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
.........
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafamfd[21932]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send to ntfa failed rc: 2
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED
May 28 10:47:01 rbm-fe-s2-h1 osafntfd[21898]: ER ntfs_mds_msg_send FAILED

All of this then leads to an AMF director heartbeat timeout, which forces a node restart:

May 28 10:47:58 rbm-fe-s2-h1 osafdtmd[21729]: NO Lost contact with 'rbm-fe-s1-h2'
May 28 10:47:58 rbm-fe-s2-h1 osafimmd[21801]: NO MDS event from svc_id 25 (change:4, dest:604795819815973)
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Global discard node received for nodeId:2260f pid:29733
May 28 10:47:58 rbm-fe-s2-h1 osafimmnd[21819]: NO Implementer disconnected 14 <0, 2260f(down)> (MsgQueueService140815)
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:58 rbm-fe-s2-h1 osafclmd[21915]: NO Node 140815 went down. Not sending track callback for agents on that node
May 28 10:47:59 rbm-fe-s2-h1 osafamfnd[21949]: Rebooting OpenSAF NodeId = 148751 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 148751, SupervisionTime = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Creating lock file: /tmp/opensaf_bounce.lock
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Bouncing local node; timeout=
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: node ID = 148751
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user name = root
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Current user ID = 0
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Start force shut down of osafdtmd, PID = 21729
May 28 10:47:59 rbm-fe-s2-h1 opensaf_bounce: Waiting for 5 seconds to allow processes to shut down

Thanks