I'm looking at extending the initial discovery timeout. We have a case where both SC nodes were rebooted and started a few seconds apart. SC-1 became active and started the application, but that application makes an Ethernet change and performs an if-down followed by an if-up.
Meanwhile, SC-2 started and happened to send its initial broadcast while SC-1's Ethernet was going through this if-down/up. As a result, SC-2 didn't receive a response to its broadcast message and went Active. Thus, SC-1 and SC-2 ended up in split-brain.

I thought I could change DTM_INI_DIS_TIMEOUT_SECS from 5 to 15 to see if this would help. I chose a large number so I could tell whether opensaf would actually wait that long before proceeding to go Active.

DTM_INI_DIS_TIMEOUT_SECS=15

I then started opensaf on the SC-1 node (opensaf was stopped on SC-2). However, based on the timestamps in the logs, I'm not seeing opensaf wait 15 seconds before proceeding to go Active; it waited only 3 seconds:

Jul 9 10:19:47 dhoyt-sc-1 osafrded[29131]: NO Requesting ACTIVE role
...
Jul 9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Switched to ACTIVE from Undefined

Is my assumption wrong, or did I update the wrong variable?

Here are the startup logs for SC-1:

Jul 9 10:19:47 dhoyt-sc-1 opensafd: Starting OpenSAF Services(5 - @OSAF_LONG_GIT_REV@) (Using TCP)
Jul 9 10:19:47 dhoyt-sc-1 opensafd: Reboot file /var/log/opensaf/clm_cluster_reboot_in_progress not found, startup continue...
Jul 9 10:19:47 dhoyt-sc-1 opensafd[29075]: logtrace: trace enabled to file 'opensafd.log', mask=0x0
Jul 9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO svc_monitor_thread is up and in ready state
Jul 9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
Jul 9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: Started
Jul 9 10:19:47 dhoyt-sc-1 osaftransportd[29101]: mkfifo already exists: /var/lib/opensaf/osaftransportd.fifo File exists
Jul 9 10:19:47 dhoyt-sc-1 osaftransportd[29101]: Started
Jul 9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: NO Cluster name: 'dhoyt-gr' (from file '/opt/opensaf/peers')
Jul 9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: NO Number of unicast peers: 1
Jul 9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of TRANSPORT started
Jul 9 10:19:47 dhoyt-sc-1 osafclmna[29115]: mkfifo already exists: /var/lib/opensaf/osafclmna.fifo File exists
Jul 9 10:19:47 dhoyt-sc-1 osafclmna[29115]: Started
Jul 9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of CLMNA started
Jul 9 10:19:47 dhoyt-sc-1 osafrded[29131]: mkfifo already exists: /var/lib/opensaf/osafrded.fifo File exists
Jul 9 10:19:47 dhoyt-sc-1 osafrded[29131]: Started
Jul 9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of RDE started
Jul 9 10:19:47 dhoyt-sc-1 osaffmd[29147]: mkfifo already exists: /var/lib/opensaf/osaffmd.fifo File exists
Jul 9 10:19:47 dhoyt-sc-1 osaffmd[29147]: Started
Jul 9 10:19:47 dhoyt-sc-1 osaffmd[29147]: NO Configuration file located at /etc/opensaf/fmd.conf
Jul 9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of HLFM started
Jul 9 10:19:47 dhoyt-sc-1 osafimmd[29164]: mkfifo already exists: /var/lib/opensaf/osafimmd.fifo File exists
Jul 9 10:19:47 dhoyt-sc-1 osafimmd[29164]: Started
Jul 9 10:19:47 dhoyt-sc-1 osafimmd[29164]: NO ******* SC_ABSENCE_ALLOWED (Headless Hydra) is configured: 900 ***********
Jul 9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of IMMD started
Jul 9 10:19:47 dhoyt-sc-1 osafimmnd[29182]: mkfifo already exists: /var/lib/opensaf/osafimmnd.fifo File exists
Jul 9 10:19:47 dhoyt-sc-1 osafimmnd[29182]: Started
Jul 9 10:19:47 dhoyt-sc-1 osafimmnd[29182]: NO Use default reserved class names.
Jul 9 10:19:47 dhoyt-sc-1 osafclmna[29115]: NO Starting to promote this node to a system controller
Jul 9 10:19:47 dhoyt-sc-1 osafrded[29131]: NO Requesting ACTIVE role
Jul 9 10:19:47 dhoyt-sc-1 osafrded[29131]: NO RDE role set to Undefined
Jul 9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Active controller set to SC-1
Jul 9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Running '/usr/lib64/opensaf/opensaf_sc_active' with 0 argument(s)
Jul 9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Switched to ACTIVE from Undefined
Jul 9 10:19:50 dhoyt-sc-1 osaffmd[29147]: NO Controller promoted. Stop supervision timer

Regards,
David

_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users