I'm looking at extending the initial discovery timeout.
We had a case where both SC nodes were rebooted and started a few seconds 
apart.
SC-1 became active and started the application, but that application makes an 
Ethernet change and performs an if-down followed by an if-up.

Meanwhile, SC-2 started and happened to send its initial broadcast while 
SC-1's Ethernet interface was going through this if-down/if-up.
As a result, SC-2 didn't receive a response to its broadcast message and went 
Active.
Thus, SC-1 and SC-2 ended up in split-brain.
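
To make the race concrete, here is a toy model of it in Python. It is only a sketch, not OpenSAF's actual discovery logic: the function name, the parameters, and the idea of an optional rebroadcast interval are all hypothetical, and all times are made-up seconds. It shows that if the broadcast is sent only once, a longer wait cannot help when that single broadcast falls inside the link outage.

```python
def peer_discovered(first_broadcast, timeout, outage, rebroadcast_every=None):
    """Toy model (hypothetical, not OpenSAF code).

    Returns True if at least one broadcast is sent while the peer's link
    is up, i.e. outside the half-open outage window [start, end).
    """
    outage_start, outage_end = outage
    deadline = first_broadcast + timeout
    t = first_broadcast
    while t <= deadline:
        if not (outage_start <= t < outage_end):
            return True  # peer could hear this broadcast and answer
        if rebroadcast_every is None:
            break  # single-shot broadcast: no second chance
        t += rebroadcast_every
    return False


# Peer link is down from t=0.5s to t=2.5s; our broadcast goes out at t=1.0s.
print(peer_discovered(1.0, 5.0, (0.5, 2.5)))                          # False
print(peer_discovered(1.0, 15.0, (0.5, 2.5)))                         # False
print(peer_discovered(1.0, 5.0, (0.5, 2.5), rebroadcast_every=1.0))   # True
```

In this model, raising the timeout from 5 to 15 only helps if the broadcast is repeated during the wait. Whether OpenSAF does that is not something I know from the logs in this mail.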

I thought I could change DTM_INI_DIS_TIMEOUT_SECS from 5 to 15 to see if this 
would help. I deliberately chose a large value to check whether OpenSAF would 
actually wait that long before deciding to proceed with going Active.

DTM_INI_DIS_TIMEOUT_SECS=15
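
One way to rule out a simple editing mistake (a commented-out line, or a second assignment further down the file) is to print the value that would actually be picked up. The sketch below is a minimal Python check under assumptions: the helper name is hypothetical, the sample text is invented, and "last uncommented KEY=VALUE wins" is my assumption about how the file is interpreted, not something confirmed for OpenSAF.

```python
def effective_setting(conf_text, key):
    """Return the last uncommented value assigned to `key`, or None.

    Assumes simple KEY=VALUE lines with '#' comments (an assumption about
    the file format, not a statement of how OpenSAF parses it).
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        name, _, val = line.partition("=")
        if name.strip() == key:
            value = val.strip()  # later assignments override earlier ones
    return value


# Invented sample content, for illustration only.
sample = """\
# DTM_INI_DIS_TIMEOUT_SECS=5
DTM_INI_DIS_TIMEOUT_SECS=15
"""
print(effective_setting(sample, "DTM_INI_DIS_TIMEOUT_SECS"))  # 15
```

To run it against a real file, read the file's contents into a string and pass that in place of `sample`.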

I then started OpenSAF on the SC-1 node (OpenSAF was stopped on SC-2).
However, based on the log timestamps, OpenSAF did not wait 15 seconds before 
proceeding to go Active. It waited only 3 seconds:
Jul  9 10:19:47 dhoyt-sc-1 osafrded[29131]: NO Requesting ACTIVE role
...
Jul  9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Switched to ACTIVE from Undefined
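
For reference, the 3-second figure comes straight from the two timestamps above. A small Python sketch that computes the gap between two syslog lines follows; the function name is hypothetical, and because syslog timestamps carry no year, the `year` default is an arbitrary placeholder. Parsing the month name with `%b` assumes an English locale.

```python
from datetime import datetime


def syslog_gap_seconds(first_line, second_line, year=2000):
    """Seconds between two syslog lines that start with 'Mon D HH:MM:SS'.

    `year` is an arbitrary placeholder: the log lines do not include one.
    """
    def parse(line):
        stamp = " ".join(line.split()[:3])  # e.g. "Jul 9 10:19:47"
        return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

    return (parse(second_line) - parse(first_line)).total_seconds()


requesting = "Jul  9 10:19:47 dhoyt-sc-1 osafrded[29131]: NO Requesting ACTIVE role"
switched = "Jul  9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Switched to ACTIVE from Undefined"
print(syslog_gap_seconds(requesting, switched))  # 3.0
```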

Is my assumption wrong or did I update the wrong variable?

Here are the startup logs for SC-1:
Jul  9 10:19:47 dhoyt-sc-1 opensafd: Starting OpenSAF Services(5 - @OSAF_LONG_GIT_REV@) (Using TCP)
Jul  9 10:19:47 dhoyt-sc-1 opensafd: Reboot file /var/log/opensaf/clm_cluster_reboot_in_progress not found, startup continue...
Jul  9 10:19:47 dhoyt-sc-1 opensafd[29075]: logtrace: trace enabled to file 'opensafd.log', mask=0x0
Jul  9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO svc_monitor_thread is up and in ready state
Jul  9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: mkfifo already exists: /var/lib/opensaf/osafdtmd.fifo File exists
Jul  9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: Started
Jul  9 10:19:47 dhoyt-sc-1 osaftransportd[29101]: mkfifo already exists: /var/lib/opensaf/osaftransportd.fifo File exists
Jul  9 10:19:47 dhoyt-sc-1 osaftransportd[29101]: Started
Jul  9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: NO Cluster name: 'dhoyt-gr' (from file '/opt/opensaf/peers')
Jul  9 10:19:47 dhoyt-sc-1 osafdtmd[29088]: NO Number of unicast peers: 1
Jul  9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of TRANSPORT started
Jul  9 10:19:47 dhoyt-sc-1 osafclmna[29115]: mkfifo already exists: /var/lib/opensaf/osafclmna.fifo File exists
Jul  9 10:19:47 dhoyt-sc-1 osafclmna[29115]: Started
Jul  9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of CLMNA started
Jul  9 10:19:47 dhoyt-sc-1 osafrded[29131]: mkfifo already exists: /var/lib/opensaf/osafrded.fifo File exists
Jul  9 10:19:47 dhoyt-sc-1 osafrded[29131]: Started
Jul  9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of RDE started
Jul  9 10:19:47 dhoyt-sc-1 osaffmd[29147]: mkfifo already exists: /var/lib/opensaf/osaffmd.fifo File exists
Jul  9 10:19:47 dhoyt-sc-1 osaffmd[29147]: Started
Jul  9 10:19:47 dhoyt-sc-1 osaffmd[29147]: NO Configuration file located at /etc/opensaf/fmd.conf
Jul  9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of HLFM started
Jul  9 10:19:47 dhoyt-sc-1 osafimmd[29164]: mkfifo already exists: /var/lib/opensaf/osafimmd.fifo File exists
Jul  9 10:19:47 dhoyt-sc-1 osafimmd[29164]: Started
Jul  9 10:19:47 dhoyt-sc-1 osafimmd[29164]: NO ******* SC_ABSENCE_ALLOWED (Headless Hydra) is configured: 900 ***********
Jul  9 10:19:47 dhoyt-sc-1 opensafd[29075]: NO Monitoring of IMMD started
Jul  9 10:19:47 dhoyt-sc-1 osafimmnd[29182]: mkfifo already exists: /var/lib/opensaf/osafimmnd.fifo File exists
Jul  9 10:19:47 dhoyt-sc-1 osafimmnd[29182]: Started
Jul  9 10:19:47 dhoyt-sc-1 osafimmnd[29182]: NO Use default reserved class names.
Jul  9 10:19:47 dhoyt-sc-1 osafclmna[29115]: NO Starting to promote this node to a system controller
Jul  9 10:19:47 dhoyt-sc-1 osafrded[29131]: NO Requesting ACTIVE role
Jul  9 10:19:47 dhoyt-sc-1 osafrded[29131]: NO RDE role set to Undefined
Jul  9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Active controller set to SC-1
Jul  9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Running '/usr/lib64/opensaf/opensaf_sc_active' with 0 argument(s)
Jul  9 10:19:50 dhoyt-sc-1 osafrded[29131]: NO Switched to ACTIVE from Undefined
Jul  9 10:19:50 dhoyt-sc-1 osaffmd[29147]: NO Controller promoted. Stop supervision timer


Regards,
David

