Hi all,

System setup:
2 SCs running version opensaf-5.19.10 plus patches: amf-3103.patch, 
amf-3147.patch, amf-3149.patch, dtm-3192.patch
Nodes: VMs running RHEL 7

Scenario:
On the host, the VM running with the standby osaf controller is suspended via: 
virsh suspend <node-name>

The active osaf generates an osafdtmd log indicating that a TCP timeout has 
occurred (errno=110, see logs below).
As a result, the TCP sockets are closed on the node with the active osaf.
The standby is then restarted. When it comes up, that osaf controller also 
goes active. The result is split-brain, since the original active controller 
is still running as active.
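
For reference, errno 110 on Linux is ETIMEDOUT ("Connection timed out"). Below is a minimal Python sketch for decoding such a value from a log line. The helper name describe_errno is made up for illustration and is not part of OpenSAF.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# On Linux (including RHEL 7 on x86_64) ETIMEDOUT has the value 110,
# which is the value osafdtmd reported for the failed recv().
name, message = describe_errno(errno.ETIMEDOUT)
print(name, "-", message)
```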

Tried:
- Restarting the original standby osaf: same result.
- Rebooting the node running the original standby osaf: same result.
- Restarting the original active osaf process: this worked.


So, is this expected?
Why does it require a restart of the original active osaf to get the TCP 
sockets re-opened?
Should these sockets be tested periodically?
Restarting the active osaf is not an ideal workaround, since the applications 
being managed by osaf get terminated as part of the osaf restart.
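
To make the "tested periodically" question concrete, here is a rough sketch of what periodic testing of an idle TCP socket usually means at the socket level, namely kernel TCP keepalive. This is generic socket code, not osafdtmd's implementation. The function names and the timer values (5 s idle, 2 s interval, 3 probes) are assumptions chosen only for illustration.

```python
import socket


def enable_keepalive(sock, idle_s=5, interval_s=2, probes=3):
    """Ask the kernel to probe an idle TCP connection periodically.

    The default values are illustrative assumptions, not OpenSAF settings.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The three options below are Linux-specific; skip any that are missing.
    for opt_name, value in (("TCP_KEEPIDLE", idle_s),
                            ("TCP_KEEPINTVL", interval_s),
                            ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, opt_name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Approximate time for the kernel to declare a silent peer dead."""
    return idle_s + interval_s * probes


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    enable_keepalive(sock)
    print("approx. detection time:", keepalive_detection_seconds(5, 2, 3), "s")
finally:
    sock.close()
```

When keepalive probes go unanswered on Linux, a pending recv() fails with ETIMEDOUT, which is consistent with the errno=110 in the logs but does not prove that is what happened here. Keepalive only covers detecting the lost peer; it does not by itself re-open a connection once the peer returns.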

Here are the logs from the node running with the active osaf controller:
Mar 17 13:45:38 node-sc1 osafdtmd[1795]: ER recv() from node 0x2020f failed, errno=110
Mar 17 13:45:38 node-sc1 osafdtmd[1795]: NO Lost contact with 'SC-2'
Mar 17 13:45:38 node-sc1 osafimmd[1934]: NO MDS event from svc_id 24 (change:6, dest:13)
Mar 17 13:45:38 node-sc1 osafimmd[1934]: NO MDS event from svc_id 25 (change:4, dest:565213401188265)
Mar 17 13:45:38 node-sc1 osafimmd[1934]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Mar 17 13:45:38 node-sc1 osafrded[1860]: NO Peer down on node 0x2020f
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafamfd[2023]: NO Node 'SC-2' left the cluster
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO AVD down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO AMFND down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO FM down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO IMMD down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO IMMND down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO Core services went down on node_id: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO Node Down event for node id 2020f:
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO Current role: ACTIVE
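
Note that the logs print the node id in two notations: hex (0x2020f / 2020f) and decimal (131599). The short Python check below shows they refer to the same node, SC-2. The helper name node_id_from_hex is made up for illustration.

```python
def node_id_from_hex(text):
    """Parse a node id printed in hex, with or without a 0x prefix."""
    return int(text, 16)


node_id = node_id_from_hex("0x2020f")        # as printed by osafdtmd, osafrded
assert node_id == node_id_from_hex("2020f")  # as printed by osaffmd
assert node_id == 131599                     # as printed by osafclmd

# Arithmetic observation only: the upper 32 bits of the osafimmd "dest"
# value equal the same node id. Reading that as the MDS destination
# layout is an assumption, not something the logs state.
dest = 565213401188265
assert dest >> 32 == node_id
print(hex(node_id), node_id, dest & 0xFFFFFFFF)
```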


Regards,
David

