Hi all,

System setup:
2 SCs running version opensaf-5.19.10 plus patches:
  amf-3103.patch, amf-3147.patch, amf-3149.patch, dtm-3192.patch
Nodes: VMs running RHEL 7

Scenario:
On the host, the VM running the standby osaf controller is suspended via:

  virsh suspend <node-name>

The active osaf generates an osafdtmd log indicating that a TCP timeout has occurred (errno=110 - see logs below). As a result, the TCP sockets are closed on the node with the active osaf. The standby is restarted, comes up, and now that osaf controller also goes active. This is a split-brain, since the original active controller is still running as active.

Tried:
- restarting the original standby osaf: same result
- rebooting the node running the original standby osaf: same result
- restarting the original active osaf process: this worked

So, is this expected? Why does it require a restart of the original active osaf to get the TCP sockets re-opened? Should these sockets be tested periodically? Restarting the active osaf is not ideal, since the applications being managed by osaf get terminated as part of the osaf restart.

Here are the logs from the node running the active osaf controller:

Mar 17 13:45:38 node-sc1 osafdtmd[1795]: ER recv() from node 0x2020f failed, errno=110
Mar 17 13:45:38 node-sc1 osafdtmd[1795]: NO Lost contact with 'SC-2'
Mar 17 13:45:38 node-sc1 osafimmd[1934]: NO MDS event from svc_id 24 (change:6, dest:13)
Mar 17 13:45:38 node-sc1 osafimmd[1934]: NO MDS event from svc_id 25 (change:4, dest:565213401188265)
Mar 17 13:45:38 node-sc1 osafimmd[1934]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Mar 17 13:45:38 node-sc1 osafrded[1860]: NO Peer down on node 0x2020f
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafclmd[2005]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 17 13:45:38 node-sc1 osafamfd[2023]: NO Node 'SC-2' left the cluster
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO AVD down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO AMFND down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO FM down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO IMMD down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO IMMND down on: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO Core services went down on node_id: 2020f
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO Node Down event for node id 2020f:
Mar 17 13:45:38 node-sc1 osaffmd[1900]: NO Current role: ACTIVE

Regards,
David

_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users