Hi Gary,

We have been running some test cases to verify the consensus service and are running into an issue.

Test case: From the node with the active osaf controller, issue an iptables DROP rule for the mate's IP address.

Expected outcome: The original standby should lose its connection with the active controller and try to take over activity. However, since the active osaf controller can still communicate with etcd, it should reject the takeover request.
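The expected outcome relies on the active controller still being able to reach etcd while the rule is in place. A minimal sketch of how that could be double-checked from the active node is below. It is not part of OpenSAF; the script name and the addresses in the usage comment are placeholders, and 2379 is only etcd's default client port, so adjust it if the servers listen elsewhere.

```python
import socket
import sys


def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints, timeout=2.0):
    """Map 'host:port' to True/False for each (host, port) pair given."""
    return {f"{host}:{port}": reachable(host, port, timeout)
            for host, port in endpoints}


if __name__ == "__main__":
    # Usage (placeholder addresses, not the real etcd servers):
    #   python3 check_etcd.py 192.0.2.11:2379 192.0.2.12:2379 192.0.2.13:2379
    pairs = []
    for arg in sys.argv[1:]:
        host, _, port = arg.rpartition(":")
        pairs.append((host, int(port)))
    for endpoint, ok in check_endpoints(pairs).items():
        print(f"{endpoint}: {'reachable' if ok else 'UNREACHABLE'}")
```

A successful TCP connect only shows that the port is open from this node; it does not prove that the etcd cluster is healthy or that the plugin's requests succeed.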
Issue: The original standby's takeover request does get rejected, but shortly afterwards the active osaf controller loses its connection with etcd and also goes for a reboot.

Let's say the IPs are as follows:

Node with osaf Active: 123.45.6.77
Node with osaf Standby: 123.45.6.88

From the 'Active' node, the following iptables command was issued:

iptables -I INPUT 1 -s 123.45.6.88 -j DROP

As stated, the original standby tried to go active but was rejected and rebooted. That's expected. However, shortly after the original active rejected the request, it lost its connection with etcd. And yes, the 3 etcd servers are on different IPs than the original standby.

Here are the logs from the original Active:

Jul 30 16:01:49 dhoyt-ha-1 osafdtmd[5862]: ER recv() from node 0x2030f failed, errno=110
Jul 30 16:01:49 dhoyt-ha-1 osafdtmd[5862]: NO Lost contact with 'SC-3'
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO AVD down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO AMFND down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO FM down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO IMMD down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Core services went down on node_id: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO IMMND down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Core services went down on node_id: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Node Down event for node id 2030f:
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: NO MDS event from svc_id 24 (change:6, dest:13)
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: NO MDS event from svc_id 25 (change:4, dest:566312912819188)
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Jul 30 16:01:49 dhoyt-ha-1 osafrded[5905]: NO Peer down on node 0x2030f
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Current role: ACTIVE
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: Rebooting OpenSAF NodeId = 131855 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO Node 'SC-3' is down. Start failover delay timer
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Recent fevs:
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO <2935>[IMMND_EVT_A2ND_OI_OBJ_MODIFY -> safSg=ProcessMonitor,safApp=Insight]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO <2936>[IMMND_EVT_A2ND_OI_OBJ_MODIFY -> safSg=ProcessMonitor,safApp=Insight]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO <2937>[IMMND_EVT_A2ND_ADMO_RELEASE -> admo_id:5]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO <2938>[IMMND_EVT_A2ND_ADMO_FINALIZE -> admo_id:5]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO <2939>[IMMND_EVT_D2ND_DISCARD_NODE -> node_id:2030f]
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO Start timer for '2030f'
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Global discard node received for nodeId:2030f pid:5108
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Implementer disconnected 7 <0, 2030f(down)> (@safAmfService2030f)
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO (Consensus::IsWritable): {1} Calling 'Set'
Jul 30 16:01:49 dhoyt-ha-1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin set "opensaf_write_test" "SC-1" 0', returning 0
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin watch "takeover_request"', returning 0
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::WatchKeyFunction): Read 'SC-1 SC-3 1 NEW'
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult): Found 'SC-1 SC-3 1 NEW'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::HandleTakeoverRequest): Calling 'ParseTakeoverRequest'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult): Found 'SC-1 SC-3 1 NEW'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::HandleTakeoverRequest): Calling 'WriteTakeoverResult'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO TakeoverResult: SC-1 SC-3 1 REJECTED
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult): Calling 'Set'
Jul 30 16:02:02 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin set_if_prev "takeover_request" "SC-1 SC-3 1 REJECTED" "SC-1 SC-3 1 NEW" 20', returning 0
Jul 30 16:02:02 dhoyt-ha-1 osafrded[5905]: NO (Consensus::MonitorTakeoverRequest): Calling 'Watch'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin watch "takeover_request"', returning 0
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::WatchKeyFunction): Read ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Node failover timeout
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Completing delayed node failover for 'SC-3'
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Node 'SC-3' left the cluster
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO Empty takeover request from watch command. Read it again.
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (Consensus::HandleTakeoverRequest): {1} Calling 'ReadTakeoverRequest'
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:32 dhoyt-ha-1 osafrded[5905]: NO (Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:32 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Calling 'Get'
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): Could not read takeover request (7)
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO Lost connectivity to consensus service
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: Quick local node rebooting, Reason: Lost connectivity to consensus service. Rebooting this node
Jul 30 16:02:33 dhoyt-ha-1 opensaf_reboot: Do quick local node reboot

After the takeover request was rejected, it does a MonitorTakeoverRequest and then a watch of takeover_request. The watch comes back empty, and the subsequent get of takeover_request fails (the plugin returns 1). It retries this 10 times and then logs that it lost connectivity to the consensus service. Yet all three etcd servers are still up and running fine.

Any ideas?

Regards,
David
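P.S. In case it is useful for comparing runs, here is a rough sketch of a script that pulls the relevant timings out of the osafrded syslog lines: how long the watch took to return, how many reads of the takeover request failed, and how long after the watch returned the node declared lost connectivity. The script and its function names are my own and not part of OpenSAF. It assumes the syslog format shown in the logs, and since syslog timestamps carry no year, it plugs in an arbitrary one, which is fine for differences within a day.

```python
import re
import sys
from datetime import datetime

# Matches e.g.:
# Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<proc>[\w.-]+)(?:\[\d+\])?:\s+(?P<msg>.*)$"
)


def parse_line(line, year=2000):
    """Return (timestamp, process, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = " ".join(m.group("ts").split())  # collapse padding such as "Jul  3"
    when = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return when, m.group("proc"), m.group("msg")


def summarize(lines):
    """Summarize watch duration and failed reads for osafrded entries."""
    watch_called = watch_returned = lost = None
    failed_reads = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[1] != "osafrded":
            continue
        ts, _, msg = parsed
        if "Calling 'Watch'" in msg:
            watch_called, watch_returned = ts, None
        elif watch_called and watch_returned is None and "plugin watch" in msg:
            watch_returned = ts
        elif "Could not read takeover request" in msg:
            failed_reads += 1
        elif "Lost connectivity to consensus service" in msg and lost is None:
            lost = ts

    def seconds(start, end):
        return (end - start).total_seconds() if start and end else None

    return {
        "watch_seconds": seconds(watch_called, watch_returned),
        "failed_reads": failed_reads,
        "seconds_from_watch_return_to_lost": seconds(watch_returned, lost),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 summarize_rded.py /path/to/syslog
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print(summarize(f))
```

The script only looks at osafrded entries, so lines from osafamfd and the other daemons are skipped.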