Hi Gary,

We have been running some test cases to verify the consensus service and are 
running into an issue.
The test case: from the node with the active osaf controller, issue an 
iptables drop command for the mate's IP address.
Expected outcome: the original standby should lose its connection with the 
active controller and try to take activity.
However, since the active osaf controller can still communicate with etcd, it 
should reject the takeover request.

Issue: the original standby's takeover request does get rejected, but shortly 
afterwards, the active osaf controller loses its connection with etcd and also 
goes for a reboot.

Let's say the IPs are as follows:
Node with osaf Active: 123.45.6.77
Node with osaf Standby: 123.45.6.88

From the 'Active' node, the following iptables command was issued:
iptables -I INPUT 1 -s 123.45.6.88 -j DROP

As stated, the original standby tried to go active but was rejected and 
rebooted. That's expected.
However, shortly after the original active rejected the request, it lost its 
connection with etcd.
And yes, the 3 etcd servers have different IPs from the original standby's IP.
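
To double-check that the DROP rule cannot match any etcd server, a small sketch along these lines may help. The etcd addresses below are placeholders invented for illustration (the real ones are not given in this report), and the helper name endpoints_matched_by_rule is likewise hypothetical. It only compares addresses against the rule's '-s' value; it does not inspect the live iptables state.

```python
import ipaddress


def endpoints_matched_by_rule(source_spec, endpoints):
    """Return the endpoints whose address falls inside an iptables '-s' spec.

    source_spec may be a single address ("123.45.6.88") or a CIDR range.
    """
    blocked = ipaddress.ip_network(source_spec, strict=False)
    return [ep for ep in endpoints if ipaddress.ip_address(ep) in blocked]


# Source address used in the DROP rule from this report.
DROP_SOURCE = "123.45.6.88"

# ASSUMPTION: placeholder etcd server addresses, not taken from the report.
ETCD_ENDPOINTS = ["123.45.6.101", "123.45.6.102", "123.45.6.103"]

matched = endpoints_matched_by_rule(DROP_SOURCE, ETCD_ENDPOINTS)
print("etcd endpoints matched by the DROP rule:", matched)
```

An empty result supports the statement that the rule only affects traffic from the standby. A non-empty result would mean the rule itself is cutting off etcd.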

Here are the logs from the original Active:
Jul 30 16:01:49 dhoyt-ha-1 osafdtmd[5862]: ER recv() from node 0x2030f failed, 
errno=110
Jul 30 16:01:49 dhoyt-ha-1 osafdtmd[5862]: NO Lost contact with 'SC-3'
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO AVD down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO AMFND down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO FM down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO IMMD down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Core services went down on 
node_id: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO IMMND down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Core services went down on 
node_id: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Node Down event for node id 2030f:
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: NO MDS event from svc_id 24 
(change:6, dest:13)
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: NO MDS event from svc_id 25 
(change:4, dest:566312912819188)
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: WA IMMD lost contact with peer IMMD 
(NCSMDS_RED_DOWN)
Jul 30 16:01:49 dhoyt-ha-1 osafrded[5905]: NO Peer down on node 0x2030f
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not 
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Current role: ACTIVE
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not 
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not 
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not 
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: Rebooting OpenSAF NodeId = 131855 EE 
Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, 
SupervisionTime = 60
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO Node 'SC-3' is down. Start 
failover delay timer
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Recent fevs:
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO 
<2935>[IMMND_EVT_A2ND_OI_OBJ_MODIFY -> safSg=ProcessMonitor,safApp=Insight]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO 
<2936>[IMMND_EVT_A2ND_OI_OBJ_MODIFY -> safSg=ProcessMonitor,safApp=Insight]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO 
<2937>[IMMND_EVT_A2ND_ADMO_RELEASE -> admo_id:5]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO 
<2938>[IMMND_EVT_A2ND_ADMO_FINALIZE -> admo_id:5]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO 
<2939>[IMMND_EVT_D2ND_DISCARD_NODE -> node_id:2030f]
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO Start timer for '2030f'
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Global discard node received for 
nodeId:2030f pid:5108
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Implementer disconnected 7 <0, 
2030f(down)> (@safAmfService2030f)
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO (Consensus::IsWritable): {1} 
Calling 'Set'
Jul 30 16:01:49 dhoyt-ha-1 opensaf_reboot: Rebooting remote node in the absence 
of PLM is outside the scope of OpenSAF
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin set "opensaf_write_test" "SC-1" 0', returning 0
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin watch "takeover_request"', returning 0
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::WatchKeyFunction): 
Read 'SC-1 SC-3 1 NEW'
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult): 
Found 'SC-1 SC-3 1 NEW'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO 
(Consensus::HandleTakeoverRequest): Calling 'ParseTakeoverRequest'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult): 
Found 'SC-1 SC-3 1 NEW'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO 
(Consensus::HandleTakeoverRequest): Calling 'WriteTakeoverResult'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO TakeoverResult: SC-1 SC-3 1 
REJECTED
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult): 
Calling 'Set'
Jul 30 16:02:02 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin set_if_prev "takeover_request" "SC-1 SC-3 1 
REJECTED" "SC-1 SC-3 1 NEW" 20', returning 0
Jul 30 16:02:02 dhoyt-ha-1 osafrded[5905]: NO 
(Consensus::MonitorTakeoverRequest): Calling 'Watch'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin watch "takeover_request"', returning 0
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::WatchKeyFunction): 
Read ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Node failover timeout
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Completing delayed node failover 
for 'SC-3'
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Node 'SC-3' left the cluster
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO Empty takeover request from watch 
command. Read it again.
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO 
(Consensus::HandleTakeoverRequest): {1} Calling 'ReadTakeoverRequest'
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO 
(Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO 
(Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:32 dhoyt-ha-1 osafrded[5905]: NO 
(Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:32 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO Lost connectivity to consensus 
service
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: Quick local node rebooting, Reason: 
Lost connectivity to consensus service. Rebooting this node
Jul 30 16:02:33 dhoyt-ha-1 opensaf_reboot: Do quick local node reboot

After the takeover request was rejected, the active does a 
MonitorTakeoverRequest and then a watch of takeover_request.
The watch returns an empty value, and each following get of takeover_request 
returns 1 with nothing read. It retries this 10 times and then logs that it 
lost connectivity to the consensus service.
Yet, all three etcd servers are still up and running fine.
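
For anyone who wants to tally the plugin calls in a log like the one above, here is a rough sketch. It assumes the syslog layout shown in this mail (entries starting with "Mon DD HH:MM:SS" and wrapped by the mail client) and the "Executed '...osaf-etcd3.plugin <op> ...', returning <rc>" wording seen in these logs. The function names unwrap and tally_plugin_calls are made up for illustration and are not part of OpenSAF.

```python
import re
from collections import Counter

# Start of a syslog entry, e.g. "Jul 30 16:02:24 ".
ENTRY_START = re.compile(r"[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")

# Matches e.g.:
#   Executed '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
PLUGIN_CALL = re.compile(
    r"Executed '\S*osaf-etcd3\.plugin (?P<op>\w+) .*?', returning (?P<rc>\d+)"
)


def unwrap(lines):
    """Rejoin log entries that were wrapped across several lines."""
    entry = ""
    for line in lines:
        line = line.strip()
        if ENTRY_START.match(line):
            if entry:
                yield entry
            entry = line
        elif line:
            # Continuation of the previous entry; wraps happen at spaces.
            entry = (entry + " " + line).strip()
    if entry:
        yield entry


def tally_plugin_calls(lines):
    """Count (operation, return code) pairs for etcd plugin invocations."""
    counts = Counter()
    for entry in unwrap(lines):
        match = PLUGIN_CALL.search(entry)
        if match:
            counts[(match.group("op"), int(match.group("rc")))] += 1
    return counts
```

Feeding it the saved log, for example tally_plugin_calls(open("messages")) with "messages" standing in for wherever the log is stored, gives a count per operation and return code, which makes it easy to see how many get calls returned 1 before the reboot.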

Any ideas?

Regards,
David

