Hi Gary,
Sorry for the late response.
I found I had to make a change to the procedure handle_mbx_event() in file
/src/rde/rded/rde_main.cc to get things working.
Recap:
After the original Active REJECTED the standby’s takeover_request, the standby
went for a reboot. This is all good and expected.
Next, the active went back and performed several “gets” on the
takeover_request. These all returned an empty string. Again, I would think this
is expected now that the standby has rebooted.
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
After performing 10 'get takeover_request' operations, the active concludes that
it has lost connectivity to the consensus service and, as a result, reboots
itself. This is NOT good.
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO Lost connectivity to consensus
service
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: Quick local node rebooting, Reason:
Lost connectivity to consensus service. Rebooting this node
Jul 30 16:02:33 dhoyt-ha-1 opensaf_reboot: Do quick local node reboot
However, it didn’t really lose connectivity: etcd was responding, but the data
string was empty.
As stated, I found that the procedure handle_mbx_event() in file
/src/rde/rded/rde_main.cc is where the decision is made that the connection to
the consensus service has been lost. I made the following change and everything
works fine now:
--- a/src/rde/rded/rde_main.cc 2019-10-21 20:51:33.000000000 -0400
+++ b/src/rde/rded/rde_main.cc 2020-08-04 18:30:12.665384942 -0400
@@ -244,17 +250,21 @@
     }
   }
   if (fencing_required == true) {
-    LOG_NO("Lost connectivity to consensus service");
-    if (consensus_service.IsRemoteFencingEnabled() == false) {
-      opensaf_quick_reboot("Lost connectivity to consensus service. "
-          "Rebooting this node");
-    }
+
+    //LOG_NO("Lost connectivity to consensus service");
+    //if (consensus_service.IsRemoteFencingEnabled() == false) {
+    //  opensaf_quick_reboot("Lost connectivity to consensus service. "
+    //      "Rebooting this node");
+    //}
+    LOG_NO("(RDE_MSG_TAKEOVER_REQUEST_CALLBACK): for now, do nothing");
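For what it’s worth, commenting the block out was just to unblock our testing.
A safer change might be to fence only when the plugin actually failed to reach
etcd, rather than whenever the takeover_request value comes back empty. The
sketch below is only an illustration of that idea; LastErrorWasCommFailure() is
a placeholder helper I made up, not something that exists in the current
Consensus/KeyValue classes:

  if (fencing_required == true) {
    // Hypothetical helper: true only if the etcd plugin could not be
    // reached at all (timeout/connection error), false if etcd answered
    // but the key was empty or missing.
    if (consensus_service.LastErrorWasCommFailure()) {
      LOG_NO("Lost connectivity to consensus service");
      if (consensus_service.IsRemoteFencingEnabled() == false) {
        opensaf_quick_reboot("Lost connectivity to consensus service. "
                             "Rebooting this node");
      }
    } else {
      // etcd is reachable; the takeover_request key has simply been
      // removed (e.g. the rejected standby already rebooted), so keep
      // the active node running instead of fencing it.
      LOG_NO("takeover_request is empty; not treating this as lost "
             "connectivity");
    }
  }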
Regards,
David
From: Gary Lee <[email protected]>
Sent: Thursday, July 30, 2020 7:55 PM
To: Hoyt, David <[email protected]>
Cc: [email protected]
Subject: Re: losing connectivity to consensus service
Hi David
I can't see an obvious cause - do you have the etcd logs? Do they give a clue?
Gary
________________________________
From: Hoyt, David <[email protected]>
Sent: 31 July 2020 06:24
To: Gary Lee <[email protected]>
Cc: [email protected]
Subject: losing connectivity to consensus service
Hi Gary,
We have been running some test cases to verify the consensus service and are
running into an issue.
The test case is: From the node with the active osaf controller, issue an
iptables command to drop traffic from the mate’s IP address.
Expected outcome: the original standby should lose its connection with the
active controller and try to take the active role.
However, since the active osaf controller can still communicate with etcd, it
should reject the takeover request.
Issue: the original standby’s takeover request does get rejected, but shortly
afterwards the active osaf controller also loses its connection with etcd and
goes for a reboot.
Let’s say the IPs are as follows:
Node with osaf Active: 123.45.6.77
Node with osaf Standby: 123.45.6.88
From the ‘Active’ node, the following iptables command was issued:
iptables -I INPUT 1 -s 123.45.6.88 -j DROP
As stated, the original standby tried to go active but was rejected and
rebooted. That’s expected.
However, shortly after the original active rejected the request, it lost its
connection with etcd.
And yes, the three etcd servers have IPs different from the original standby’s,
so they are not affected by the iptables rule.
Here are the logs from the original Active:
Jul 30 16:01:49 dhoyt-ha-1 osafdtmd[5862]: ER recv() from node 0x2030f failed,
errno=110
Jul 30 16:01:49 dhoyt-ha-1 osafdtmd[5862]: NO Lost contact with 'SC-3'
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO AVD down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO AMFND down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO FM down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO IMMD down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Core services went down on
node_id: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO IMMND down on: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Core services went down on
node_id: 2030f
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Node Down event for node id 2030f:
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: NO MDS event from svc_id 24
(change:6, dest:13)
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: NO MDS event from svc_id 25
(change:4, dest:566312912819188)
Jul 30 16:01:49 dhoyt-ha-1 osafimmd[5938]: WA IMMD lost contact with peer IMMD
(NCSMDS_RED_DOWN)
Jul 30 16:01:49 dhoyt-ha-1 osafrded[5905]: NO Peer down on node 0x2030f
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: NO Current role: ACTIVE
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osafclmd[6701]: NO Node 131855 went down. Not
sending track callback for agents on that node
Jul 30 16:01:49 dhoyt-ha-1 osaffmd[5921]: Rebooting OpenSAF NodeId = 131855 EE
Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343,
SupervisionTime = 60
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO Node 'SC-3' is down. Start
failover delay timer
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Recent fevs:
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO
<2935>[IMMND_EVT_A2ND_OI_OBJ_MODIFY -> safSg=ProcessMonitor,safApp=Insight]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO
<2936>[IMMND_EVT_A2ND_OI_OBJ_MODIFY -> safSg=ProcessMonitor,safApp=Insight]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO
<2937>[IMMND_EVT_A2ND_ADMO_RELEASE -> admo_id:5]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO
<2938>[IMMND_EVT_A2ND_ADMO_FINALIZE -> admo_id:5]
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO
<2939>[IMMND_EVT_D2ND_DISCARD_NODE -> node_id:2030f]
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO Start timer for '2030f'
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Global discard node received for
nodeId:2030f pid:5108
Jul 30 16:01:49 dhoyt-ha-1 osafimmnd[5956]: NO Implementer disconnected 7 <0,
2030f(down)> (@safAmfService2030f)
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO (Consensus::IsWritable): {1}
Calling 'Set'
Jul 30 16:01:49 dhoyt-ha-1 opensaf_reboot: Rebooting remote node in the absence
of PLM is outside the scope of OpenSAF
Jul 30 16:01:49 dhoyt-ha-1 osafamfd[6718]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin set "opensaf_write_test" "SC-1" 0', returning 0
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin watch "takeover_request"', returning 0
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::WatchKeyFunction):
Read 'SC-1 SC-3 1 NEW'
Jul 30 16:01:57 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-3 1 NEW'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO
(Consensus::HandleTakeoverRequest): Calling 'ParseTakeoverRequest'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-3 1 NEW'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO
(Consensus::HandleTakeoverRequest): Calling 'WriteTakeoverResult'
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO TakeoverResult: SC-1 SC-3 1
REJECTED
Jul 30 16:02:01 dhoyt-ha-1 osafrded[5905]: NO (Consensus::WriteTakeoverResult):
Calling 'Set'
Jul 30 16:02:02 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin set_if_prev "takeover_request" "SC-1 SC-3 1
REJECTED" "SC-1 SC-3 1 NEW" 20', returning 0
Jul 30 16:02:02 dhoyt-ha-1 osafrded[5905]: NO
(Consensus::MonitorTakeoverRequest): Calling 'Watch'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin watch "takeover_request"', returning 0
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::WatchKeyFunction):
Read ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:24 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:25 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Node failover timeout
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Completing delayed node failover
for 'SC-3'
Jul 30 16:02:29 dhoyt-ha-1 osafamfd[6718]: NO Node 'SC-3' left the cluster
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO Empty takeover request from watch
command. Read it again.
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO
(Consensus::HandleTakeoverRequest): {1} Calling 'ReadTakeoverRequest'
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:29 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO
(Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:30 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO
(Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:31 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:32 dhoyt-ha-1 osafrded[5905]: NO
(Consensus::HandleTakeoverRequest): {2} Calling 'ReadTakeoverRequest'
Jul 30 16:02:32 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (KeyValue::Get): Read: ''
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: NO Lost connectivity to consensus
service
Jul 30 16:02:33 dhoyt-ha-1 osafrded[5905]: Quick local node rebooting, Reason:
Lost connectivity to consensus service. Rebooting this node
Jul 30 16:02:33 dhoyt-ha-1 opensaf_reboot: Do quick local node reboot
After the takeover request was rejected, it does a MonitorTakeoverRequest and
then a watch of the takeover_request.
It performs this 10 times and then logs that it has lost connectivity to the
consensus service.
Yet, all three etcd servers are still up and running fine.
Any ideas?
Regards,
David