Hi David
Can you attach the logs from SC-1?
________________________________
From: Hoyt, David <[email protected]>
Sent: 08 July 2020 07:10
To: [email protected] <[email protected]>;
Gary Lee <[email protected]>
Subject: RE: consensus behaviour
With the FMS_TAKEOVER_REQUEST_VALID_TIME set to 20 seconds, I see the
max_takeover_retry_ threshold is 10, based on the following code within the
consensus.cc file:
takeover_valid_time_ = base::GetEnv("FMS_TAKEOVER_REQUEST_VALID_TIME", 20);
max_takeover_retry_ = takeover_valid_time_ / 2;
However, the loop always cycles just 5 or 6 times, executing the
/opt/opensaf/osaf-etcd3.plugin get "takeover_request" and then ultimately
returns REJECTED.
What code actually sets this “REJECTED” value?
I’ve been searching, but have yet to find the exact code. I’ve added some debug
logs just before I see a REJECTED value being assigned to result within the
consensus code area, but my logs are not getting generated.
I’d like to understand how or why that REJECTION value is being set and then
maybe I need to adjust or tweak some code specific for our use.
As it stands now, if the active node goes down, so does the standby due to this
takeover_request rejection. So, in the end, the opensaf consensus code doesn’t
work for us.
-David
From: Hoyt, David <[email protected]>
Sent: Tuesday, July 7, 2020 6:39 AM
To: [email protected]; Gary Lee <[email protected]>
Subject: Re: consensus behaviour
Yes, the nodes are VMs.
Now as part of the reboot, opensaf stops the application that's also running
active. This takes some time as the app has 21 components.
Is it possible that the shutdown is taking a long time and that SC-2's takeover
request is rejected because opensaf on SC-1 is still in the process of shutting
down the app?
-David
Get Outlook for Android<https://aka.ms/ghei36>
________________________________
From: Gary Lee <[email protected]<mailto:[email protected]>>
Sent: Tuesday, July 7, 2020 4:53:22 AM
To: Hoyt, David <[email protected]<mailto:[email protected]>>;
[email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>
Subject: Re: consensus behaviour
________________________________
NOTICE: This email was received from an EXTERNAL sender
________________________________
Hi David
I guess SC-1 is a VM so it is able to reboot quickly, and re-assert itself as
the active and reject the takeover request. You could adjust the
TAKEOVER_TIMEOUT value, but there is a lower bound (can't remember what from
the top of my head) because the etcd calls themselves can take some time to
complete.
The TCP arbitrator has a much faster failover response if that's what you need.
Gary
________________________________
From: Hoyt, David <[email protected]<mailto:[email protected]>>
Sent: 07 July 2020 03:44
To: Gary Lee <[email protected]<mailto:[email protected]>>;
[email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>
Subject: RE: consensus behaviour
Hi Gary,
I was off for most of last week so I’m just getting back to this.
Yes, node SC-1 is rebooted and node SC-2 goes active, but the takeover is
rejected.
I see several ReadTakeoverRequest being performed with the response: 'SC-1 SC-2
1 NEW'
Then finally, one response comes back as 'SC-1 SC-2 1 REJECTED'
Why does the “get takeover_request” response include REJECTED?
How is that determined?
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
REJECTED'
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-2 1 REJECTED'
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO Takeover request rejected
Here are the osaffmd debug logs showing the lead up to the REJECTED response:
Jul 2 20:56:51 node-sc2 osaffmd[31044]: NO Controller Failover: Setting role
to ACTIVE
Jul 2 20:56:51 node-sc2 osaffmd[31044]: NO Starting consensus service
supervision: 20 s
Jul 2 20:56:55 node-sc2 osaffmd[31044]: NO
(Consensus::CheckForExistingTakeoverRequest): {1} Calling 'ReadTakeoverRequest'
Jul 2 20:56:55 node-sc2 osaffmd[31044]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 2 20:56:56 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jul 2 20:56:56 node-sc2 osaffmd[31044]: NO (KeyValue::Get): Read: ''
Jul 2 20:56:56 node-sc2 osaffmd[31044]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jul 2 20:56:56 node-sc2 osaffmd[31044]: NO (Consensus::PromoteThisNode): {1}
Calling 'Lock'
Jul 2 20:56:58 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin lock "SC-2" 0', returning 1
Jul 2 20:56:58 node-sc2 osaffmd[31044]: NO Locked failed: SC-1
Jul 2 20:56:58 node-sc2 osaffmd[31044]: NO (Consensus::CurrentActive): {1}
Calling 'LockOwner'
Jul 2 20:56:58 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin lock_owner', returning 0
Jul 2 20:56:58 node-sc2 osaffmd[31044]: NO Current active controller is SC-1
Jul 2 20:56:58 node-sc2 osaffmd[31044]: NO (Consensus::CreateTakeoverRequest):
{1} Calling 'Create'
Jul 2 20:56:59 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin create "takeover_request" "SC-1 SC-2 1 NEW"
20', returning 0
Jul 2 20:56:59 node-sc2 osaffmd[31044]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jul 2 20:56:59 node-sc2 osaffmd[31044]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 2 20:57:00 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jul 2 20:57:00 node-sc2 osaffmd[31044]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
NEW'
Jul 2 20:57:01 node-sc2 osaffmd[31044]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jul 2 20:57:01 node-sc2 osaffmd[31044]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 2 20:57:02 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jul 2 20:57:02 node-sc2 osaffmd[31044]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
NEW'
Jul 2 20:57:03 node-sc2 osaffmd[31044]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jul 2 20:57:03 node-sc2 osaffmd[31044]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 2 20:57:03 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jul 2 20:57:03 node-sc2 osaffmd[31044]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
NEW'
Jul 2 20:57:04 node-sc2 osaffmd[31044]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jul 2 20:57:04 node-sc2 osaffmd[31044]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 2 20:57:05 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jul 2 20:57:05 node-sc2 osaffmd[31044]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
NEW'
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
REJECTED'
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-2 1 REJECTED'
Jul 2 20:57:06 node-sc2 osaffmd[31044]: NO Takeover request rejected
Jul 2 20:57:06 node-sc2 osaffmd[31044]: WA Takeover request failed (14)
Jul 2 20:57:06 node-sc2 osaffmd[31044]: ER A controller is already active. We
were separated from the cluster?
Regards,
David
From: Gary Lee <[email protected]<mailto:[email protected]>>
Sent: Monday, June 29, 2020 7:45 PM
To: Hoyt, David <[email protected]<mailto:[email protected]>>;
[email protected]<mailto:[email protected]>
Subject: Re: consensus behaviour
________________________________
NOTICE: This email was received from an EXTERNAL sender
________________________________
Hi David
Did you really reboot SC-1, or was it partially firewalled and etcd is still
able to communicate.
Gary
________________________________
From: Hoyt, David <[email protected]<mailto:[email protected]>>
Sent: 30 June 2020 01:46
To:
[email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>
Cc: Gary Lee <[email protected]<mailto:[email protected]>>
Subject: consensus behaviour
Hi all,
I’m seeing some weird behavior with the opensaf consensus service and just
wanted to get some feedback/explanation from the osaf experts.
Setup:
2 nodes: SC-1,SC-2
Running opensaf-5.19.10
Virtualization: kvm
Operating System: Red Hat Enterprise Linux Server 7.8 (Maipo)
Kernel: Linux 3.10.0-1127.el7.x86_64
Architecture: x86-64
The consenus service is provided by 3 etcd processes running on 3 separate
nodes. Opensaf is NOT co-located with any of the etcd processes.
As a result, I’m making use of the etcd3.plugin file that comes with the
opensaf source code distribution.
I found that opensaf initially establishes a connection with etcd and if I
perform a si-swap of the opensaf controllers, I see that etcd gets updated
appropriately with the name of the active opensaf controller.
The problem I’m seeing is if I reboot the node with the active opensaf
controller, the logs show the standby controller detects the loss of the active
and, as expected, the standby attempts to go active. Now, as part of this, I
see logs showing interaction with etcd. However, the ‘takeover’ request fails
and so the standby controller also goes for a reboot.
This is very consistent and reproducible.
Here’s a snapshot of the logs from the original standby showing the interaction
with etcd:
NOTE:
(1) the etcd3.plugin file has been renamed as: /opt/opensaf/osaf-etcd3.plugin
(2) I created an osaf patch for osaffmd that simply has some consensus logs
always turned “on”.
Jun 28 19:39:53 node-sc2 osafdtmd[11290]: ER recv() from node 0x2010f failed,
errno=110
Jun 28 19:39:53 node-sc2 osafdtmd[11290]: NO Lost contact with 'SC-1'
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO AVD down on: 2010f
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO AMFND down on: 2010f
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO FM down on: 2010f
Jun 28 19:39:53 node-sc2 osafamfd[11454]: NO Node 'SC-1' is down. Start
failover delay timer
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO IMMD down on: 2010f
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO IMMND down on: 2010f
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Core services went down on node_id:
2010f
Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO MDS event from svc_id 24
(change:1, dest:13)
Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO MDS event from svc_id 24
(change:6, dest:13)
Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO MDS event from svc_id 25
(change:4, dest:564113889574248)
Jun 28 19:39:53 node-sc2 osafamfd[11454]: NO Start timer for '2010f'
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Node Down event for node id 2010f:
Jun 28 19:39:53 node-sc2 osafrded[11334]: NO Peer down on node 0x2010f
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Current role: STANDBY
Jun 28 19:39:53 node-sc2 osaffmd[11350]: Rebooting OpenSAF NodeId = 131343 EE
Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131855,
SupervisionTime = 60
Jun 28 19:39:53 node-sc2 osafimmd[11367]: WA IMMD lost contact with peer IMMD
(NCSMDS_RED_DOWN)
Jun 28 19:39:53 node-sc2 osafimmd[11367]: WA IMMND DOWN on active controller
2010f detected at standby immd!! 2020f. Possible failover
Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO Skipping re-send of fevs message
2776 since it has recently been resent.
Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO Skipping re-send of fevs message
2777 since it has recently been resent.
Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA DISCARD DUPLICATE FEVS
message:2776
Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA Error code 2 returned for message
type 82 - ignoring
Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA DISCARD DUPLICATE FEVS
message:2777
Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA Error code 2 returned for message
type 82 - ignoring
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Controller Failover: Setting role
to ACTIVE
Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Starting consensus service
supervision: 20 s
Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO
(Consensus::CheckForExistingTakeoverRequest): {1} Calling 'ReadTakeoverRequest'
Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: ''
Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest):
Could not read takeover request (7)
Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (Consensus::PromoteThisNode): {1}
Calling 'Lock'
Jun 28 19:39:59 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin lock "SC-2" 0', returning 1
Jun 28 19:39:59 node-sc2 osaffmd[11350]: NO Locked failed: SC-1
Jun 28 19:39:59 node-sc2 osaffmd[11350]: NO (Consensus::CurrentActive): {1}
Calling 'LockOwner'
Jun 28 19:40:00 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin lock_owner', returning 0
Jun 28 19:40:00 node-sc2 osaffmd[11350]: NO Current active controller is SC-1
Jun 28 19:40:00 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest):
{1} Calling 'Create'
Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin create "takeover_request" "SC-1 SC-2 1 NEW"
20', returning 0
Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jun 28 19:40:01 node-sc2 osafimmnd[11385]: WA MDS Send Failed to service:IMMND
rc:2
Jun 28 19:40:01 node-sc2 osafimmnd[11385]: ER ERR_NO_RESOURCES: SearchNext -
Problem in sending to peer IMMND over MDS. Aborting searchNext.
Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
NEW'
Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-2 1 NEW'
Jun 28 19:40:02 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jun 28 19:40:02 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jun 28 19:40:03 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jun 28 19:40:03 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
NEW'
Jun 28 19:40:03 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-2 1 NEW'
Jun 28 19:40:04 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jun 28 19:40:04 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jun 28 19:40:05 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jun 28 19:40:05 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
NEW'
Jun 28 19:40:05 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-2 1 NEW'
Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest):
{3} Calling 'ReadTakeoverRequest'
Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest):
Calling 'Get'
Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0
Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1
REJECTED'
Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult):
Found 'SC-1 SC-2 1 REJECTED'
Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO Takeover request rejected
Jun 28 19:40:06 node-sc2 osaffmd[11350]: WA Takeover request failed (14)
Jun 28 19:40:06 node-sc2 osaffmd[11350]: ER A controller is already active. We
were separated from the cluster?
Jun 28 19:40:06 node-sc2 osaffmd[11350]: Quick local node rebooting, Reason: A
controller is already active. We were separated from the cluster?
Regards,
David
________________________________
Notice: This e-mail together with any attachments may contain information of
Ribbon Communications Inc. that is confidential and/or proprietary for the sole
use of the intended recipient. Any review, disclosure, reliance or distribution
by others or forwarding without express permission is strictly prohibited. If
you are not the intended recipient, please notify the sender immediately and
then delete all copies, including any attachments.
________________________________
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users