Yes I did.

Now, I’ve added more debug logs to the consensus.cc file and when the first SC 
starts up, the consensus service is established.

Because we are using the same 3 etcd services for several labs, we made the 
keyname unique so that there would be no collision.
As a result, the keyname for my lab is: dhoyt-gr_osaf_consensus_lock

After SC-1 is up, I checked the status with etcd and the following result was 
returned:
~ ]# ETCDCTL_API=3 /bin/etcdctl 
--endpoints=172.29.85.13:2379,172.29.85.31:2379,172.29.85.18:2379 get 
--from-key "dhoyt-gr_osaf_consensus_lock"
dhoyt-gr_osaf_consensus_lock
SC-1
opensaf_write_test
SC-1

I was surprised to see the key-value pair “opensaf_write_test:SC-1” returned.

Here is the log trace (the logs I added are those with “[D]”).
Just wondering why the next to last log is setting the “opensaf_write_test”?
Shouldn’t that be the actual keyname?

Jun 29 22:42:21 node-sc1 osafrded[8317]: NO Requesting ACTIVE role
Jun 29 22:42:21 node-sc1 osafrded[8317]: NO RDE role set to Undefined
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (Consensus::IsEnabled): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (Consensus::IsWritable): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (Consensus::IsWritable): {1} 
Calling 'Set'
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Set1): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Set1): Executing 
command: '/opt/opensaf/osaf-etcd3.plugin set "opensaf_write_test" "SC-1" 0'
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin set "opensaf_write_test" "SC-1" 0', returning 0
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (Consensus::PromoteThisNode): 
ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] 
(Consensus::CheckForExistingTakeoverRequest): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] 
(Consensus::CheckForExistingTakeoverRequest): {1} Calling 'ReadTakeoverRequest'
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] 
(Consensus::ReadTakeoverRequest): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] 
(Consensus::ReadTakeoverRequest): Calling 'Get'
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Get): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Get): Executing 
command: '/opt/opensaf/osaf-etcd3.plugin get "takeover_request"'
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Get): Read ''
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (Consensus::PromoteThisNode): 
{1} Calling 'Lock'
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Lock): ENTERED
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Lock): Executing 
command: '/opt/opensaf/osaf-etcd3.plugin lock "SC-1" 0'
Jun 29 22:42:23 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin lock "SC-1" 0', returning 0
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO Active controller set to SC-1
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO Running 
'/usr/lib64/opensaf/opensaf_sc_active' with 0 argument(s)
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO Switched to ACTIVE from Undefined
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (Consensus::IsEnabled): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (Consensus::MonitorLock): 
ENTERED, Calling 'WatchLock'
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (Consensus::IsEnabled): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] 
(Consensus::MonitorTakeoverRequest): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] 
(Consensus::MonitorTakeoverRequest): Calling 'Watch'
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::WatchLockFunction): 
ENTERED
Jun 29 22:42:24 node-sc1 osaffmd[8333]: NO Controller promoted. Stop 
supervision timer
Jun 29 22:42:24 node-sc1 osaffmd[8333]: NO Starting activation supervision: 
300000ms
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (Consensus::IsEnabled): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::WatchLockFunction): 
{1} Executing command: '/opt/opensaf/osaf-etcd3.plugin watch_lock'
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::WatchKeyFunction): 
ENTERED
Jun 29 22:42:24 node-sc1 osafimmnd[8368]: NO IMMD service is UP ... 
ScAbsenseAllowed?:0 introduced?:0
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (Consensus::IsWritable): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (Consensus::IsWritable): {1} 
Calling 'Set'
Jun 29 22:42:24 node-sc1 osafimmd[8350]: NO MDS event from svc_id 24 (change:3, 
dest:13)
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::Set1): ENTERED
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::Set1): Executing 
command: '/opt/opensaf/osaf-etcd3.plugin set "opensaf_write_test" "SC-1" 0'
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): ENTERED
Jun 29 22:42:24 node-sc1 osafimmd[8350]: NO MDS event from svc_id 24 (change:5, 
dest:13)
Jun 29 22:42:24 node-sc1 osafimmd[8350]: NO MDS event from svc_id 25 (change:3, 
dest:564113889566896)
Jun 29 22:42:24 node-sc1 osafimmd[8350]: NO Waiting 3 seconds to allow IMMND 
MDS attachments to get processed.
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::WatchKeyFunction): 
{1} Executing command: '/opt/opensaf/osaf-etcd3.plugin watch "takeover_request"'
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): ENTERED
Jun 29 22:42:24 node-sc1 osafimmnd[8368]: NO SERVER STATE: IMM_SERVER_ANONYMOUS 
--> IMM_SERVER_CLUSTER_WAITING
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO [D] (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin set "opensaf_write_test" "SC-1" 0', returning 0
Jun 29 22:42:24 node-sc1 osafrded[8317]: NO Connectivity to consensus service 
established


-dh

From: Gary Lee <gary....@dektech.com.au>
Sent: Monday, June 29, 2020 7:45 PM
To: Hoyt, David <dh...@rbbn.com>; opensaf-users@lists.sourceforge.net
Subject: Re: consensus behaviour

________________________________
NOTICE: This email was received from an EXTERNAL sender
________________________________

Hi David

Did you really reboot SC-1, or was it partially firewalled and etcd is still 
able to communicate.

Gary
________________________________
From: Hoyt, David <dh...@rbbn.com<mailto:dh...@rbbn.com>>
Sent: 30 June 2020 01:46
To: 
opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net> 
<opensaf-users@lists.sourceforge.net<mailto:opensaf-users@lists.sourceforge.net>>
Cc: Gary Lee <gary....@dektech.com.au<mailto:gary....@dektech.com.au>>
Subject: consensus behaviour


Hi all,



I’m seeing some weird behavior with the opensaf consensus service and just 
wanted to get some feedback/explanation from the osaf experts.



Setup:

2 nodes: SC-1,SC-2

Running opensaf-5.19.10



Virtualization: kvm

Operating System: Red Hat Enterprise Linux Server 7.8 (Maipo)

Kernel: Linux 3.10.0-1127.el7.x86_64

Architecture: x86-64





The consenus service is provided by 3 etcd processes running on 3 separate 
nodes. Opensaf is NOT co-located with any of the etcd processes.

As a result, I’m making use of the etcd3.plugin file that comes with the 
opensaf source code distribution.



I found that opensaf initially establishes a connection with etcd and if I 
perform a si-swap of the opensaf controllers, I see that etcd gets updated 
appropriately with the name of the active opensaf controller.

The problem I’m seeing is if I reboot the node with the active opensaf 
controller, the logs show the standby controller detects the loss of the active 
and, as expected, the standby attempts to go active. Now, as part of this, I 
see logs showing interaction with etcd. However, the ‘takeover’ request fails 
and so the standby controller also goes for a reboot.

This is very consistent and reproducible.



Here’s a snapshot of the logs from the original standby showing the interaction 
with etcd:

NOTE:

(1)  the etcd3.plugin file has been renamed as: /opt/opensaf/osaf-etcd3.plugin

(2)  I created an osaf patch for osaffmd that simply has some consensus logs 
always turned “on”.



Jun 28 19:39:53 node-sc2 osafdtmd[11290]: ER recv() from node 0x2010f failed, 
errno=110

Jun 28 19:39:53 node-sc2 osafdtmd[11290]: NO Lost contact with 'SC-1'

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO AVD down on: 2010f

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO AMFND down on: 2010f

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO FM down on: 2010f

Jun 28 19:39:53 node-sc2 osafamfd[11454]: NO Node 'SC-1' is down. Start 
failover delay timer

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO IMMD down on: 2010f

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO IMMND down on: 2010f

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Core services went down on node_id: 
2010f

Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO MDS event from svc_id 24 
(change:1, dest:13)

Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO MDS event from svc_id 24 
(change:6, dest:13)

Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO MDS event from svc_id 25 
(change:4, dest:564113889574248)

Jun 28 19:39:53 node-sc2 osafamfd[11454]: NO Start timer for '2010f'

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Node Down event for node id 2010f:

Jun 28 19:39:53 node-sc2 osafrded[11334]: NO Peer down on node 0x2010f

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Current role: STANDBY

Jun 28 19:39:53 node-sc2 osaffmd[11350]: Rebooting OpenSAF NodeId = 131343 EE 
Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131855, 
SupervisionTime = 60

Jun 28 19:39:53 node-sc2 osafimmd[11367]: WA IMMD lost contact with peer IMMD 
(NCSMDS_RED_DOWN)

Jun 28 19:39:53 node-sc2 osafimmd[11367]: WA IMMND DOWN on active controller 
2010f detected at standby immd!! 2020f. Possible failover

Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO Skipping re-send of fevs message 
2776 since it has recently been resent.

Jun 28 19:39:53 node-sc2 osafimmd[11367]: NO Skipping re-send of fevs message 
2777 since it has recently been resent.

Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA DISCARD DUPLICATE FEVS 
message:2776

Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA Error code 2 returned for message 
type 82 - ignoring

Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA DISCARD DUPLICATE FEVS 
message:2777

Jun 28 19:39:53 node-sc2 osafimmnd[11385]: WA Error code 2 returned for message 
type 82 - ignoring

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Controller Failover: Setting role 
to ACTIVE

Jun 28 19:39:53 node-sc2 osaffmd[11350]: NO Starting consensus service 
supervision: 20 s

Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO 
(Consensus::CheckForExistingTakeoverRequest): {1} Calling 'ReadTakeoverRequest'

Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'

Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 1

Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: ''

Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest): 
Could not read takeover request (7)

Jun 28 19:39:57 node-sc2 osaffmd[11350]: NO (Consensus::PromoteThisNode): {1} 
Calling 'Lock'

Jun 28 19:39:59 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin lock "SC-2" 0', returning 1

Jun 28 19:39:59 node-sc2 osaffmd[11350]: NO Locked failed: SC-1

Jun 28 19:39:59 node-sc2 osaffmd[11350]: NO (Consensus::CurrentActive): {1} 
Calling 'LockOwner'

Jun 28 19:40:00 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin lock_owner', returning 0

Jun 28 19:40:00 node-sc2 osaffmd[11350]: NO Current active controller is SC-1

Jun 28 19:40:00 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest): 
{1} Calling 'Create'

Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin create "takeover_request" "SC-1 SC-2 1 NEW" 
20', returning 0

Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest): 
{3} Calling 'ReadTakeoverRequest'

Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'

Jun 28 19:40:01 node-sc2 osafimmnd[11385]: WA MDS Send Failed to service:IMMND 
rc:2

Jun 28 19:40:01 node-sc2 osafimmnd[11385]: ER ERR_NO_RESOURCES: SearchNext - 
Problem in sending to peer IMMND over MDS. Aborting searchNext.

Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0

Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1 
NEW'

Jun 28 19:40:01 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult): 
Found 'SC-1 SC-2 1 NEW'

Jun 28 19:40:02 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest): 
{3} Calling 'ReadTakeoverRequest'

Jun 28 19:40:02 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'

Jun 28 19:40:03 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0

Jun 28 19:40:03 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1 
NEW'

Jun 28 19:40:03 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult): 
Found 'SC-1 SC-2 1 NEW'

Jun 28 19:40:04 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest): 
{3} Calling 'ReadTakeoverRequest'

Jun 28 19:40:04 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'

Jun 28 19:40:05 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0

Jun 28 19:40:05 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1 
NEW'

Jun 28 19:40:05 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult): 
Found 'SC-1 SC-2 1 NEW'

Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (Consensus::CreateTakeoverRequest): 
{3} Calling 'ReadTakeoverRequest'

Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (Consensus::ReadTakeoverRequest): 
Calling 'Get'

Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (KeyValue::Execute): Executed 
'/opt/opensaf/osaf-etcd3.plugin get "takeover_request"', returning 0

Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (KeyValue::Get): Read: 'SC-1 SC-2 1 
REJECTED'

Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO (Consensus::WriteTakeoverResult): 
Found 'SC-1 SC-2 1 REJECTED'

Jun 28 19:40:06 node-sc2 osaffmd[11350]: NO Takeover request rejected

Jun 28 19:40:06 node-sc2 osaffmd[11350]: WA Takeover request failed (14)

Jun 28 19:40:06 node-sc2 osaffmd[11350]: ER A controller is already active. We 
were separated from the cluster?

Jun 28 19:40:06 node-sc2 osaffmd[11350]: Quick local node rebooting, Reason: A 
controller is already active. We were separated from the cluster?





Regards,

David



________________________________
Notice: This e-mail together with any attachments may contain information of 
Ribbon Communications Inc. that is confidential and/or proprietary for the sole 
use of the intended recipient. Any review, disclosure, reliance or distribution 
by others or forwarding without express permission is strictly prohibited. If 
you are not the intended recipient, please notify the sender immediately and 
then delete all copies, including any attachments.
________________________________

_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users

Reply via email to