Hi Tom,

Sorry for the delayed response. Please find below details for the questions that you asked.
Also, I was able to recreate the issue in a 3-node cluster, although the behavior was a little different from the geo-cluster. In a 3-node cluster, say Nodes A, B, and C, with Node A as the cluster leader and Node B as the shard leader for all shards: when Node B was isolated and then rejoined, it did not join the cluster and remained in the "Isolated Leader" state. All 3 nodes displayed "quarantined address is still unreachable or has not been restarted" for close to 50 minutes, after which Node B was able to rejoin. I have created an upstream ticket: https://jira.opendaylight.org/browse/CONTROLLER-1817

Thanks,
Chethana

> On Feb 27, 2018, at 8:13 PM, Tom Pantelis <tompante...@gmail.com> wrote:
>
> On Fri, Feb 16, 2018 at 12:42 AM, Chethana Lakshmanappa <cheth...@luminanetworks.com> wrote:
> Hi All,
>
> Kindly need your input on some of the behavior seen in a Geo cluster setup when a node is isolated and un-isolated.
>
> Suppose the Geo cluster has nodes A, B, and C residing in one primary data center, which is voting, and D, E, and F residing in a secondary data center, which is non-voting:
> If a node is isolated, let's say Node B, then immediately all nodes in the cluster become unreachable to each other.
>
> That is odd. How do you know that all nodes became unreachable to each other? The log excerpt below just indicates that 10.18.130.105 lost reachability with 10.18.130.103 (Node B, I assume), which is expected. The message "Leader can currently not perform its duties" means that the akka cluster leader cannot allow nodes to be added to or removed from the cluster until the lost node comes back or is downed.

[Chethana] I was checking the "http://{{controller-ip}}:{{restconf-port}}/jolokia/read/akka:type=Cluster" API.
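As an aside, the response from that MBean can be parsed programmatically to track reachability during these tests. A minimal sketch (the sample payload below mirrors the "Leader"/"Unreachable" fields shown in this thread; the "value" wrapper is jolokia's standard read-response envelope, and in a real test you would fetch the JSON over HTTP instead of hardcoding it):

```python
import json

# Sample of the akka:type=Cluster MBean read, as returned by jolokia.
# Addresses are taken from the outputs quoted in this thread.
sample = json.loads("""
{
  "value": {
    "Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",
    "Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.73:2550"
  }
}
""")

cluster = sample["value"]
leader = cluster["Leader"]
# "Unreachable" is a single comma-separated string, so split it into a list
unreachable = [addr for addr in cluster["Unreachable"].split(",") if addr]

print("leader:", leader)
print("unreachable count:", len(unreachable))
for addr in unreachable:
    print(" -", addr)
```

Polling this per node during an isolation test makes it easy to diff which members each node considers unreachable over time.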
When Node 2 is isolated, it has the below message, which is correct:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.120:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550,akka.tcp://opendaylight-cluster-data@10.18.131.43:2550,akka.tcp://opendaylight-cluster-data@10.18.131.41:2550,akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.131.42:2550",

Node 1, this is correct, as only Node 2 is down:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.120:2550",

Node 3, here both Node 1 and Node 2 are unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550",

Node 4, here all primary nodes are unreachable. In some runs, secondary nodes are also listed as unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.131.41:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",

Node 5, here all primary nodes are unreachable. In some runs, secondary nodes are also listed as unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.131.41:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",

Node 6, here all primary nodes are unreachable. In some runs, secondary nodes are also listed as unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.131.41:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",

> All nodes wait for a threshold amount of time before marking Node B as quarantined, after which reachability within the cluster is restored.
>
> What is the threshold amount of time it needs to wait?
>
> If the node goes down or is stopped, this behavior is not seen. It is seen only when the node is isolated. How is this different from the node being down?
>
> Log excerpt from Node A when Node B is isolated:
>
> 130.103:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2018-02-15 19:53:56,109 | INFO | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
>
> If a Shard Leader is isolated (let's say Node A is made the shard leader for all shards and data stores), then on isolating and un-isolating Node A, I see the following:
>
> Primary voting nodes are unreachable to secondary nodes and vice versa.
> The cluster never recovers, and all nodes need to be restarted to get the cluster working. Is this a bug?
> Also, the isolated node, once un-isolated, is unreachable to the primary voting nodes and never recovers.
>
> It may be that, on un-isolation, split brain occurred in akka with 2 cluster leaders. I assume that Node A was the akka cluster leader when it was isolated; it would be interesting to see if this also occurs when a non-cluster-leader node is isolated.
>
> Also make sure you do not have the auto-down-unreachable-after option enabled in akka.conf.

[Chethana] I tried with a non-cluster-leader node and I see the same behavior.

> Log excerpt:
>
> 2018-02-15 19:32:47,174 | INFO | lt-dispatcher-19 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.131.27:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (1), akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Terminated [Terminated] (4), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Down seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 WeaklyUp seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
>
> If a Cluster Leader is isolated, then a "DataStoreUnavailableException: Shard member-2-shard-default-config currently has no leader" exception is seen on nodes where COMMIT fails:
>
> Transactions done during this threshold time fail as there is no leader. Is this acceptable? (The threshold time is sometimes very long.)
>
> Transactions will fail if there is no shard leader, although it does make every attempt with timeouts and retries. But at some point it gives up.

[Chethana] Is this expected then?

> Also, when the isolated node is un-isolated, sometimes the cluster does not recover and all nodes need to be restarted. Is this a bug?

[Chethana] I see sometimes that the cluster does not recover after the cluster leader is isolated and rejoins; the only option is to restart all the nodes. How do we resolve this?
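On the auto-down-unreachable-after point mentioned above, this is what the relevant fragment of akka.conf looks like when the option is disabled. (The odl-cluster-data wrapper is how OpenDaylight packages its akka configuration; the exact structure may vary by release, so treat this as an illustrative sketch.)

```hocon
odl-cluster-data {
  akka {
    cluster {
      # Automatic downing of unreachable members can cause split brain when
      # an isolated node rejoins; it should be absent or set to "off".
      auto-down-unreachable-after = off
    }
  }
}
```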
> Log excerpt on Node F:
>
> 2018-02-15 19:54:12,625 | ERROR | a-change-notif-0 | MdSalHelper | 96 - com.luminanetworks.lsc.app.lsc-app-nodecounter-impl - 1.0.0.SNAPSHOT | DataStore Tx encountered error TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-6-shard-default-operational currently has no leader. Try again later.]]}
>
> If a follower is isolated and un-isolated, the shard leader is re-elected. The cluster already had a shard leader, so should re-election happen?
>
> It can happen if the follower is able to send out a RequestVote after un-isolation. From the follower's perspective there is no leader, so it tries to become leader; this is the way RAFT works.
>
> Thanks,
> Chethana
>
> _______________________________________________
> controller-dev mailing list
> controller-dev@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
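[Editorial addendum] The RAFT re-election behavior described above can be sketched in a few lines. This is a toy illustration of the general RAFT rule, not the actual controller code: a follower that has not heard a heartbeat within its election timeout increments its term, votes for itself, and requests votes from its peers, which is exactly what a rejoining isolated follower does even though the rest of the cluster already has a leader. All names here are illustrative.

```python
# Toy model of RAFT follower behavior on election timeout (illustrative only).
class Follower:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers          # the other cluster members
        self.term = 1
        self.state = "Follower"

    def on_election_timeout(self, grant_vote):
        """Called when no leader heartbeat (AppendEntries) arrived in time."""
        self.term += 1              # start a new, higher term
        self.state = "Candidate"
        votes = 1                   # a candidate always votes for itself
        for peer in self.peers:     # send RequestVote to every peer
            if grant_vote(peer, self.term):
                votes += 1
        # a strict majority of the full cluster (peers + self) wins
        if votes > (len(self.peers) + 1) // 2:
            self.state = "Leader"
        return self.state

# A rejoining node's RequestVote carries a higher term; if peers grant it
# (e.g. its log is at least as up to date), a new leader is elected.
node = Follower("member-2", peers=["member-1", "member-3"])
print(node.on_election_timeout(grant_vote=lambda peer, term: True))  # prints: Leader
```

The higher term in the RequestVote is also why the existing leader steps down: a RAFT node that sees a higher term reverts to follower.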