Hi Tom,

Sorry for the delayed response. Please find below details for the questions that you asked.
Also, I was able to recreate the issue in a 3-node cluster, although the behavior was a little different from the geo-cluster. In a 3-node cluster, say Nodes A, B, and C, with Node A as the cluster leader and Node B as the shard leader for all shards: when Node B was isolated and then rejoined, it did not join the cluster and remained in the "Isolated Leader" state. All 3 nodes displayed "quarantined address is still unreachable or has not been restarted" for close to 50 minutes, after which Node B was able to rejoin. I have created an upstream ticket: https://jira.opendaylight.org/browse/CONTROLLER-1817

Thanks,
Chethana

> On Feb 27, 2018, at 8:13 PM, Tom Pantelis <tompante...@gmail.com> wrote:
>
> On Fri, Feb 16, 2018 at 12:42 AM, Chethana Lakshmanappa <cheth...@luminanetworks.com> wrote:
> Hi All,
>
> Kindly need your input on some of the behavior seen in a Geo cluster setup when a node is isolated and un-isolated.
>
> Suppose the Geo cluster has nodes A, B, and C residing in one primary data center, which is voting, and D, E, and F residing in a secondary data center, which is non-voting:
> If a node is isolated, let's say Node B, then immediately all nodes in the cluster become unreachable to each other.
>
> That is odd. How do you know that all nodes became unreachable to each other? The log excerpt below just indicates that 10.18.130.105 lost reachability with 10.18.130.103 (Node B, I assume), which is expected. The message "Leader can currently not perform its duties" means that the akka cluster leader cannot allow nodes to be added to or removed from the cluster until the lost node comes back or is downed.

[Chethana] I was checking the "http://{{controller-ip}}:{{restconf-port}}/jolokia/read/akka:type=Cluster" API.
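As an aside, the response from that MBean can be parsed programmatically to track reachability during these tests. A minimal sketch (the sample payload below mirrors the "Leader"/"Unreachable" fields shown in this thread; the "value" wrapper is jolokia's standard read-response envelope, and in a real test you would fetch the JSON over HTTP instead of hardcoding it):

```python
import json

# Sample of the akka:type=Cluster MBean read, as returned by jolokia.
# Addresses are taken from the outputs quoted in this thread.
sample = json.loads("""
{
  "value": {
    "Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",
    "Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.73:2550"
  }
}
""")

cluster = sample["value"]
leader = cluster["Leader"]
# "Unreachable" is a single comma-separated string, so split it into a list
unreachable = [addr for addr in cluster["Unreachable"].split(",") if addr]

print("leader:", leader)
print("unreachable count:", len(unreachable))
for addr in unreachable:
    print(" -", addr)
```

Polling this per node during an isolation test makes it easy to diff which members each node considers unreachable over time.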
When Node 2 is isolated, it has the below message, which is correct:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.120:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550,akka.tcp://opendaylight-cluster-data@10.18.131.43:2550,akka.tcp://opendaylight-cluster-data@10.18.131.41:2550,akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.131.42:2550",

Node 1, this is correct, as only Node 2 is down:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.120:2550",

Node 3, here both Node 1 and Node 2 are unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550",

Node 4, here all primary nodes are unreachable. In some runs, secondary nodes are also listed as unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.131.41:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",

Node 5, here all primary nodes are unreachable. In some runs, secondary nodes are also listed as unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.131.41:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",

Node 6, here all primary nodes are unreachable. In some runs, secondary nodes are also listed as unreachable:

"Leader": "akka.tcp://opendaylight-cluster-data@10.18.131.41:2550",
"Unreachable": "akka.tcp://opendaylight-cluster-data@10.18.130.73:2550,akka.tcp://opendaylight-cluster-data@10.18.130.120:2550,akka.tcp://opendaylight-cluster-data@10.18.130.32:2550",

> All nodes wait for a threshold amount of time before marking Node B as quarantined, after which reachability within the cluster is restored.
>
> What is the threshold amount of time it needs to wait?
>
> If the node goes down or is stopped, this behavior is not seen. It is seen only when the node is isolated. How is this different from the node being down?
>
> Log excerpt from Node A when Node B is isolated:
>
> 130.103:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2018-02-15 19:53:56,109 | INFO | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
>
> If a Shard Leader is isolated (let's say Node A is made the shard leader for all shards and data stores), then on isolating and un-isolating Node A, I see the following:
>
> Primary voting nodes are unreachable to secondary nodes and vice versa.
> The cluster never recovers, and all nodes need to be restarted to get the cluster working. Is this a bug?
> Also, the isolated node, once un-isolated, is unreachable to the primary voting nodes and never recovers.
>
> It may be that, on un-isolation, split brain occurred in akka with 2 cluster leaders. I assume that Node A was the akka cluster leader when it was isolated; it would be interesting to see if this also occurs when a non-cluster-leader node is isolated.
>
> Also make sure you do not have the auto-down-unreachable-after option enabled in akka.conf.

[Chethana] I tried with a non-cluster-leader node and I see the same behavior.

> Log excerpt:
>
> 2018-02-15 19:32:47,174 | INFO | lt-dispatcher-19 | kka://opendaylight-cluster-data) | 113 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.18.131.27:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (1), akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Terminated [Terminated] (4), akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.103:2550: Unreachable [Terminated] (3), akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 -> akka.tcp://opendaylight-cluster-data@10.18.130.84:2550: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@10.18.130.103:2550 Down seen=false, akka.tcp://opendaylight-cluster-data@10.18.130.105:2550 WeaklyUp seen=true, akka.tcp://opendaylight-cluster-data@10.18.130.84:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.18.131.27:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.31:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.18.131.39:2550 Up seen=true]
>
> If a Cluster Leader is isolated, then a "DataStoreUnavailableException: Shard member-2-shard-default-config currently has no leader" exception is seen on nodes where COMMIT fails:
>
> Transactions done during this threshold time fail as there is no leader. Is this acceptable? (The threshold time is sometimes very long.)
>
> Transactions will fail if there is no shard leader, although it does make every attempt with timeouts and retries. But at some point it gives up.

[Chethana] Is this expected then?

> Also, when the isolated node is un-isolated, sometimes the cluster does not recover and all nodes need to be restarted. Is this a bug?

[Chethana] I see sometimes that the cluster does not recover after the cluster leader is isolated and rejoins; the only option is to restart all the nodes. How do we resolve this?
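On the auto-down-unreachable-after point mentioned above, this is what the relevant fragment of akka.conf looks like when the option is disabled. (The odl-cluster-data wrapper is how OpenDaylight packages its akka configuration; the exact structure may vary by release, so treat this as an illustrative sketch.)

```hocon
odl-cluster-data {
  akka {
    cluster {
      # Automatic downing of unreachable members can cause split brain when
      # an isolated node rejoins; it should be absent or set to "off".
      auto-down-unreachable-after = off
    }
  }
}
```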
> Log excerpt on Node F:
>
> 2018-02-15 19:54:12,625 | ERROR | a-change-notif-0 | MdSalHelper | 96 - com.luminanetworks.lsc.app.lsc-app-nodecounter-impl - 1.0.0.SNAPSHOT | DataStore Tx encountered error TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-6-shard-default-operational currently has no leader. Try again later.]]}
>
> If a follower is isolated and un-isolated, the shard leader is re-elected. The cluster already had a shard leader, so should re-election happen?
>
> It can happen if the follower is able to send out a RequestVote after un-isolation. From the follower's perspective there is no leader, so it tries to become leader; this is the way RAFT works.
>
> Thanks,
> Chethana
>
> _______________________________________________
> controller-dev mailing list
> controller-dev@lists.opendaylight.org
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
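[Editorial addendum] The RAFT re-election behavior described above can be sketched in a few lines. This is a toy illustration of the general RAFT rule, not the actual controller code: a follower that has not heard a heartbeat within its election timeout increments its term, votes for itself, and requests votes from its peers, which is exactly what a rejoining isolated follower does even though the rest of the cluster already has a leader. All names here are illustrative.

```python
# Toy model of RAFT follower behavior on election timeout (illustrative only).
class Follower:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers          # the other cluster members
        self.term = 1
        self.state = "Follower"

    def on_election_timeout(self, grant_vote):
        """Called when no leader heartbeat (AppendEntries) arrived in time."""
        self.term += 1              # start a new, higher term
        self.state = "Candidate"
        votes = 1                   # a candidate always votes for itself
        for peer in self.peers:     # send RequestVote to every peer
            if grant_vote(peer, self.term):
                votes += 1
        # a strict majority of the full cluster (peers + self) wins
        if votes > (len(self.peers) + 1) // 2:
            self.state = "Leader"
        return self.state

# A rejoining node's RequestVote carries a higher term; if peers grant it
# (e.g. its log is at least as up to date), a new leader is elected.
node = Follower("member-2", peers=["member-1", "member-3"])
print(node.on_election_timeout(grant_vote=lambda peer, term: True))  # prints: Leader
```

The higher term in the RequestVote is also why the existing leader steps down: a RAFT node that sees a higher term reverts to follower.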