Hi, There can be instances where a broker node in an MB cluster crashes while failover between MB nodes in the cluster is configured, or while load balancing with an F5 hardware load balancer.
What happens if a node in the broker cluster crashes (kill -9)? The previous behavior was that messages were completely lost until the same node came back up. The reason is that we persist subscription information and destination-queue-to-node-queue mappings in the Cassandra DB, which is shared across all the nodes. When a broker shuts down gracefully, it closes its subscriptions and updates Cassandra before going away, so the other nodes know that this node and its subscriptions are gone. But when a node crashes, Cassandra is not updated, so the other nodes think the node is still there and has active subscriptions. In order to solve the problem, I implemented the following:

1. A ZooKeeper node-existence listener is registered in all other nodes for each node in the MB cluster.
2. When a node crashes, ZK notifies all the other nodes.
3. One particular node that is still alive clears up the state in Cassandra on behalf of the disappeared node (the logic to choose that node is to pick the node with the minimum zk id when the ids of all living nodes are sorted). It does the following:
   3.1 Remove the node id from the nodes list (all nodes do this).
   3.2 Remove the crashed node's node queue from all destination queues and update the in-memory map.
   3.3 Update subscription counts.
   3.4 Remove the topic subscriptions of the disappeared node and update the in-memory map. If a subscription is durable, mark it as having no exclusive subscriber; otherwise delete the entry.
   3.5 Notify the other nodes that queue subscriptions have changed in the cluster, so they update their memory.
   3.6 Notify the other nodes that topic subscriptions have changed in the cluster, so they update their memory.
4. As the last step, move all the messages in the node queue of the deleted node back to the global queue.
5. Redistribute the shared global queue worker threads among the active nodes.

This way, most scenarios will work without message loss.
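To illustrate the node-choice logic in step 3, here is a minimal, self-contained Java sketch. Every live node gets the ZK notification, but only the one holding the minimum zk id performs the cleanup. The class and method names are hypothetical (not the actual MB code), and it assumes the usual ZooKeeper convention of zero-padded sequential znode names, so lexicographic ordering matches numeric ordering:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: deciding which live node cleans up Cassandra state
// on behalf of a crashed node. All live nodes evaluate the same rule, so
// exactly one of them (the minimum-id node) acts.
public class FailoverCleanup {

    // Assumes zero-padded sequential znode names, so string ordering
    // matches the order the znodes were created in.
    public static String pickCleanupNode(List<String> liveNodeIds) {
        return Collections.min(liveNodeIds);
    }

    // Returns true if this node is the one that should perform the cleanup.
    public static boolean shouldCleanUp(String myNodeId, List<String> liveNodeIds) {
        return myNodeId.equals(pickCleanupNode(liveNodeIds));
    }

    public static void main(String[] args) {
        List<String> live = Arrays.asList(
                "node-0000000005", "node-0000000002", "node-0000000009");
        System.out.println("cleanup node: " + pickCleanupNode(live));
    }
}
```

Because every surviving node sorts the same id list, the rule is deterministic and needs no extra coordination round beyond the ZK membership view itself.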
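Step 4 (moving the crashed node's pending messages back to the global queue) can be sketched in-memory as below. This is an illustrative simplification, not the actual MB implementation, which persists these queues in Cassandra; all names here are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory sketch of step 4: draining the crashed node's
// node queue back into the shared global queue so that surviving nodes
// can pick those messages up again.
public class NodeQueueRecovery {
    private final Map<String, Deque<String>> nodeQueues = new HashMap<>();
    private final Deque<String> globalQueue = new ArrayDeque<>();

    public void enqueueToNodeQueue(String nodeId, String message) {
        nodeQueues.computeIfAbsent(nodeId, k -> new ArrayDeque<>()).add(message);
    }

    // Move every pending message of the crashed node to the global queue.
    // Returns the number of messages recovered.
    public int recover(String crashedNodeId) {
        Deque<String> nodeQueue = nodeQueues.remove(crashedNodeId);
        if (nodeQueue == null) {
            return 0; // nothing was pending for this node
        }
        int moved = nodeQueue.size();
        globalQueue.addAll(nodeQueue); // surviving nodes consume from here
        return moved;
    }

    public int globalQueueDepth() {
        return globalQueue.size();
    }

    public static void main(String[] args) {
        NodeQueueRecovery recovery = new NodeQueueRecovery();
        recovery.enqueueToNodeQueue("node2", "m1");
        recovery.enqueueToNodeQueue("node2", "m2");
        System.out.println("moved: " + recovery.recover("node2"));
    }
}
```

The point of the design is that no message is tied permanently to the dead node: once its node queue is drained into the global queue, the redistributed worker threads (step 5) deliver the messages through whichever nodes remain.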
But when the last remaining node in the cluster is also brought down (killed), and then the first node comes back up, we hit the same losing situation again (though IMO that is a fair limitation). Thanks -- *Hasitha Abeykoon* Software Engineer; WSO2, Inc.; http://wso2.com *cell:* *+94 719363063* *blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
