Hi,

There can be instances where a broker node in an MB cluster crashes when
failover between MB nodes is configured, or while doing load balancing
with an F5 hardware load balancer.

What Happens If a Node in the Broker Cluster Crashes (kill -9)?

The previous behavior was that messages were completely lost until the same
node came back up. The reason is that we persist subscription information
and destination-queue-to-node-queue mappings in a Cassandra DB which is
shared across all the nodes. When a broker shuts down gracefully it closes
its subscriptions, updates Cassandra, and goes away, so the other nodes are
up-to-date that this node and its subscriptions are gone.

But when a node crashes, Cassandra is not updated, hence the other nodes
think the node is still there and still has active subscriptions.
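To make the failure mode concrete, here is a minimal sketch in plain Java
(with hypothetical names, not the actual MB code) showing why a kill -9
leaves stale state behind: graceful shutdown runs cleanup against the
shared store, while a crash never reaches that code path.

```java
import java.util.*;

// Illustration only (hypothetical names, not the actual MB code): graceful
// shutdown removes the node's entries from the shared store, but a crash
// never runs that cleanup, so other nodes keep seeing the dead node.
public class StaleStateDemo {
    // Stands in for the Cassandra-backed shared subscription state.
    static Map<String, List<String>> subscriptionsByNode = new HashMap<>();

    static void register(String nodeId, List<String> subs) {
        subscriptionsByNode.put(nodeId, new ArrayList<>(subs));
    }

    // Graceful shutdown: the node cleans up after itself.
    static void gracefulShutdown(String nodeId) {
        subscriptionsByNode.remove(nodeId);
    }

    public static void main(String[] args) {
        register("node1", List.of("queueA"));
        register("node2", List.of("queueB"));

        gracefulShutdown("node1"); // node1 leaves cleanly
        // node2 dies with kill -9: no cleanup code runs, so its entry
        // remains in the shared store and looks like a live subscriber.

        System.out.println(subscriptionsByNode.keySet());
    }
}
```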

To solve this problem, I implemented the following:

1. A ZooKeeper node-existence listener is registered on every node for each
other node in the MB cluster.
2. When a node crashes, ZK notifies all the remaining nodes.
3. One of the live nodes clears up the state in Cassandra on behalf of the
disappeared node (the chosen node is the one with the minimum ZK id when
the ids of all living nodes are sorted). It does the following:

3.1 Remove the node id from the nodes list (all nodes do this).
3.2 Remove the crashed node's node queue from all destination queues and
update the in-memory map.
3.3 Update the subscription counts.
3.4 Remove the topic subscriptions of the disappeared node and update the
in-memory map. If a subscription is durable, mark it as having no exclusive
subscriber; otherwise delete the entry.
3.5 Notify the other nodes that queue subscriptions in the cluster have
changed so they update their in-memory state.
3.6 Notify the other nodes that topic subscriptions in the cluster have
changed so they update their in-memory state.

4. As the last step, move all the messages in the crashed node's node queue
back to the global queue.

5. Redistribute shared global queue worker threads among active nodes.
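Two pieces of the recovery logic above can be sketched as plain Java (the
helper names are hypothetical, not the actual MB implementation): picking
the cleanup coordinator by minimum ZK id, and round-robin redistribution of
global queue workers among the surviving nodes (one plausible balancing
strategy; the real policy may differ).

```java
import java.util.*;

// Sketch (hypothetical names, not the actual MB code) of the recovery steps:
// choosing the node that cleans up Cassandra, and redistributing global
// queue worker assignments among the nodes that are still alive.
public class CrashRecoverySketch {

    // The cleanup coordinator is the living node with the minimum ZK id.
    static long chooseCleanupNode(Collection<Long> livingZkIds) {
        return Collections.min(livingZkIds);
    }

    // Round-robin assignment of global queues to the active nodes.
    static Map<Long, List<String>> redistribute(List<String> globalQueues,
                                                List<Long> activeNodes) {
        Map<Long, List<String>> assignment = new LinkedHashMap<>();
        for (Long node : activeNodes) {
            assignment.put(node, new ArrayList<>());
        }
        for (int i = 0; i < globalQueues.size(); i++) {
            assignment.get(activeNodes.get(i % activeNodes.size()))
                      .add(globalQueues.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Node with ZK id 2 crashed; nodes 1, 3 and 5 remain alive.
        List<Long> living = List.of(5L, 1L, 3L);
        System.out.println("cleanup node: " + chooseCleanupNode(living)); // 1

        // Global queues get re-balanced across the survivors.
        System.out.println(redistribute(
                List.of("q1", "q2", "q3", "q4"), List.of(1L, 3L, 5L)));
    }
}
```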

This way most scenarios work without message loss.
But if the last remaining node in the cluster is also killed, then when the
first node comes back up we hit the same problem again (though IMO that is
fair).

Thanks

-- 
*Hasitha Abeykoon*
Software Engineer; WSO2, Inc.; http://wso2.com
*cell:* *+94 719363063*
*blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
