Hi Hasitha,
On Fri, Dec 13, 2013 at 5:22 PM, Hasitha Hiranya <[email protected]> wrote:

> Hi,
>
> There can be instances where a broker node in an MB cluster crashes when
> failover between MB nodes in a cluster is configured, or while doing
> load balancing using an F5 hardware load balancer.
>
> What happens if a node in the broker cluster crashes (kill -9)?
>
> The previous behavior was that messages were completely lost until the
> same node came back up. The reason is that we persist subscription
> information and destination-queue-to-node-queue mappings in a Cassandra
> DB which is shared across all the nodes. When a broker shuts down
> gracefully, it closes its subscriptions, updates Cassandra, and goes
> away, so the other nodes know that both the node and its subscriptions
> are gone.
>
> But when a node crashes, Cassandra is not updated, so the other nodes
> think the node is still there and still has active subscriptions.
>
> To solve this problem, I implemented the following:
>
> 1. A ZooKeeper node-existence listener is registered on all other nodes
> for each node in the MB cluster.
> 2. When a node crashes, ZooKeeper notifies all the remaining nodes.
> 3. One node that is still alive clears up the state in Cassandra on
> behalf of the disappeared node (the logic for choosing this node is to
> take the node with the minimum ZooKeeper id after sorting the ids of
> all living nodes). It does the following:
>
> 3.1 Remove the node id from the nodes list (all nodes do this).
> 3.2 Remove this node queue from all destination queues and update the
> in-memory map.
> 3.3 Update the subscription counts.
> 3.4 Remove the topic subscriptions of the disappeared node and update
> the in-memory map. If a subscription is durable, mark it as having no
> exclusive subscriber; otherwise delete the entry.
> 3.5 Notify the other nodes that queue subscriptions changed in the
> cluster so they update their memory.
> 3.6 Notify the other nodes that topic subscriptions changed in the
> cluster so they update their memory.
>
> 4. As the last step, move all the messages in the node queue of the
> deleted node back to the global queue.
>
> 5. Redistribute the shared global queue worker threads among the
> active nodes.
>
> This way most scenarios will work without message loss.
> But when the last remaining node in the cluster is also killed, then
> when the first node comes up again we will hit the same stopping point
> (but IMO that is fair).

If we can identify the first cluster node start-up, we can do step 3 at the beginning. WDYT?

Thanks,
Eranda

> Thanks
>
> --
> *Hasitha Abeykoon*
> Software Engineer; WSO2, Inc.; http://wso2.com
> *cell:* *+94 719363063*
> *blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>

--
*Eranda Sooriyabandara*
Senior Software Engineer; Integration Technologies Team; WSO2 Inc.; http://wso2.com
Lean . Enterprise . Middleware
E-mail: eranda AT wso2.com
Mobile: +94 716 472 816
Linked-In: http://www.linkedin.com/in/erandasooriyabandara
Blog: http://emsooriyabandara.blogspot.com/
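For readers following the thread, the coordinator choice in step 3 (pick the living node with the minimum ZooKeeper id) and the message move in step 4 (drain the crashed node's node queue back into the global queue) could be sketched roughly as below. All class and method names here are hypothetical illustrations using plain Java collections, not the actual MB/Andes code.

```java
import java.util.*;

// Hypothetical sketch of the failover cleanup described in the thread.
public class NodeFailureHandler {

    // Step 3 coordinator election: the surviving node with the smallest
    // ZooKeeper id performs the Cassandra cleanup on behalf of the
    // crashed node.
    static boolean isCleanupCoordinator(int localNodeId, Set<Integer> livingNodeIds) {
        return localNodeId == Collections.min(livingNodeIds);
    }

    // Step 4: move every message buffered in the crashed node's node
    // queue back to the shared global queue so active nodes can
    // redistribute and deliver them. Returns the number of messages moved.
    static int moveNodeQueueToGlobalQueue(Queue<String> nodeQueue, Queue<String> globalQueue) {
        int moved = 0;
        String message;
        while ((message = nodeQueue.poll()) != null) {
            globalQueue.offer(message);
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        // Living nodes 3, 7, 12 remain after node 5 crashed.
        Set<Integer> living = new HashSet<>(Arrays.asList(3, 7, 12));
        System.out.println(isCleanupCoordinator(3, living));  // true: minimum id
        System.out.println(isCleanupCoordinator(7, living));  // false

        Queue<String> nodeQueue = new ArrayDeque<>(Arrays.asList("m1", "m2"));
        Queue<String> globalQueue = new ArrayDeque<>();
        System.out.println(moveNodeQueueToGlobalQueue(nodeQueue, globalQueue)); // 2
    }
}
```

Electing the minimum-id node keeps the cleanup deterministic without extra coordination: every survivor sees the same set of live ZooKeeper ids, so exactly one node concludes it is the coordinator.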
_______________________________________________ Architecture mailing list [email protected] https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
