Hi Eranda and all,

I have developed a patch that resolves the following.

1. Broker kill is no longer a problem. With the strategy described earlier in this thread, the other nodes clear the persisted state of the killed node.

2. The edge case I mentioned is also handled: when a node joins the cluster, the stored nodes are cross-referenced against the actual nodes reported by Zookeeper, and the state of any node that is no longer present is cleared.

3. Consider the following scenario: MB1 start (node id 0) >> MB2 start >> MB3 start >> MB2 killed >> MB3 killed >> MB1 killed. Now suppose MB1 is started again. It is possible that ZK chooses an id other than 0. The tricky situation that can arise from this is also handled by the above logic.

4. Tested for queues (developer testing) - works fine. Topics, durable topics, etc. still need to be tested. There are a lot of test cases one can imagine once the possible combinations are considered.

5. I also made some modifications to the cluster manager, so I think the startup time of MB is a little improved as well!

Thanks

On Sun, Dec 15, 2013 at 10:35 PM, Eranda Sooriyabandara <[email protected]> wrote:

> Hi Hasitha,
>
> On Fri, Dec 13, 2013 at 5:22 PM, Hasitha Hiranya <[email protected]> wrote:
>
>> Hi,
>>
>> There can be instances where a broker node in an MB cluster crashes when
>> failover between MB nodes in a cluster is configured, or while load
>> balancing with an F5 hardware load balancer.
>>
>> What Happens If a Node in Broker Cluster Crashed (Kill -9)?
>>
>> The previous behavior was that messages were completely lost until the
>> same node came back up. The reason was that we persist subscription
>> information and destination queue to node queue mappings in a Cassandra
>> DB that is shared across all the nodes. When a broker shuts down
>> gracefully, it closes its subscriptions, updates Cassandra and goes away,
>> so the other nodes know that this node and its subscriptions are gone.
>>
>> But when a node crashes, Cassandra is not updated, so the other nodes
>> think this node is still there and still has active subscriptions.
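
The cross-referencing in points 2 and 3 of the reply can be pictured as a set difference between the node ids persisted in the shared store and the node ids Zookeeper currently reports. The sketch below is only an illustration under that assumption, not the patch itself: `find_stale_nodes`, `reconcile_on_join` and the `clear_state` callback are hypothetical names, and the real code would read the ids from Cassandra and Zookeeper.

```python
def find_stale_nodes(stored_node_ids, live_node_ids):
    """Return ids that are persisted in the store but not reported as live."""
    return sorted(set(stored_node_ids) - set(live_node_ids))


def reconcile_on_join(stored_node_ids, live_node_ids, clear_state):
    """On joining the cluster, clear the state of every stale node.

    clear_state is a hypothetical callback standing in for the cleanup
    that removes a dead node's persisted state.
    """
    stale = find_stale_nodes(stored_node_ids, live_node_ids)
    for node_id in stale:
        clear_state(node_id)
    return stale


# Scenario from point 3: ids 0, 1 and 2 were persisted, all three brokers
# were killed, and MB1 comes back with a new id (3 is an assumed value).
cleared = []
stale = reconcile_on_join([0, 1, 2], [3], cleared.append)
```

Because the comparison uses ids rather than host identity, it does not matter that the restarted MB1 received a different id: its old id simply shows up as stale and is cleaned like any other.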
>>
>> In order to solve the problem, I implemented the following:
>>
>> 1. A Zookeeper node existence listener is registered in all other nodes
>> for each node in the MB cluster.
>> 2. When a node crashes, ZK notifies all the nodes.
>> 3. One node that is still alive clears up the state in Cassandra on
>> behalf of the disappeared node. The node chosen is the one with the
>> minimum ZK id when the ids of all living nodes are sorted. It does the
>> following:
>>
>> 3.1 remove the node id from the nodes list (all nodes will do this)
>> 3.2 remove this node's queue from all destination queues and update the
>> in-memory map
>> 3.3 update subscription counts
>> 3.4 remove topic subscriptions of the disappeared node and update the
>> in-memory map. If a subscription is durable, mark it as having no
>> exclusive subscription; otherwise delete the entry.
>> 3.5 notify other nodes that queue subscriptions changed in the cluster so
>> they will update their memory
>> 3.6 notify other nodes that topic subscriptions changed in the cluster so
>> they will update their memory
>>
>> 4. As the last step, move all the messages in the node queue of the
>> deleted node back to the global queue.
>>
>> 5. Redistribute shared global queue worker threads among active nodes.
>>
>> This way most of the scenarios will work without message loss.
>> But when there is only one node left in the cluster and it also goes down
>> (is killed), then once the first node comes up again we will hit a
>> stopping point again (but IMO that is fair).
>>
>
> If we can identify the first cluster node startup, we can do step 3 at
> the beginning. WDYT?
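
As a rough illustration of steps 3 and 4 in the quoted proposal, the sketch below models the shared state with plain Python collections. It assumes a hypothetical `ClusterState` class in place of the data kept in Cassandra, and it covers only steps 3.1, 3.2 and 4. Subscription counts, topic subscriptions and the cluster notifications are left out, and none of the names come from the actual patch.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical stand-in for the state the thread says lives in Cassandra."""
    node_ids: set = field(default_factory=set)
    # destination queue name -> ids of nodes whose node queue is bound to it
    destination_queues: dict = field(default_factory=dict)
    # node id -> messages still sitting in that node's queue
    node_queue_messages: dict = field(default_factory=dict)
    global_queue: list = field(default_factory=list)


def is_cleanup_node(my_id, live_node_ids):
    """The living node with the minimum ZK id cleans up for the dead node."""
    return my_id == min(live_node_ids)


def on_node_disappeared(my_id, dead_id, live_node_ids, state):
    """Handle a 'node gone' notification; return True if this node cleaned up."""
    # Step 3.1: every node drops the dead id from the nodes list.
    state.node_ids.discard(dead_id)
    if not is_cleanup_node(my_id, live_node_ids):
        return False
    # Step 3.2: unbind the dead node's queue from all destination queues.
    for bound_nodes in state.destination_queues.values():
        bound_nodes.discard(dead_id)
    # Step 4: move the dead node's undelivered messages back to the global queue.
    state.global_queue.extend(state.node_queue_messages.pop(dead_id, []))
    return True
```

Picking the minimum id means every surviving node can decide locally, from the same list of living nodes, whether it is the one responsible, so no extra coordination message is needed for that decision.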
>
> thanks
> Eranda
>
>>
>> Thanks
>>
>> --
>> *Hasitha Abeykoon*
>> Software Engineer; WSO2, Inc.; http://wso2.com
>> *cell:* *+94 719363063*
>> *blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
>>
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>
> --
>
> *Eranda Sooriyabandara*
> Senior Software Engineer;
> Integration Technologies Team;
> WSO2 Inc.; http://wso2.com
> Lean . Enterprise . Middleware
>
> E-mail: eranda AT wso2.com
> Mobile: +94 716 472 816
> Linked-In: http://www.linkedin.com/in/erandasooriyabandara
> Blog: http://emsooriyabandara.blogspot.com/
>

--
*Hasitha Abeykoon*
Software Engineer; WSO2, Inc.; http://wso2.com
*cell:* *+94 719363063*
*blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
