Re: [Architecture] WSO2 Message Broker - What Happens If a Node in Broker Cluster Crashed (Kill -9)

Hasitha Hiranya Tue, 17 Dec 2013 08:21:33 -0800

Hi Eranda and all,

I have developed a patch that resolves the following.


1. A broker kill is no longer a problem. Following the strategy described
earlier in this thread, the other nodes clear the persisted state of the
killed node.
2. The edge case I mentioned is also handled, by cross-referencing the
stored nodes against the actual nodes reported by Zookeeper (when a node
joins the cluster) and clearing up the stale state.
3. Consider the following scenario: MB1 starts (node id 0) >> MB2 starts >>
MB3 starts >> MB2 killed >> MB3 killed >> MB1 killed. Now suppose MB1 is
started again. ZK may choose an id other than 0 for it. The tricky
situation that can then arise is also handled by the above logic.
4. Tested for queues (developer testing): works fine. Topics, durable
topics etc. still need to be tested. There are a lot of test cases one can
imagine with the possible combinations.
5. I also made some modifications to the cluster manager, so I think the
startup time of MB is slightly improved as well!
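
To illustrate items 2 and 3, here is a minimal sketch of the cross-referencing
idea. It is not the patch itself: the function names, the plain lists of ids
and the clear_state callback are assumptions made for illustration, standing
in for the persisted node list and the membership reported by Zookeeper.

```python
def find_stale_nodes(stored_node_ids, live_node_ids):
    """Return ids that are still persisted but are not live in ZooKeeper."""
    return sorted(set(stored_node_ids) - set(live_node_ids))


def reconcile_on_join(stored_node_ids, live_node_ids, clear_state):
    """Clear persisted state for every stale node; return the ids cleaned up."""
    stale = find_stale_nodes(stored_node_ids, live_node_ids)
    for node_id in stale:
        # clear_state is a hypothetical hook for the per-node cleanup
        clear_state(node_id)
    return stale


# Scenario from item 3: ids 0, 1 and 2 are still persisted after all three
# brokers were killed, and MB1 rejoins with a different id (3 is an
# arbitrary value picked for this example).
cleared = []
stale = reconcile_on_join(
    stored_node_ids=[0, 1, 2],
    live_node_ids=[3],
    clear_state=cleared.append,
)
```

Because the comparison is done on ids rather than on which broker is
starting, the old id 0 of MB1 is treated like any other dead node.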

Thanks


On Sun, Dec 15, 2013 at 10:35 PM, Eranda Sooriyabandara <[email protected]> wrote:

> Hi Hasitha,
>
>
> On Fri, Dec 13, 2013 at 5:22 PM, Hasitha Hiranya <[email protected]> wrote:
>
>> Hi,
>>
>> There can be instances where a broker node in an MB cluster crashes when
>> failover between MB nodes in the cluster is configured, or while load
>> balancing with an F5 hardware load balancer.
>>
>> What Happens If a Node in Broker Cluster Crashed (Kill -9)?
>>
>> The previous behavior was that messages were completely lost until the
>> same node came back up. The reason is that we persist subscription
>> information and destination queue to node queue mappings in the Cassandra
>> DB, which is shared across all the nodes. When a broker shuts down
>> gracefully it closes its subscriptions, updates Cassandra and goes away,
>> so the other nodes know that this node's subscriptions are gone and that
>> the node itself is gone.
>>
>> But when a node crashes, Cassandra is not updated, hence the other nodes
>> think this node is still there and has active subscriptions.
>>
>> In order to solve the problem, I implemented the following:
>>
>> 1. A Zookeeper node existence listener is registered on every other node
>> for each node in the MB cluster.
>> 2. When a node crashes, ZK notifies all the nodes.
>> 3. One particular node that is still alive clears up the state in
>> Cassandra on behalf of the disappeared node. It does the following (the
>> node chosen is the one with the minimum ZK id when the ids of all living
>> nodes are sorted).
>>
>> 3.1 remove the node id from the nodes list (all nodes will do this)
>> 3.2 remove this node's queue from all destination queues and update the
>> in-memory map
>> 3.3 update subscription counts
>> 3.4 remove the topic subscriptions of the disappeared node and update the
>> in-memory map. If a subscription is durable, mark it as having no
>> exclusive subscription; otherwise delete the entry.
>> 3.5 notify the other nodes that queue subscriptions changed in the
>> cluster so they will update their memory
>> 3.6 notify the other nodes that topic subscriptions changed in the
>> cluster so they will update their memory
>>
>> 4. As the last step, move all the messages in the node queue of the
>> deleted node back to the global queue.
>>
>> 5. Redistribute the shared global queue worker threads among the active
>> nodes.
>>
>> This way most of the scenarios will work without message loss.
>> But if the last remaining node in the cluster is also killed, then when
>> the first node comes up again we will hit a stopping point once more (but
>> IMO that is fair).
>>
>
> If we can identify the startup of the first cluster node, we can do step 3
> at the beginning. WDYT?
>
> thanks
> Eranda
>
>
>
>
>>
>> Thanks
>>
>> --
>> *Hasitha Abeykoon*
>> Software Engineer; WSO2, Inc.; http://wso2.com
>> *cell:* *+94 719363063*
>> *blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
>>
>>
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>
>
> --
>
> *Eranda Sooriyabandara*
> Senior Software Engineer;
> Integration Technologies Team;
> WSO2 Inc.; http://wso2.com
> Lean . Enterprise . Middleware
>
> E-mail: eranda AT wso2.com
> Mobile: +94 716 472 816
> Linked-In: http://www.linkedin.com/in/erandasooriyabandara
> Blog: http://emsooriyabandara.blogspot.com/
>
>
>
>
>
>
>

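For readers following the quoted strategy, the coordinator choice in step 3
and the cleanup in steps 3.1 to 3.3 and 4 can be sketched roughly as below.
This is only an illustration under assumptions: the dictionary stands in for
the state shared through Cassandra, the key names and the NodeQueue_<id>
naming are hypothetical, and the subscription count is simplified to one
subscription per node queue. Topic handling (3.4) and the cluster
notifications (3.5, 3.6) are left out.

```python
from collections import deque


def is_cleanup_coordinator(my_id, live_node_ids):
    """The living node with the smallest ZooKeeper id performs the cleanup."""
    return my_id == min(live_node_ids)


def clean_up_crashed_node(crashed_id, state):
    """Apply steps 3.1-3.3 and 4 to an in-memory stand-in for shared state."""
    # 3.1 remove the node id from the nodes list
    state["nodes"].discard(crashed_id)

    node_queue = f"NodeQueue_{crashed_id}"  # hypothetical naming scheme
    for destination, node_queues in state["destination_to_node_queues"].items():
        # 3.2 detach the crashed node's queue from the destination queue
        node_queues.discard(node_queue)
        # 3.3 update the subscription count (simplified: one per node queue)
        state["subscription_counts"][destination] = len(node_queues)

    # 4. move the crashed node's pending messages back to the global queue
    pending = state["node_queue_messages"].pop(node_queue, deque())
    state["global_queue"].extend(pending)


state = {
    "nodes": {0, 1, 2},
    "destination_to_node_queues": {"orders": {"NodeQueue_1", "NodeQueue_2"}},
    "subscription_counts": {"orders": 2},
    "node_queue_messages": {"NodeQueue_1": deque(["m1", "m2"])},
    "global_queue": deque(["m0"]),
}

live = [0, 2]  # node 1 was killed
if is_cleanup_coordinator(0, live):
    clean_up_crashed_node(1, state)
```

Picking the minimum id means every surviving node reaches the same decision
from the same membership list, so no extra election round is needed.
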

-- 
*Hasitha Abeykoon*
Software Engineer; WSO2, Inc.; http://wso2.com
*cell:* *+94 719363063*
*blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>

