Hi Hasitha,

On Fri, Dec 13, 2013 at 5:22 PM, Hasitha Hiranya <[email protected]> wrote:

> Hi,
>
> There can be instances where a broker node in an MB cluster crashes when
> failover between MB nodes is configured, or while load balancing with an
> F5 hardware load balancer.
>
> What Happens If a Node in the Broker Cluster Crashes (kill -9)?
>
> The previous behavior was that messages were completely lost until the
> same node came back up. The reason is that we persist subscription
> information and destination-queue to node-queue mappings in a Cassandra DB
> shared across all the nodes. When a broker shuts down gracefully, it
> closes its subscriptions, updates Cassandra, and goes away, so the other
> nodes know that this node and its subscriptions are gone.
>
> But when a node crashes, Cassandra is not updated, so the other nodes
> think this node is still there and has active subscriptions.
>
> To solve the problem, I implemented the following:
>
> 1. A ZooKeeper node-existence listener is registered on all other nodes
> for each node in the MB cluster.
> 2. When a node crashes, ZK notifies all the nodes.
> 3. One particular node that is still alive clears up the state in
> Cassandra on behalf of the disappeared node (the node chosen is the one
> with the minimum zk id when the ids of all living nodes are sorted). It
> does the following:
>
> 3.1 Remove the node id from the nodes list (all nodes do this).
> 3.2 Remove the node queue of the crashed node from all destination queues
> and update the in-memory map.
> 3.3 Update the subscription counts.
> 3.4 Remove the topic subscriptions of the disappeared node and update the
> in-memory map. If a subscription is durable, mark it as having no
> exclusive subscriber; otherwise delete the entry.
> 3.5 Notify the other nodes that queue subscriptions changed in the
> cluster so they update their memory.
> 3.6 Notify the other nodes that topic subscriptions changed in the
> cluster so they update their memory.
>
> 4. As the last step, move all the messages in the node queue of the
> deleted node back to the global queue.
>
> 5. Redistribute the shared global-queue worker threads among the active
> nodes.
>
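The node-selection rule in step 3 (the living node with the minimum zk id does the cleanup) could be sketched like this — the class and method names here are my own for illustration, not the actual MB code:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the cleanup-node election described in step 3:
// the surviving node with the minimum ZooKeeper id performs the cleanup
// on behalf of the crashed node.
public class CleanupElection {

    // Returns the zk id of the node responsible for the cleanup.
    public static int cleanupNodeId(List<Integer> livingNodeIds) {
        if (livingNodeIds.isEmpty()) {
            throw new IllegalStateException("no living nodes in cluster");
        }
        return Collections.min(livingNodeIds);
    }

    // True if the local node (myZkId) should run the cleanup.
    public static boolean shouldRunCleanup(int myZkId, List<Integer> livingNodeIds) {
        return myZkId == cleanupNodeId(livingNodeIds);
    }
}
```

Since every living node evaluates the same id list, exactly one of them elects itself, with no extra coordination round needed.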
> This way, most scenarios will work without message loss. But when the
> last remaining node in the cluster is also down (killed), then once the
> first node comes up again we will hit a stopping point once more (but
> IMO that is fair).
>
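Step 4 above (moving the crashed node's node-queue messages back to the global queue) can be illustrated with a minimal in-memory model — the types and names below are assumptions for illustration, not the real MB persistence layer:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative in-memory model of step 4: drain the dead node's node queue
// back into the shared global queue, preserving message order, so a
// surviving node's global-queue worker can redeliver the messages.
public class NodeQueueRecovery {

    public static void requeueToGlobal(Deque<String> deadNodeQueue,
                                       Deque<String> globalQueue) {
        while (!deadNodeQueue.isEmpty()) {
            // take from the head of the node queue, append to the global queue
            globalQueue.addLast(deadNodeQueue.pollFirst());
        }
    }
}
```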

If we can identify the first cluster node to start up, we can do step 3 at
the beginning. WDYT?
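One way to detect that (purely a sketch, with hypothetical names): when a node registers itself under the ZooKeeper membership path, if it finds only its own id there, it is the first cluster node and can run the step-3 cleanup for any stale state left behind by a full-cluster crash:

```java
import java.util.List;

// Hypothetical check, not actual MB code: a node that registers and finds
// only itself in the ZK membership list is the first node of the cluster
// and can safely clean up stale Cassandra state on startup.
public class FirstNodeCheck {

    public static boolean isFirstClusterNode(int myZkId, List<Integer> registeredIds) {
        return registeredIds.size() == 1 && registeredIds.contains(myZkId);
    }
}
```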

thanks
Eranda




>
> Thanks
>
> --
> *Hasitha Abeykoon*
> Software Engineer; WSO2, Inc.; http://wso2.com
> *cell:* *+94 719363063*
> *blog: **abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
>
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 

*Eranda Sooriyabandara*
Senior Software Engineer;
Integration Technologies Team;
WSO2 Inc.; http://wso2.com
Lean . Enterprise . Middleware

E-mail: eranda AT wso2.com
Mobile: +94 716 472 816
Linked-In: http://www.linkedin.com/in/erandasooriyabandara
Blog: http://emsooriyabandara.blogspot.com/
