Re: [Architecture] RDBMS based coordinator election algorithm for MB

Asitha Nanayakkara Thu, 04 Aug 2016 20:54:06 -0700

Hi Imesh,

On Fri, Aug 5, 2016 at 7:33 AM, Imesh Gunaratne <[email protected]> wrote:


>
>
> On Fri, Aug 5, 2016 at 7:31 AM, Imesh Gunaratne <[email protected]> wrote:
>>
>>
>> You can see here [3] how K8S has implemented leader election feature for
>> the products deployed on top of that to utilize.
>>
>
> Correction: Please refer [4].
>
>
>>
>>
>>> On Thu, Aug 4, 2016 at 7:27 PM, Asanka Abeyweera <[email protected]>
>>> wrote:
>>>
>>>> Hi Imesh,
>>>>
>>>> We are not implementing this to overcome a limitation in the
>>>> coordination algorithm available in the Hazlecast. We are implementing this
>>>> since we need an RDBMS based coordination algorithm (not a network based
>>>> algorithm).
>>>>
>>>
>> Are you saying that database connections do not use the same network
>> used by Hazelcast?
>> 
>>
>>
>>> The reason is, a network based election algorithm will always elect
>>>> multiple leaders when the network is partitioned. But if we use a RDBMS
>>>> based algorithm this will not happen.
>>>>
>>>
>> I do not think your argument is correct. If there is a problem with the
>> network, it may apply to both Hazelcast based solution and database based
>> solution.
>>
>
Yes, if the same network interface is used network partion will cause all
types of connections to be partitioned. But user can use multiple network
interfaces for database, Hazelcast and thrift.

Following is the scenario we are trying to solve in MB.

In MB all the details related to messages, subscriptions, queues, topics
etc are stored in database. And we operate depending on that information.
If the MB node can't connect to the database that means the node is
ineffective in the cluster until it can make a database connection.

We have seen instances where Hazelcast cluster get partitioned for some
time period in networks, Reasons were,

   1. Due to heavy load Hazelcast couldn't process or send (some times
   both) hearbeats, hence a network partition for Hazelcast cluster
   2. An actual network partition of Hazelcast cluster

In both scenarios the database connection was working. In that case we get
two coordinators elected through Hazelcast and working on the same database
to deliver the messages. this leads to inconsistencies in the cluster
behavior (for instances duplicate message delivery, corrupred subscription
states etc) .

Since the point of interest for MB is the database, we decided to do the
coordinator election through database as well. If the node can't connect to
the database, then the MB won't operate anyway.

Regards,
Asitha


>> [4] http://blog.kubernetes.io/2016/01/simple-leader-election
>> -with-Kubernetes.html
>>
>> Thanks
>>
>>>
>>>>
>>>> On Thu, Aug 4, 2016 at 7:16 PM, Imesh Gunaratne <[email protected]> wrote:
>>>>
>>>>> Hi Asanka,
>>>>>
>>>>> Do we really need to implement a leader election algorithm on our own?
>>>>> AFAIU this is a complex problem which has been already solved by several
>>>>> algorithms [1]. IMO it would be better to go ahead with an existing well
>>>>> established implementation on etcd [1] or Consul [2].
>>>>>
>>>>> Those provide HTTP APIs for clients to make leader election calls. [3]
>>>>> is a client library written in Node.js for etcd based leader election.
>>>>>
>>>>> [1] https://www.projectcalico.org/using-etcd-for-elections
>>>>> [2] https://www.consul.io/docs/guides/leader-election.html
>>>>> [3] https://www.npmjs.com/package/etcd-leader
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Aug 3, 2016 at 5:12 PM, Asanka Abeyweera <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Maninda,
>>>>>>
>>>>>> Since we are using RDBMS to poll the node status, the cluster will
>>>>>> not end up in situation 1,2 or 3. With this approach we consider a node
>>>>>> unreachable when it cannot access the database. Therefore an unreachable
>>>>>> node can never be the leader.
>>>>>>
>>>>>> As you have mentioned, we are currently using the RDBMS as an atomic
>>>>>> global variable to create the coordinator entry.
>>>>>>
>>>>>> On Tue, Aug 2, 2016 at 5:22 PM, Maninda Edirisooriya <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Asanka,
>>>>>>>
>>>>>>> As I understand the accuracy of electing the leader correctly is
>>>>>>> dependent on the election mechanism with RDBMS because there can be edge
>>>>>>> cases like,
>>>>>>>
>>>>>>> 1. Unreachable leader activates during the election process: Then
>>>>>>> who becomes the leader?
>>>>>>> 2. The elected leader becomes unreachable before the election is
>>>>>>> completed: Then will there be a situation where there is no leader?
>>>>>>> 3. A leader and a set of nodes are disconnected from the other part
>>>>>>> of the cluster and while the leader is trying to remove unreachable 
>>>>>>> members
>>>>>>> other part is calling an election to make a leader: Who will win?
>>>>>>>
>>>>>>> RDBMS based election algorithm should handle such cases without
>>>>>>> bringing the cluster to an inconsistent state or dead lock in all
>>>>>>> concurrent cases. If all these kind of cases cannot be handled isn't it
>>>>>>> better to keep the current hazelcast clustering and use the RDBMS only 
>>>>>>> to
>>>>>>> handle the split brain scenario? In other words when a new hazelcast 
>>>>>>> leader
>>>>>>> is elected it should be updated in the RDBMS. If another split party has
>>>>>>> already elected a leader, the node who is going to write it to RDBMS 
>>>>>>> should
>>>>>>> avoid updating it. Simply, the RDBMS can be used as an atomic global
>>>>>>> variable to keep the leader name by modifying the hazelcast clustering.
>>>>>>> WDYT?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> *Maninda Edirisooriya*
>>>>>>> Senior Software Engineer
>>>>>>>
>>>>>>> *WSO2, Inc.*lean.enterprise.middleware.
>>>>>>>
>>>>>>> *Blog* : http://maninda.blogspot.com/
>>>>>>> *E-mail* : [email protected]
>>>>>>> *Skype* : @manindae
>>>>>>> *Twitter* : @maninda
>>>>>>>
>>>>>>> On Thu, Jul 28, 2016 at 4:38 PM, Asanka Abeyweera <[email protected]
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Hi Akila,
>>>>>>>>
>>>>>>>> Let me explain the issue in a different way. Let's assume the MB
>>>>>>>> nodes are using two different network interfaces for Hazelcast
>>>>>>>> communication and database communication. With such a configuration, 
>>>>>>>> there
>>>>>>>> can be failures only in the network interface used for Hazelcast
>>>>>>>> communication in some nodes. When this happens, there will be two or 
>>>>>>>> more
>>>>>>>> Hazelcast clusters due to the network segmentation, and as a result 
>>>>>>>> there
>>>>>>>> will be multiple coordinators. Since every node still have access to 
>>>>>>>> the
>>>>>>>> database, multiple coordinators can affect the correctness of the data
>>>>>>>> stored in the DB. But if we used a RDBMS based approach we won't have
>>>>>>>> multiple coordinators due to a network partition in Hazelcast. This is 
>>>>>>>> one
>>>>>>>> advantage we get from this approach.
>>>>>>>>
>>>>>>>> Even when we use Zookeeper or RAFT the same issue will be there
>>>>>>>> since we are using different interfaces for Hazelcast communication 
>>>>>>>> and DB
>>>>>>>> communication.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 28, 2016 at 2:56 PM, Akila Ravihansa Perera <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> What's the advantage of using RDBMS (even as an alternative) to
>>>>>>>>> implement a leader/coordinator election? If the network connection to 
>>>>>>>>> DB
>>>>>>>>> fails then this will be a single point of failure. I don't think we 
>>>>>>>>> can
>>>>>>>>> scale RDBMS instances and expect the election algorithm to work. That 
>>>>>>>>> would
>>>>>>>>> be reducing this problem to another problem (electing coordinator 
>>>>>>>>> RDBMS
>>>>>>>>> instance).
>>>>>>>>>
>>>>>>>>> IMHO it would be better to look at Zookeeper Atomic Broadcast
>>>>>>>>> (ZAB) [1] or RAFT leader election [2] algorithms which have already 
>>>>>>>>> proven
>>>>>>>>> results.
>>>>>>>>>
>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0
>>>>>>>>> [2] http://libraft.io/
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 28, 2016 at 1:42 PM, Nandika Jayawardana <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +1 to make it a common component . We have the clustering
>>>>>>>>>> implementation for BPEL component based on hazelcast.  If the 
>>>>>>>>>> coordination
>>>>>>>>>> is available at RDBMS level, we can remove hazelcast dependancy.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Nandika
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 28, 2016 at 1:28 PM, Hasitha Aravinda <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Can we make it a common component, which is not hard coupled
>>>>>>>>>>> with MB. BPS has the same requirement.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Hasitha.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 28, 2016 at 9:47 AM, Asanka Abeyweera <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> In MB, we have used a coordinator based approach to manage
>>>>>>>>>>>> distributed messaging algorithm in the cluster. Currently 
>>>>>>>>>>>> Hazelcast is used
>>>>>>>>>>>> to elect the coordinator. But one issue we faced with Hazelcast 
>>>>>>>>>>>> is, during
>>>>>>>>>>>> a network segmentation (split brain), Hazelcast can elect two or 
>>>>>>>>>>>> more
>>>>>>>>>>>> coordinators in the cluster. This affects the correctness of the
>>>>>>>>>>>> distributed messaging algorithm since there are some tables in the 
>>>>>>>>>>>> database
>>>>>>>>>>>> that should only be edited by a single node (i.e. coordinator).
>>>>>>>>>>>>
>>>>>>>>>>>> As a solution to this problem we have implemented minimum node
>>>>>>>>>>>> count based approach [1] to deactivate set of partitioned nodes to 
>>>>>>>>>>>> stop
>>>>>>>>>>>> multiple nodes becoming coordinators until the network 
>>>>>>>>>>>> segmentation issue
>>>>>>>>>>>> is fixed.
>>>>>>>>>>>>
>>>>>>>>>>>> As an alternative solution, we are thinking of implementing an
>>>>>>>>>>>> RDBMS based approach to elect the coordinator node in the cluster. 
>>>>>>>>>>>> By doing
>>>>>>>>>>>> this we can make sure that even during a network segmentation only 
>>>>>>>>>>>> one node
>>>>>>>>>>>> will be elected as the coordinator node since the election is 
>>>>>>>>>>>> happening
>>>>>>>>>>>> through the database.
>>>>>>>>>>>>
>>>>>>>>>>>> The algorithm will use a polling mechanism to check the
>>>>>>>>>>>> validity of the nodes. To make the election algorithm scalable, 
>>>>>>>>>>>> only the
>>>>>>>>>>>> coordinator node will be checking status of all the nodes in the 
>>>>>>>>>>>> cluster
>>>>>>>>>>>> and it will inform other nodes through database when a member is
>>>>>>>>>>>> added/left. The nodes will be only checking for the status of the
>>>>>>>>>>>> coordinator node. When a node detect that coordinator is invalid 
>>>>>>>>>>>> it will go
>>>>>>>>>>>> for a election to elect a new coordinator.
>>>>>>>>>>>>
>>>>>>>>>>>> We are currently working on a POC to test how this works with
>>>>>>>>>>>> MB's slot based messaging algorithm.
>>>>>>>>>>>>
>>>>>>>>>>>> thoughts?
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://wso2.org/jira/browse/MB-1664
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Asanka Abeyweera
>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>> WSO2 Inc.
>>>>>>>>>>>>
>>>>>>>>>>>> Phone: +94 712228648
>>>>>>>>>>>> Blog: a5anka.github.io
>>>>>>>>>>>>
>>>>>>>>>>>> <https://wso2.com/signature>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Architecture mailing list
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> --
>>>>>>>>>>> Hasitha Aravinda,
>>>>>>>>>>> Associate Technical Lead,
>>>>>>>>>>> WSO2 Inc.
>>>>>>>>>>> Email: [email protected]
>>>>>>>>>>> Mobile : +94 718 210 200
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Architecture mailing list
>>>>>>>>>>> [email protected]
>>>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nandika Jayawardana
>>>>>>>>>> WSO2 Inc ; http://wso2.com
>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Architecture mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Akila Ravihansa Perera
>>>>>>>>> WSO2 Inc.;  http://wso2.com/
>>>>>>>>>
>>>>>>>>> Blog: http://ravihansa3000.blogspot.com
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Architecture mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Asanka Abeyweera
>>>>>>>> Senior Software Engineer
>>>>>>>> WSO2 Inc.
>>>>>>>>
>>>>>>>> Phone: +94 712228648
>>>>>>>> Blog: a5anka.github.io
>>>>>>>>
>>>>>>>> <https://wso2.com/signature>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Architecture mailing list
>>>>>>>> [email protected]
>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Asanka Abeyweera
>>>>>> Senior Software Engineer
>>>>>> WSO2 Inc.
>>>>>>
>>>>>> Phone: +94 712228648
>>>>>> Blog: a5anka.github.io
>>>>>>
>>>>>> <https://wso2.com/signature>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Architecture mailing list
>>>>>> [email protected]
>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Imesh Gunaratne*
>>>>> Software Architect
>>>>> WSO2 Inc: http://wso2.com
>>>>> T: +94 11 214 5345 M: +94 77 374 2057
>>>>> W: https://medium.com/@imesh TW: @imesh
>>>>> lean. enterprise. middleware
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Architecture mailing list
>>>>> [email protected]
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Asanka Abeyweera
>>>> Senior Software Engineer
>>>> WSO2 Inc.
>>>>
>>>> Phone: +94 712228648
>>>> Blog: a5anka.github.io
>>>>
>>>> <https://wso2.com/signature>
>>>>
>>>> _______________________________________________
>>>> Architecture mailing list
>>>> [email protected]
>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>>
>>>
>>>
>>> --
>>> Ramith Jayasinghe
>>> Technical Lead
>>> WSO2 Inc., http://wso2.com
>>> lean.enterprise.middleware
>>>
>>> E: [email protected]
>>> P: +94 772534930
>>>
>>> _______________________________________________
>>> Architecture mailing list
>>> [email protected]
>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>
>>>
>>
>>
>> --
>> *Imesh Gunaratne*
>> Software Architect
>> WSO2 Inc: http://wso2.com
>> T: +94 11 214 5345 M: +94 77 374 2057
>> W: https://medium.com/@imesh TW: @imesh
>> lean. enterprise. middleware
>>
>>
>
>
> --
> *Imesh Gunaratne*
> Software Architect
> WSO2 Inc: http://wso2.com
> T: +94 11 214 5345 M: +94 77 374 2057
> W: https://medium.com/@imesh TW: @imesh
> lean. enterprise. middleware
>
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
*Asitha Nanayakkara* <http://asitha.github.io/>
Senior Software Engineer
WSO2, Inc. <http://wso2.com/>
Mob: +94 77 853 0682
[image: https://wso2.com/signature] <https://wso2.com/signature>

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: [Architecture] RDBMS based coordinator election algorithm for MB

Reply via email to