Re: [Architecture] RDBMS based coordinator election algorithm for MB

Imesh Gunaratne Thu, 04 Aug 2016 22:53:24 -0700

Hi Asitha/Asanka,

I think it is clear that the issue we have here is mostly related to
Hazelcast.


Now to solve that problem I think it would be better to go ahead with a
generic leader election system for the entire platform rather than writing
one specific to MB. This requirement is there in several other products and
for some a database driven approach might not work.

Therefore it would be better if we can decouple this from the product and
use an interface to talk to a leader election module. This module can
either be implemented as a separate component or utilize an existing system
such as etcd.
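
To make the decoupling idea concrete, here is a rough sketch of what such an interface could look like. The names `LeaderElector`, `try_acquire`, `current_leader`, `resign` and `InMemoryLeaderElector` are only illustrative assumptions, not an existing API in any of our products. A real backend (RDBMS, etcd, Consul) would sit behind the same contract.

```python
from abc import ABC, abstractmethod
from typing import Optional


class LeaderElector(ABC):
    """Hypothetical product-facing contract; all names are illustrative."""

    @abstractmethod
    def try_acquire(self, node_id: str) -> bool:
        """Return True if node_id holds leadership after this call."""

    @abstractmethod
    def current_leader(self) -> Optional[str]:
        """Return the current leader's id, or None if there is none."""

    @abstractmethod
    def resign(self, node_id: str) -> None:
        """Give up leadership if node_id currently holds it."""


class InMemoryLeaderElector(LeaderElector):
    """Single-process stand-in, useful only for tests and examples."""

    def __init__(self) -> None:
        self._leader: Optional[str] = None

    def try_acquire(self, node_id: str) -> bool:
        # The first caller wins; later callers succeed only if they already lead.
        if self._leader is None:
            self._leader = node_id
        return self._leader == node_id

    def current_leader(self) -> Optional[str]:
        return self._leader

    def resign(self, node_id: str) -> None:
        # Only the current leader can release leadership.
        if self._leader == node_id:
            self._leader = None
```

A product would then depend only on `LeaderElector`, and the choice between a database-backed or etcd-backed implementation becomes a deployment decision.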

To start with, I think it would be better to evaluate what etcd and Consul
have to offer and check whether they fit our requirements.

Thanks

On Fri, Aug 5, 2016 at 10:12 AM, Asanka Abeyweera <[email protected]> wrote:

> Hi Imesh,
>
> On Fri, Aug 5, 2016 at 7:33 AM, Imesh Gunaratne <[email protected]> wrote:
>
>>
>>
>> On Fri, Aug 5, 2016 at 7:31 AM, Imesh Gunaratne <[email protected]> wrote:
>>>
>>>
>>> You can see here [3] how K8S has implemented a leader election feature for
>>> the products deployed on top of it to use.
>>>
>>
>> Correction: Please refer to [4].
>>
>>
>>>
>>>
>>>> On Thu, Aug 4, 2016 at 7:27 PM, Asanka Abeyweera <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Imesh,
>>>>>
>>>>> We are not implementing this to overcome a limitation in the
>>>>> coordination algorithm available in Hazelcast. We are implementing
>>>>> this since we need an RDBMS based coordination algorithm (not a
>>>>> network based algorithm).
>>>>>
>>>>
>>> Are you saying that database connections do not use the same network
>>> used by Hazelcast?
>>>
>>
> Yes. This is most problematic when two interfaces are used for Hazelcast
> communication and RDBMS communication. Additionally, there is an edge case
> even when a single interface is used for both Hazelcast and RDBMS
> communication. When a cluster merges after a network segmentation, there can
> be a delay in Hazelcast detecting the cluster merge. If the database is
> accessed by multiple coordinators during this time, there can be message
> delivery issues like message duplication. Therefore we cannot ignore this
> issue even when the same network is used for Hazelcast and database
> connections.
> 
>
>
>>>>> The reason is, a network based election algorithm will always elect
>>>>> multiple leaders when the network is partitioned. But if we use an RDBMS
>>>>> based algorithm this will not happen.
>>>>>
>>>>
>>> I do not think your argument is correct. If there is a problem with the
>>> network, it may apply to both Hazelcast based solution and database based
>>> solution.
>>>
>>> [4] http://blog.kubernetes.io/2016/01/simple-leader-election
>>> -with-Kubernetes.html
>>>
>>> Thanks
>>>
>>>>
>>>>>
>>>>> On Thu, Aug 4, 2016 at 7:16 PM, Imesh Gunaratne <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Asanka,
>>>>>>
>>>>>> Do we really need to implement a leader election algorithm on our
>>>>>> own? AFAIU this is a complex problem which has been already solved by
>>>>>> several algorithms [1]. IMO it would be better to go ahead with an
>>>>>> existing well established implementation on etcd [1] or Consul [2].
>>>>>>
>>>>>> Those provide HTTP APIs for clients to make leader election calls.
>>>>>> [3] is a client library written in Node.js for etcd based leader
>>>>>> election.
>>>>>>
>>>>>> [1] https://www.projectcalico.org/using-etcd-for-elections
>>>>>> [2] https://www.consul.io/docs/guides/leader-election.html
>>>>>> [3] https://www.npmjs.com/package/etcd-leader
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Aug 3, 2016 at 5:12 PM, Asanka Abeyweera <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Maninda,
>>>>>>>
>>>>>>> Since we are using the RDBMS to poll the node status, the cluster will
>>>>>>> not end up in situations 1, 2 or 3. With this approach we consider a node
>>>>>>> unreachable when it cannot access the database. Therefore an unreachable
>>>>>>> node can never be the leader.
>>>>>>>
>>>>>>> As you have mentioned, we are currently using the RDBMS as an atomic
>>>>>>> global variable to create the coordinator entry.
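
To illustrate the "atomic global variable" idea, below is a minimal sketch that uses a primary key so that only one node can create the coordinator entry. The table name `coordinator`, its columns and the function names are assumptions made for this example, and SQLite is used only so that the snippet runs on its own. It is not the schema from the POC.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # One row per cluster; the primary key turns the row into a single shared slot.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coordinator ("
        "cluster_id TEXT PRIMARY KEY, node_id TEXT NOT NULL)"
    )
    conn.commit()


def try_become_coordinator(
    conn: sqlite3.Connection, cluster_id: str, node_id: str
) -> bool:
    """Try to create the coordinator entry; only the first insert succeeds."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO coordinator (cluster_id, node_id) VALUES (?, ?)",
                (cluster_id, node_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Another node already holds the entry for this cluster.
        return False
```

The database's uniqueness constraint does the arbitration here, so two nodes racing on the insert cannot both see success.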
>>>>>>>
>>>>>>> On Tue, Aug 2, 2016 at 5:22 PM, Maninda Edirisooriya <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Asanka,
>>>>>>>>
>>>>>>>> As I understand, the accuracy of electing the leader correctly is
>>>>>>>> dependent on the election mechanism with RDBMS because there can be
>>>>>>>> edge cases like,
>>>>>>>>
>>>>>>>> 1. Unreachable leader activates during the election process: Then
>>>>>>>> who becomes the leader?
>>>>>>>> 2. The elected leader becomes unreachable before the election is
>>>>>>>> completed: Then will there be a situation where there is no leader?
>>>>>>>> 3. A leader and a set of nodes are disconnected from the other part
>>>>>>>> of the cluster and while the leader is trying to remove unreachable
>>>>>>>> members the other part is calling an election to make a leader: Who
>>>>>>>> will win?
>>>>>>>>
>>>>>>>> An RDBMS based election algorithm should handle such cases without
>>>>>>>> bringing the cluster to an inconsistent state or deadlock in all
>>>>>>>> concurrent cases. If all these kinds of cases cannot be handled, isn't it
>>>>>>>> better to keep the current Hazelcast clustering and use the RDBMS only
>>>>>>>> to handle the split brain scenario? In other words, when a new Hazelcast
>>>>>>>> leader is elected it should be updated in the RDBMS. If another split
>>>>>>>> party has already elected a leader, the node that is going to write it
>>>>>>>> to the RDBMS should avoid updating it. Simply, the RDBMS can be used as
>>>>>>>> an atomic global variable to keep the leader name by modifying the
>>>>>>>> Hazelcast clustering. WDYT?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>> *Maninda Edirisooriya*
>>>>>>>> Senior Software Engineer
>>>>>>>>
>>>>>>>> *WSO2, Inc.*lean.enterprise.middleware.
>>>>>>>>
>>>>>>>> *Blog* : http://maninda.blogspot.com/
>>>>>>>> *E-mail* : [email protected]
>>>>>>>> *Skype* : @manindae
>>>>>>>> *Twitter* : @maninda
>>>>>>>>
>>>>>>>> On Thu, Jul 28, 2016 at 4:38 PM, Asanka Abeyweera <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Akila,
>>>>>>>>>
>>>>>>>>> Let me explain the issue in a different way. Let's assume the MB
>>>>>>>>> nodes are using two different network interfaces for Hazelcast
>>>>>>>>> communication and database communication. With such a configuration,
>>>>>>>>> a failure can affect only the network interface used for Hazelcast
>>>>>>>>> communication on some nodes. When this happens, there will be two or
>>>>>>>>> more Hazelcast clusters due to the network segmentation, and as a
>>>>>>>>> result there will be multiple coordinators. Since every node still has
>>>>>>>>> access to the database, multiple coordinators can affect the
>>>>>>>>> correctness of the data stored in the DB. But if we used an RDBMS based
>>>>>>>>> approach we won't have multiple coordinators due to a network partition
>>>>>>>>> in Hazelcast. This is one advantage we get from this approach.
>>>>>>>>>
>>>>>>>>> Even when we use Zookeeper or RAFT the same issue will be there
>>>>>>>>> since we are using different interfaces for Hazelcast communication
>>>>>>>>> and DB communication.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 28, 2016 at 2:56 PM, Akila Ravihansa Perera <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> What's the advantage of using RDBMS (even as an alternative) to
>>>>>>>>>> implement a leader/coordinator election? If the network connection
>>>>>>>>>> to the DB fails then this will be a single point of failure. I don't
>>>>>>>>>> think we can scale RDBMS instances and expect the election algorithm
>>>>>>>>>> to work. That would be reducing this problem to another problem
>>>>>>>>>> (electing a coordinator RDBMS instance).
>>>>>>>>>>
>>>>>>>>>> IMHO it would be better to look at Zookeeper Atomic Broadcast
>>>>>>>>>> (ZAB) [1] or RAFT leader election [2] algorithms, which already have
>>>>>>>>>> proven results.
>>>>>>>>>>
>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0
>>>>>>>>>> [2] http://libraft.io/
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 28, 2016 at 1:42 PM, Nandika Jayawardana <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 to make it a common component. We have the clustering
>>>>>>>>>>> implementation for the BPEL component based on Hazelcast. If the
>>>>>>>>>>> coordination is available at RDBMS level, we can remove the
>>>>>>>>>>> Hazelcast dependency.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Nandika
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 28, 2016 at 1:28 PM, Hasitha Aravinda <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Can we make it a common component, which is not tightly coupled
>>>>>>>>>>>> with MB? BPS has the same requirement.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Hasitha.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 28, 2016 at 9:47 AM, Asanka Abeyweera <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In MB, we have used a coordinator based approach to manage the
>>>>>>>>>>>>> distributed messaging algorithm in the cluster. Currently
>>>>>>>>>>>>> Hazelcast is used to elect the coordinator. But one issue we faced
>>>>>>>>>>>>> with Hazelcast is that, during a network segmentation (split
>>>>>>>>>>>>> brain), Hazelcast can elect two or more coordinators in the
>>>>>>>>>>>>> cluster. This affects the correctness of the distributed messaging
>>>>>>>>>>>>> algorithm since there are some tables in the database that should
>>>>>>>>>>>>> only be edited by a single node (i.e. the coordinator).
>>>>>>>>>>>>>
>>>>>>>>>>>>> As a solution to this problem we have implemented a minimum node
>>>>>>>>>>>>> count based approach [1] to deactivate a set of partitioned nodes,
>>>>>>>>>>>>> which stops multiple nodes becoming coordinators until the network
>>>>>>>>>>>>> segmentation issue is fixed.
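
As a rough illustration of a minimum node count check (not the actual MB-1664 implementation), a node can compare the number of members it currently sees with a configured threshold. The function names `majority` and `should_stay_active` are made up for this sketch. Choosing the threshold as a strict majority is one way to make sure two disjoint partitions cannot both stay active.

```python
def majority(cluster_size: int) -> int:
    """Smallest member count that two disjoint partitions cannot both reach."""
    return cluster_size // 2 + 1


def should_stay_active(visible_members: int, minimum_node_count: int) -> bool:
    """A partition keeps working only if it still sees enough members."""
    return visible_members >= minimum_node_count
```

For example, in a 5 node cluster with a threshold of `majority(5)`, a 3/2 split leaves only the 3 node side active, so at most one side can have a coordinator.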
>>>>>>>>>>>>>
>>>>>>>>>>>>> As an alternative solution, we are thinking of implementing an
>>>>>>>>>>>>> RDBMS based approach to elect the coordinator node in the cluster.
>>>>>>>>>>>>> By doing this we can make sure that even during a network
>>>>>>>>>>>>> segmentation only one node will be elected as the coordinator
>>>>>>>>>>>>> node, since the election is happening through the database.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The algorithm will use a polling mechanism to check the
>>>>>>>>>>>>> validity of the nodes. To make the election algorithm scalable,
>>>>>>>>>>>>> only the coordinator node will check the status of all the nodes
>>>>>>>>>>>>> in the cluster, and it will inform other nodes through the
>>>>>>>>>>>>> database when a member is added or has left. The other nodes will
>>>>>>>>>>>>> only check the status of the coordinator node. When a node detects
>>>>>>>>>>>>> that the coordinator is invalid it will go for an election to
>>>>>>>>>>>>> elect a new coordinator.
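
The polling and election part described above could be sketched roughly as below. This is only an assumption of how it might look, not the POC code: the table `coordinator_heartbeat`, the `HEARTBEAT_TIMEOUT` value and the `poll` function are hypothetical, SQLite is used only to keep the snippet self-contained, and the membership tracking done by the coordinator is left out. The caller passes in `now`, so clock handling is also left as an assumption.

```python
import sqlite3

HEARTBEAT_TIMEOUT = 5.0  # seconds; assumed value for this sketch


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coordinator_heartbeat ("
        "cluster_id TEXT PRIMARY KEY, "
        "node_id TEXT NOT NULL, "
        "last_heartbeat REAL NOT NULL)"
    )
    conn.commit()


def poll(conn: sqlite3.Connection, cluster_id: str, node_id: str, now: float) -> bool:
    """Run one polling round; return True if node_id is coordinator afterwards."""
    with conn:  # one transaction per polling round
        # Coordinator path: refresh our own heartbeat.
        cur = conn.execute(
            "UPDATE coordinator_heartbeat SET last_heartbeat = ? "
            "WHERE cluster_id = ? AND node_id = ?",
            (now, cluster_id, node_id),
        )
        if cur.rowcount == 1:
            return True
        # Election, case 1: no coordinator entry exists yet.
        cur = conn.execute(
            "INSERT OR IGNORE INTO coordinator_heartbeat "
            "(cluster_id, node_id, last_heartbeat) VALUES (?, ?, ?)",
            (cluster_id, node_id, now),
        )
        if cur.rowcount == 1:
            return True
        # Election, case 2: take over only if the heartbeat is stale.
        cur = conn.execute(
            "UPDATE coordinator_heartbeat SET node_id = ?, last_heartbeat = ? "
            "WHERE cluster_id = ? AND last_heartbeat < ?",
            (node_id, now, cluster_id, now - HEARTBEAT_TIMEOUT),
        )
        return cur.rowcount == 1
```

Each node would call `poll` on a timer. Because the takeover is a single conditional update, only one of several nodes that notice a stale coordinator can win the election.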
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are currently working on a POC to test how this works with
>>>>>>>>>>>>> MB's slot based messaging algorithm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://wso2.org/jira/browse/MB-1664
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Asanka Abeyweera
>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>> WSO2 Inc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Phone: +94 712228648
>>>>>>>>>>>>> Blog: a5anka.github.io
>>>>>>>>>>>>>
>>>>>>>>>>>>> <https://wso2.com/signature>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Architecture mailing list
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> --
>>>>>>>>>>>> Hasitha Aravinda,
>>>>>>>>>>>> Associate Technical Lead,
>>>>>>>>>>>> WSO2 Inc.
>>>>>>>>>>>> Email: [email protected]
>>>>>>>>>>>> Mobile : +94 718 210 200
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nandika Jayawardana
>>>>>>>>>>> WSO2 Inc ; http://wso2.com
>>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Akila Ravihansa Perera
>>>>>>>>>> WSO2 Inc.;  http://wso2.com/
>>>>>>>>>>
>>>>>>>>>> Blog: http://ravihansa3000.blogspot.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Asanka Abeyweera
>>>>>>>>> Senior Software Engineer
>>>>>>>>> WSO2 Inc.
>>>>>>>>>
>>>>>>>>> Phone: +94 712228648
>>>>>>>>> Blog: a5anka.github.io
>>>>>>>>>
>>>>>>>>> <https://wso2.com/signature>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Asanka Abeyweera
>>>>>>> Senior Software Engineer
>>>>>>> WSO2 Inc.
>>>>>>>
>>>>>>> Phone: +94 712228648
>>>>>>> Blog: a5anka.github.io
>>>>>>>
>>>>>>> <https://wso2.com/signature>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Imesh Gunaratne*
>>>>>> Software Architect
>>>>>> WSO2 Inc: http://wso2.com
>>>>>> T: +94 11 214 5345 M: +94 77 374 2057
>>>>>> W: https://medium.com/@imesh TW: @imesh
>>>>>> lean. enterprise. middleware
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Asanka Abeyweera
>>>>> Senior Software Engineer
>>>>> WSO2 Inc.
>>>>>
>>>>> Phone: +94 712228648
>>>>> Blog: a5anka.github.io
>>>>>
>>>>> <https://wso2.com/signature>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ramith Jayasinghe
>>>> Technical Lead
>>>> WSO2 Inc., http://wso2.com
>>>> lean.enterprise.middleware
>>>>
>>>> E: [email protected]
>>>> P: +94 772534930
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Imesh Gunaratne*
>>> Software Architect
>>> WSO2 Inc: http://wso2.com
>>> T: +94 11 214 5345 M: +94 77 374 2057
>>> W: https://medium.com/@imesh TW: @imesh
>>> lean. enterprise. middleware
>>>
>>>
>>
>>
>> --
>> *Imesh Gunaratne*
>> Software Architect
>> WSO2 Inc: http://wso2.com
>> T: +94 11 214 5345 M: +94 77 374 2057
>> W: https://medium.com/@imesh TW: @imesh
>> lean. enterprise. middleware
>>
>>
>>
>>
>
>
> --
> Asanka Abeyweera
> Senior Software Engineer
> WSO2 Inc.
>
> Phone: +94 712228648
> Blog: a5anka.github.io
>
> <https://wso2.com/signature>
>
>
>


-- 
*Imesh Gunaratne*
Software Architect
WSO2 Inc: http://wso2.com
T: +94 11 214 5345 M: +94 77 374 2057
W: https://medium.com/@imesh TW: @imesh
lean. enterprise. middleware

