On 3/18/24 17:15, Felix Huettner wrote:
> On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
>> Each cluster member typically always transfers leadership to the same
>> other member, which is the first in their list of servers.  This may
>> result in two servers in a 3-node cluster transferring leadership to
>> each other and never to the third one.
>>
>> Randomize the selection to make the load more evenly distributed.
> 
> Hi Ilya,

Hi, Felix.  Thanks for the comments!

> 
> just out of curiosity: since basically only one of the 3 members is
> active at any point in time, is balancing the load even relevant?  It
> will always be on only one of the 3 members anyway.

It is not very important, I agree.  What I observed in practice is
that sometimes, if compactions happen at approximately the same time,
for example, the server we transfer the leadership to may send it
right back while the first server is still busy compacting.  This is
less of a problem today, since we have parallel compaction, but it may
still be annoying if it happens every time.

I'm mostly making this patch for the purpose of the better testing
described below.

> 
>>
>> This also makes cluster failure tests cover more scenarios as servers
>> will transfer leadership to servers they didn't before.  This is
>> important especially for cluster joining tests.
>>
>> Ideally, we would transfer to a random server with the highest apply
>> index, but I'm not trying to implement that for now.
>>
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
>>  ovsdb/raft.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index f463afcb3..25f462431 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>          return;
>>      }
>>  
>> +    size_t n = hmap_count(&raft->servers) * 3;
>>      struct raft_server *s;
>> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>> +
>> +    while (n--) {
>> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
>> +                         struct raft_server, hmap_node);
>>          if (!uuid_equals(&raft->sid, &s->sid)
>>              && s->phase == RAFT_PHASE_STABLE) {
>>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> 
> I think this has the risk of never selecting one server out of the list of
> cluster members. Suppose you have a 3-node cluster where one of the
> members is down. In this case there is a single member the leadership
> can be transferred to.
> This means a single iteration of the while loop has a chance of 2/3
> of selecting a member that cannot be used. Over the 9 iterations this
> would do, that gives a chance of (2/3)^9 of always choosing an
> inappropriate member. This equals a chance of (2/3)^9 = 512/19683,
> i.e. roughly 2.6%, of never selecting the single appropriate target member.
> 
> Could we instead start iterating the hmap at some random offset?
> That would give a similar result while guaranteeing that we visit
> each member once.

I don't think it's very important, because we're transferring leadership
without verifying whether the other side is alive, and we're not checking
whether the transfer actually succeeded.  So, the retries are basically
just to avoid the server picking itself or servers that haven't joined
yet.  We will transfer to the first other server that has already joined,
regardless of the iteration method.
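
That said, if we wanted the guarantee you describe, a random starting
offset is easy enough.  Just as an untested sketch (it assumes the
random_range() helper from lib/random.h):

    /* Untested sketch: walk every server exactly once, starting from a
     * random offset, so a single suitable candidate is never missed. */
    size_t n = hmap_count(&raft->servers);
    size_t skip = n ? random_range(n) : 0;
    struct hmap_node *node = hmap_first(&raft->servers);
    size_t i;

    while (node && skip--) {
        node = hmap_next(&raft->servers, node);
    }

    for (i = 0; node && i < n; i++) {
        struct raft_server *s = CONTAINER_OF(node, struct raft_server,
                                             hmap_node);
        if (!uuid_equals(&raft->sid, &s->sid)
            && s->phase == RAFT_PHASE_STABLE) {
            /* ... transfer leadership to 's' as in the patch above ... */
            break;
        }
        node = hmap_next(&raft->servers, node);
        if (!node) {
            node = hmap_first(&raft->servers);  /* Wrap around. */
        }
    }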

The way to mostly fix the issue with dead servers is, as I mentioned, to
transfer only to the servers with the highest apply index, i.e. the servers
that have acknowledged the largest number of changes.  This would ensure
that we at least heard something from the server in the recent past.  But
that's a separate feature to implement.
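
If someone wants to pick that up, the idea would be roughly the following
(only a sketch; it assumes the leader-side per-server field tracking the
highest acknowledged log index, which I'm calling 'match_index' here, and
it reuses random_range() to break ties randomly):

    /* Sketch: pick a random server among those with the highest
     * acknowledged index, i.e. the most up-to-date followers. */
    struct raft_server *s, *best = NULL;
    uint64_t best_index = 0;
    size_t n_ties = 0;

    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
        if (uuid_equals(&raft->sid, &s->sid)
            || s->phase != RAFT_PHASE_STABLE) {
            continue;
        }
        if (!best || s->match_index > best_index) {
            best = s;
            best_index = s->match_index;
            n_ties = 1;
        } else if (s->match_index == best_index && !random_range(++n_ties)) {
            /* Reservoir sampling: every tied server ends up selected
             * with equal probability. */
            best = s;
        }
    }
    /* 'best', if non-NULL, is the server to transfer leadership to. */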

Also, the leadership transfer is just an optimization to speed up elections,
so this operation doesn't need to succeed for correctness.  If we fail to
transfer, or transfer to a dead server, the rest of the cluster will notice
the absence of a leader and initiate an election after the election timeout.

Best regards, Ilya Maximets.