On Fri, May 14, 2021 at 7:05 AM Ilya Maximets <[email protected]> wrote:
>
> On 5/10/21 3:16 PM, Dumitru Ceara wrote:
> > On 5/6/21 2:47 PM, Ilya Maximets wrote:
> >
> > [...]
> >
> > Oops, I meant to send this along with the acked-by:
> >
> >>
> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> >> index d2ff643b2..658a016b2 100644
> >> --- a/ovsdb/raft.c
> >> +++ b/ovsdb/raft.c
> >> @@ -4160,9 +4160,24 @@ raft_may_snapshot(const struct raft *raft)
> >>              && !raft->leaving
> >>              && !raft->left
> >>              && !raft->failed
> >> +            && raft->role != RAFT_LEADER
> >>              && raft->last_applied >= raft->log_start);
> >>  }
> >>
> >> +/* Prepares for soon snapshotting. */
> >
> > Nit: one space too many.  And I'd probably rephrase this to "Prepare for
> > snapshotting soon.", but I'm not a native speaker, I'm not sure if
> > that's better.
> >
>
> Thanks!  I'm not sure if this variant is better and I wanted to keep
> comments uniform with other functions.  So, I removed the extra space,
> but kept the text as is.
>
> Applied to master and backported down to 2.13 as this change is important
> for large scale setups.
>
> Best regards, Ilya Maximets.
Hi Ilya,

Thanks for fixing the problem, and sorry for my late response. This
approach looks reasonable and the code looks good, but I still have some
concerns; I'm not sure whether you have already thought about or tested
them.

This patch increases the chance that the leader is responsive, so that
the cluster can keep processing new transactions. (I suppose the
leadership transfer completes almost immediately, but I didn't test it.)
However, it also increases the frequency of leader elections: every
snapshot on the leader node would trigger a leader election. We set the
leader election timer to a large value so that in large-scale deployments
leader elections are not triggered unnecessarily, because each election
introduces churn in the cluster. For example, ovn-northd (and probably
many CMS client workers that write to the database and use the
leader-only option) would need to reconnect to the new leader. Also,
ongoing transactions (including those initiated by followers) would all
get disturbed and re-propagated through the new leader (I'm not sure
whether there can be failures in untested corner cases).

If the snapshot time does not exceed the response-time requirement of a
production deployment, I'd rather let the leader keep its role and finish
the snapshot instead of introducing this churn (and of course set the
election timer long enough for the snapshot to complete). If the snapshot
time does exceed the requirement, transferring the leader role sounds
reasonable, but I'm not really sure whether frequent leader elections in
a large-scale deployment would increase the chance of hitting corner-case
problems, given that we don't have enough test cases for raft leadership
transfer and its impact on ongoing transactions.

There may be even more complex scenarios to consider if we do this. What
if other nodes are snapshotting at the same time, or at very close times?
Since all the nodes' logs are in sync, it is likely that they would
snapshot close to each other. Suppose the leader tries to do it first;
then one round of snapshots may trigger the leadership transfer three
times (in a 3-node cluster). And if a follower node in a 3-node cluster
is already snapshotting, then transferring leadership would not help the
response time, because for the next couple of seconds only the new leader
is not snapshotting, while any transaction still needs at least one of
the followers to replicate the change and respond to the leader. In
addition, is there any chance that the install_snapshot procedure kicks
in, and what would the impact be?

These are just my random thoughts, and maybe some of them aren't real
problems at all, but I want to put them here and see whether they raise
any cautions or have already been ruled out.

Thanks,
Han
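P.S. For concreteness, my understanding from the diff above is that the
new hook behind the "Prepares for soon snapshotting." comment boils down
to something like the sketch below. This is only my reconstruction, not
the committed code: raft_transfer_leadership() is an existing helper in
ovsdb/raft.c, but the function name, reason string, and body here are my
guesses.

/* Sketch (my reading of the patch, not the committed code). */
void
raft_notify_snapshot_recommended(struct raft *raft)
{
    if (raft->role == RAFT_LEADER) {
        /* Snapshotting can block this server for a while, so hand
         * leadership to a follower first.  Once this server is no
         * longer the leader, raft_may_snapshot() -- which now returns
         * false for the leader -- starts returning true again. */
        raft_transfer_leadership(raft, "snapshot in progress");
    }
}

If we go the other way instead, keeping the leader and letting it finish
the snapshot, the election timer can be raised at runtime (IIRC via
"ovs-appctl cluster/change-election-timer") so that a busy leader is not
deposed mid-snapshot.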

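P.P.S. On the "all servers snapshot at nearly the same time" concern: one
mitigation that comes to mind is per-server random jitter in the snapshot
schedule, roughly like the hypothetical sketch below. This is not from
the patch; the names and constants are made up for illustration.

#include <stdlib.h>

#define SNAPSHOT_BASE_MS   (10 * 60 * 1000)  /* Wait at least 10 minutes. */
#define SNAPSHOT_JITTER_MS (10 * 60 * 1000)  /* Plus up to 10 more, at random. */

/* Hypothetical: pick the next snapshot time with random jitter, so that
 * servers whose logs grow in lockstep still snapshot at different times
 * and a 3-node cluster rarely has two servers snapshotting in the same
 * window.  (Seeding of rand() is elided.) */
static long long int
next_snapshot_time(long long int now_ms)
{
    return now_ms + SNAPSHOT_BASE_MS + rand() % SNAPSHOT_JITTER_MS;
}

With something like this in place, even when a leadership transfer does
happen, the new leader is unlikely to start its own snapshot in the same
window.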