This is an automated email from the ASF dual-hosted git repository.
hulk pushed a commit to branch unstable
in repository https://gitbox.apache.org/repos/asf/kvrocks.git
The following commit(s) were added to refs/heads/unstable by this push:
new ab41cbb6 fix(replication): potential deadlock when switching master
frequently (#2516)
ab41cbb6 is described below
commit ab41cbb69d081711437e25846c86c920cb5253e1
Author: hulk <[email protected]>
AuthorDate: Mon Sep 2 15:10:02 2024 +0800
fix(replication): potential deadlock when switching master frequently
(#2516)
This closes #2512.
Currently, the replication thread will wait for the worker's exclusive
guard stop before closing db.
But it now stops the worker from running new commands after acquiring the
worker's exclusive guard,
and it might cause deadlock if switches at the same time.
The following steps will show how it may happen:
- T0: client A sent `slaveof MASTER_IP0 MASTER_PORT0`, then the replication
thread was started and waiting for the exclusive guard.
- T1: client B sent `slaveof MASTER_IP1 MASTER_PORT1` and `AddMaster` will
stop the previous replication thread, which is waiting for the exclusive guard.
But the exclusive guard is acquired by the current thread.
The workaround is also straightforward, just stop workers from running new
commands by enabling `is_loading_` to
true before acquiring the lock in the replication thread.
---
src/server/server.cc | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/src/server/server.cc b/src/server/server.cc
index 3381ffbc..393a70d8 100644
--- a/src/server/server.cc
+++ b/src/server/server.cc
@@ -1353,14 +1353,16 @@ void Server::PrepareRestoreDB() {
// accessing, data migration task should be stopped before restoring DB
WaitNoMigrateProcessing();
+ // Workers will disallow to run commands which may access DB, so we should
+ // enable this flag to stop workers from running new commands. And wait for
+ // the exclusive guard to be released to guarantee no worker is running.
+ is_loading_ = true;
+
// To guarantee work threads don't access DB, we should release
'ExclusivityGuard'
// ASAP to avoid user can't receive responses for long time, because the
following
// 'CloseDB' may cost much time to acquire DB mutex.
LOG(INFO) << "[server] Waiting workers for finishing executing commands...";
- {
- auto exclusivity = WorkExclusivityGuard();
- is_loading_ = true;
- }
+ { auto exclusivity = WorkExclusivityGuard(); }
// Cron thread, compaction checker thread, full synchronization thread
// may always run in the background, we need to close db, so they don't
actually work.