This is an automated email from the ASF dual-hosted git repository.
mgreber pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git
The following commit(s) were added to refs/heads/master by this push:
new e9b3dae78 KUDU-3651 fix race condition in TabletReplica::Stop()
e9b3dae78 is described below
commit e9b3dae78edd84b4630e77601f285476ecae2f38
Author: Alexey Serbin <[email protected]>
AuthorDate: Mon Mar 10 22:22:40 2025 -0700
KUDU-3651 fix race condition in TabletReplica::Stop()
This patch addresses a race condition in TabletReplica::Stop(). Before
this patch, new operations might be accepted by a tablet replica right
after calling OpTracker::WaitForAllToFinish() and before completing the
shutdown of the replica's prepare pool token.
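
As a rough illustration of the ordering issue described above, the following single-threaded C++ sketch replays the problematic interleaving step by step. ToyReplica and its methods are hypothetical stand-ins invented for this example. They are not Kudu's TabletReplica, OpTracker, or pool token classes, and DrainAll() only models the moment at which a wait for in-flight operations has returned.

```cpp
#include <cassert>

// Hypothetical model (not Kudu code): one flag for "token closed" and a
// counter for operations that were accepted but have not finished yet.
struct ToyReplica {
  bool closed = false;
  int in_flight = 0;

  // Accepts a new operation unless the token has already been closed.
  bool Submit() {
    if (closed) return false;
    ++in_flight;
    return true;
  }

  // Stops accepting new operations; already accepted ones are unaffected.
  void Close() { closed = true; }

  // Models the point where waiting for all tracked operations has returned.
  void DrainAll() { in_flight = 0; }
};

// Order before the patch: wait first, stop accepting work afterwards.
// Returns the number of operations still in flight at teardown time.
int OpsLeftWithOldOrder() {
  ToyReplica r;
  bool accepted = r.Submit();  // an operation accepted before the stop
  assert(accepted);
  (void)accepted;
  r.DrainAll();                // the wait returns: nothing is in flight
  r.Submit();                  // a late operation slips in through the window
  r.Close();                   // too late: the late operation is already tracked
  return r.in_flight;
}

// Order after the patch: stop accepting work first, then wait.
int OpsLeftWithNewOrder() {
  ToyReplica r;
  bool accepted = r.Submit();  // an operation accepted before the stop
  assert(accepted);
  (void)accepted;
  r.Close();                   // no new operations can be accepted from here on
  r.DrainAll();                // the wait returns: nothing is in flight
  r.Submit();                  // the late operation is rejected
  return r.in_flight;
}
```

In this model, the old order leaves one operation in flight when teardown proceeds, while the new order leaves none. In the real code the late operation runs on another thread, which is what makes the leftover operation a data race rather than a simple counter mismatch.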
The race has shown up at least as flakiness in various test scenarios of
txn_participant-test [1]. In one particular instance, ThreadSanitizer (TSAN)
reported the following data race while running the
TxnParticipantTest.TestBeginCommitAnchorsOnFlush scenario:
WARNING: ThreadSanitizer: data race (pid=4116)
Write of size 8 at 0x7b4400027688 by main thread:
#0 std::__1::__vector_base<kudu::MemTracker*,
std::__1::allocator<kudu::MemTracker*> >::__destruct_at_end(kudu::MemTracker**)
...
#3 std::__1::vector<kudu::MemTracker*,
std::__1::allocator<kudu::MemTracker*> >::~vector()
#4 kudu::MemTracker::~MemTracker() mem_tracker.cc:83:1
...
#9 kudu::tablet::OpTracker::~OpTracker()
#10 kudu::tablet::TabletReplica::~TabletReplica()
...
#16 scoped_refptr<kudu::tablet::TabletReplica>::reset(kudu::tablet::TabletReplica*)
#17 kudu::tablet::TabletReplicaTestBase::RestartReplica(bool)
Previous read of size 8 at 0x7b4400027688 by thread T20 (mutexes: write
M1047222376632167904):
#0 std::__1::vector<kudu::MemTracker*,
std::__1::allocator<kudu::MemTracker*> >::end()
#1 kudu::MemTracker::Release(long)
#2 kudu::tablet::OpTracker::Release(kudu::tablet::OpDriver*)
#3 kudu::tablet::OpDriver::Finalize()
#4 kudu::tablet::OpDriver::ApplyTask()
#5 kudu::tablet::OpDriver::ApplyAsync()::$_2::operator()()
...
[1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=txn_participant-test
Change-Id: I993015bf73ad8fe84a864b8b3c030e1be00e26e0
Reviewed-on: http://gerrit.cloudera.org:8080/22612
Reviewed-by: Abhishek Chennaka <[email protected]>
Reviewed-by: Marton Greber <[email protected]>
Tested-by: Marton Greber <[email protected]>
---
src/kudu/tablet/tablet_replica.cc | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/src/kudu/tablet/tablet_replica.cc b/src/kudu/tablet/tablet_replica.cc
index e89d3c65e..3fe8608cd 100644
--- a/src/kudu/tablet/tablet_replica.cc
+++ b/src/kudu/tablet/tablet_replica.cc
@@ -344,13 +344,24 @@ void TabletReplica::Stop() {
if (consensus_) consensus_->Stop();
+ // First, close the prepare pool token, so no new operations can be accepted
+ // by the replica, and then start waiting for existing in-flight
+ // operations to complete. Otherwise, there would be a race condition
+ // if the token were still active after returning
+ // from op_tracker_.WaitForAllToFinish() call.
+ if (prepare_pool_token_) {
+ prepare_pool_token_->Close();
+ }
+
// TODO(KUDU-183): Keep track of the pending tasks and send an "abort" message.
LOG_SLOW_EXECUTION(WARNING, 1000,
Substitute("TabletReplica: tablet $0: Waiting for Ops to complete",
tablet_id())) {
op_tracker_.WaitForAllToFinish();
}
-
if (prepare_pool_token_) {
+ // In debug builds, make sure no queued operations are still pending.
+ DCHECK(prepare_pool_token_->WaitFor(MonoDelta::FromSeconds(0)));
+ // Explicitly shutdown the token.
prepare_pool_token_->Shutdown();
}