This is an automated email from the ASF dual-hosted git repository.

mgreber pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git


The following commit(s) were added to refs/heads/master by this push:
     new e9b3dae78 KUDU-3651 fix race condition in TabletReplica::Stop()
e9b3dae78 is described below

commit e9b3dae78edd84b4630e77601f285476ecae2f38
Author: Alexey Serbin <[email protected]>
AuthorDate: Mon Mar 10 22:22:40 2025 -0700

    KUDU-3651 fix race condition in TabletReplica::Stop()
    
    This patch addresses a race condition in TabletReplica::Stop().  Before
    this patch, new operations might be accepted by a tablet replica right
    after calling OpTracker::WaitForAllToFinish() and before completing the
    shutdown of the replica's prepare pool token.
    
    The race has manifested itself at least as flakiness in various
    test scenarios in txn_participant-test [1].  In one particular instance,
    the following TSAN warnings were issued while running the
    TxnParticipantTest.TestBeginCommitAnchorsOnFlush scenario:
    
    WARNING: ThreadSanitizer: data race (pid=4116)
      Write of size 8 at 0x7b4400027688 by main thread:
        #0 std::__1::__vector_base<kudu::MemTracker*, std::__1::allocator<kudu::MemTracker*> >::__destruct_at_end(kudu::MemTracker**)
        ...
        #3 std::__1::vector<kudu::MemTracker*, std::__1::allocator<kudu::MemTracker*> >::~vector()
        #4 kudu::MemTracker::~MemTracker() mem_tracker.cc:83:1
        ...
        #9 kudu::tablet::OpTracker::~OpTracker()
        #10 kudu::tablet::TabletReplica::~TabletReplica()
        ...
        #16 scoped_refptr<kudu::tablet::TabletReplica>::reset(kudu::tablet::TabletReplica*)
        #17 kudu::tablet::TabletReplicaTestBase::RestartReplica(bool)
    
      Previous read of size 8 at 0x7b4400027688 by thread T20 (mutexes: write M1047222376632167904):
        #0 std::__1::vector<kudu::MemTracker*, std::__1::allocator<kudu::MemTracker*> >::end()
        #1 kudu::MemTracker::Release(long)
        #2 kudu::tablet::OpTracker::Release(kudu::tablet::OpDriver*)
        #3 kudu::tablet::OpDriver::Finalize()
        #4 kudu::tablet::OpDriver::ApplyTask()
        #5 kudu::tablet::OpDriver::ApplyAsync()::$_2::operator()()
        ...
    
    [1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=txn_participant-test
    
    Change-Id: I993015bf73ad8fe84a864b8b3c030e1be00e26e0
    Reviewed-on: http://gerrit.cloudera.org:8080/22612
    Reviewed-by: Abhishek Chennaka <[email protected]>
    Reviewed-by: Marton Greber <[email protected]>
    Tested-by: Marton Greber <[email protected]>
---
 src/kudu/tablet/tablet_replica.cc | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/kudu/tablet/tablet_replica.cc b/src/kudu/tablet/tablet_replica.cc
index e89d3c65e..3fe8608cd 100644
--- a/src/kudu/tablet/tablet_replica.cc
+++ b/src/kudu/tablet/tablet_replica.cc
@@ -344,13 +344,24 @@ void TabletReplica::Stop() {
 
   if (consensus_) consensus_->Stop();
 
+  // First, close the prepare pool token, so no new operations can be accepted
+  // by the replica, and then start waiting for existing in-flight
+  // operations to complete. Otherwise, there would be a race condition
+  // if the token were still active after returning
+  // from op_tracker_.WaitForAllToFinish() call.
+  if (prepare_pool_token_) {
+    prepare_pool_token_->Close();
+  }
+
   // TODO(KUDU-183): Keep track of the pending tasks and send an "abort" message.
   LOG_SLOW_EXECUTION(WARNING, 1000,
       Substitute("TabletReplica: tablet $0: Waiting for Ops to complete", tablet_id())) {
     op_tracker_.WaitForAllToFinish();
   }
-
   if (prepare_pool_token_) {
+    // In debug builds, make sure no queued operations are still pending.
+    DCHECK(prepare_pool_token_->WaitFor(MonoDelta::FromSeconds(0)));
+    // Explicitly shutdown the token.
     prepare_pool_token_->Shutdown();
   }
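
The ordering that the patch establishes (close the token, wait for in-flight
operations, then shut the token down) can be illustrated with a small
standalone model. This is only a sketch under stated assumptions: ToyTracker,
ToyToken, ToyReplica, and DemoStopOrdering are hypothetical names invented for
this illustration and are not Kudu classes or APIs. The toy token simply runs
each task on its own thread and rejects submissions once closed.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an op tracker: counts in-flight operations.
class ToyTracker {
 public:
  void Add() {
    std::lock_guard<std::mutex> l(mu_);
    ++in_flight_;
  }
  void Release() {
    std::lock_guard<std::mutex> l(mu_);
    --in_flight_;
    cv_.notify_all();
  }
  // Blocks until the in-flight count drops to zero.
  void WaitForAllToFinish() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return in_flight_ == 0; });
  }
  int in_flight() {
    std::lock_guard<std::mutex> l(mu_);
    return in_flight_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int in_flight_ = 0;
};

// Hypothetical stand-in for a pool token: accepts tasks until closed.
class ToyToken {
 public:
  ~ToyToken() { Shutdown(); }
  // Returns false once Close() has been called.
  bool Submit(std::function<void()> task) {
    std::lock_guard<std::mutex> l(mu_);
    if (closed_) return false;
    workers_.emplace_back(std::move(task));
    return true;
  }
  // Stops accepting new tasks; already accepted tasks keep running.
  void Close() {
    std::lock_guard<std::mutex> l(mu_);
    closed_ = true;
  }
  // Closes the token and joins every worker thread.
  void Shutdown() {
    std::vector<std::thread> to_join;
    {
      std::lock_guard<std::mutex> l(mu_);
      closed_ = true;
      to_join.swap(workers_);
    }
    for (auto& t : to_join) t.join();
  }

 private:
  std::mutex mu_;
  bool closed_ = false;
  std::vector<std::thread> workers_;
};

// Toy replica; the tracker is declared first so it outlives the token.
struct ToyReplica {
  ToyTracker tracker;
  ToyToken token;

  // Tracks an operation and hands it to the token; undoes the tracking
  // if the token no longer accepts work.
  bool SubmitOp() {
    tracker.Add();
    if (!token.Submit([this] { tracker.Release(); })) {
      tracker.Release();
      return false;
    }
    return true;
  }

  // Same ordering as in the patch: close first, then wait, then shut down.
  void Stop() {
    token.Close();
    tracker.WaitForAllToFinish();
    token.Shutdown();
  }
};

// Returns true if an op is accepted before Stop() and rejected after it.
bool DemoStopOrdering() {
  ToyReplica replica;
  const bool accepted_before = replica.SubmitOp();
  replica.Stop();
  const bool accepted_after = replica.SubmitOp();
  return accepted_before && !accepted_after &&
         replica.tracker.in_flight() == 0;
}
```

In this model, calling Close() only after WaitForAllToFinish() would leave a
window in which SubmitOp() could still start a background task that touches
the tracker after the wait has returned, which mirrors the race described in
the commit message. DemoStopOrdering() can be called from a main() function to
exercise the sketch.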
 
