This is an automated email from the ASF dual-hosted git repository.

xyz pushed a commit to branch branch-2.9
in repository https://gitbox.apache.org/repos/asf/pulsar.git

commit 869afe4984eab7eb1b525ad4cc468592d2a8afb2
Author: Yunze Xu <[email protected]>
AuthorDate: Thu Mar 24 03:59:41 2022 +0800

    [C++] Fix the race condition of connect timeout task (#14823)
    
    Fixes #14665
    
    ### Motivation
    
    In C++ client, a connect timeout task is created each time before an
    asynchronous connect operation is performed, if the connection cannot be
    established in the configured timeout, the callback of the task will be
    called to close the connection and then the `createProducer` or
    `subscribe` methods will return `ResultConnectError`.
    
    `ClientConnection::connectTimeoutTask_`, which is a shared pointer,
    represents the timeout task. However, after `ClientConnection::close` is
    called, the shared pointer will be reset, and the underlying `PeriodicTask`
    object will be released. After that, when `stop` method is called on the
    released `PeriodicTask` object in the callback (`handleTcpConnected`), a
    segmentation fault will happen.
    
    The root cause is that `connectTimeoutTask_` can be accessed in two
    threads while one of them could release the memory. See #14665 for more
    explanations. This race condition leads to flaky Python tests as well,
    because we also have the similar test in Python tests. See
    
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221
    
    So this PR might also fix #14714.
    
    ### Modifications
    
    Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
    that, the `connectTimeoutTask_` will always points to the same
    `PeriodicTask` object, whose methods are thread safe.
    
    ### Verifying this change
    
    Execute the following command
    
    ```bash
    ./tests/main --gtest_filter='ClientTest.testConnectTimeout' 
--gtest_repeat=10
    ```
    
    to runs the `testConnectTimeout` for 10 times. In my local env, it never
    failed, while before applying this patch, it's very easy to fail.
    
    (cherry picked from commit 0c3aad1e0ba0ee53784b963a1238d3d76b6dd8b2)
---
 pulsar-client-cpp/lib/ClientConnection.cc | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/pulsar-client-cpp/lib/ClientConnection.cc 
b/pulsar-client-cpp/lib/ClientConnection.cc
index bcee5ca..efc3cd5 100644
--- a/pulsar-client-cpp/lib/ClientConnection.cc
+++ b/pulsar-client-cpp/lib/ClientConnection.cc
@@ -1549,10 +1549,7 @@ void ClientConnection::close(Result result) {
         consumerStatsRequestTimer_.reset();
     }
 
-    if (connectTimeoutTask_) {
-        connectTimeoutTask_->stop();
-        connectTimeoutTask_.reset();
-    }
+    connectTimeoutTask_->stop();
 
     lock.unlock();
     LOG_INFO(cnxString_ << "Connection closed");

Reply via email to