This is an automated email from the ASF dual-hosted git repository.
xyz pushed a commit to branch branch-2.10
in repository https://gitbox.apache.org/repos/asf/pulsar.git
The following commit(s) were added to refs/heads/branch-2.10 by this push:
new 8eefaff [C++] Fix the race condition of connect timeout task (#14823)
8eefaff is described below
commit 8eefaff2225b60f51aedf12a8414c2edf7627273
Author: Yunze Xu <[email protected]>
AuthorDate: Thu Mar 24 03:59:41 2022 +0800
[C++] Fix the race condition of connect timeout task (#14823)
Fixes #14665
### Motivation
In the C++ client, a connect timeout task is created before each
asynchronous connect operation. If the connection cannot be established
within the configured timeout, the task's callback is called to close
the connection, and the `createProducer` or `subscribe` method then
returns `ResultConnectError`.
`ClientConnection::connectTimeoutTask_`, which is a shared pointer,
represents the timeout task. However, after `ClientConnection::close` is
called, the shared pointer is reset and the underlying `PeriodicTask`
object is released. When the `stop` method is then called on the
released `PeriodicTask` object in the callback (`handleTcpConnected`), a
segmentation fault happens.
The root cause is that `connectTimeoutTask_` can be accessed from two
threads while one of them may release the memory. See #14665 for a more
detailed explanation. This race condition also makes the Python tests
flaky, because they include a similar test. See
https://github.com/apache/pulsar/blob/f7cbc1eb83ffd27b784d90d5d2dea8660c590ad2/pulsar-client-cpp/python/pulsar_test.py#L1207-L1221
So this PR might also fix #14714.
### Modifications
Remove `connectTimeoutTask_.reset()` in `ClientConnection::close`. After
that, `connectTimeoutTask_` always points to the same `PeriodicTask`
object, whose methods are thread safe.
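The idea behind this modification can be illustrated with a minimal, self-contained sketch. It is not the actual Pulsar code: `TimeoutTask`, `Connection`, `handleConnected`, and `stopAfterClose` are hypothetical stand-ins for `PeriodicTask`, `ClientConnection`, `handleTcpConnected`, and a test helper. The sketch replays the problematic ordering (close first, connect callback second) sequentially; in the real client the two calls run on different threads.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical stand-in for PeriodicTask: stop() may be called from any
// thread and any number of times.
class TimeoutTask {
   public:
    void stop() { stopped_.store(true); }
    bool stopped() const { return stopped_.load(); }

   private:
    std::atomic<bool> stopped_{false};
};

// Hypothetical stand-in for ClientConnection.
class Connection {
   public:
    Connection() : timeoutTask_(std::make_shared<TimeoutTask>()) {}

    // Closing the connection only stops the task. The pointer is not
    // reset, so the task object outlives the close() call.
    void close() { timeoutTask_->stop(); }

    // Connect callback; in the real client it may run after close().
    void handleConnected() {
        assert(timeoutTask_ != nullptr);  // invariant: never reset
        timeoutTask_->stop();
    }

    bool taskAlive() const { return static_cast<bool>(timeoutTask_); }
    bool taskStopped() const { return timeoutTask_->stopped(); }

   private:
    // Assigned once in the constructor; const rules out a later reset().
    const std::shared_ptr<TimeoutTask> timeoutTask_;
};

// Replays the ordering that crashed before the fix: close(), then the
// connect callback. Returns true if the task is still alive and stopped.
inline bool stopAfterClose() {
    Connection cnx;
    cnx.close();
    cnx.handleConnected();
    return cnx.taskAlive() && cnx.taskStopped();
}
```

Declaring the member `const` is a design choice of this sketch only. It makes the "pointer never changes" invariant a compile-time guarantee, whereas the patch keeps the existing member and simply stops resetting it.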
### Verifying this change
Execute the following command
```bash
./tests/main --gtest_filter='ClientTest.testConnectTimeout' --gtest_repeat=10
```
to run `testConnectTimeout` 10 times. In my local environment it never
failed, whereas before applying this patch it failed very easily.
(cherry picked from commit 0c3aad1e0ba0ee53784b963a1238d3d76b6dd8b2)
---
pulsar-client-cpp/lib/ClientConnection.cc | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/pulsar-client-cpp/lib/ClientConnection.cc b/pulsar-client-cpp/lib/ClientConnection.cc
index cf12f29..79297f8 100644
--- a/pulsar-client-cpp/lib/ClientConnection.cc
+++ b/pulsar-client-cpp/lib/ClientConnection.cc
@@ -1551,10 +1551,7 @@ void ClientConnection::close(Result result) {
         consumerStatsRequestTimer_.reset();
     }
-    if (connectTimeoutTask_) {
-        connectTimeoutTask_->stop();
-        connectTimeoutTask_.reset();
-    }
+    connectTimeoutTask_->stop();
     lock.unlock();
     LOG_INFO(cnxString_ << "Connection closed");