trikker opened a new issue, #2214:
URL: https://github.com/apache/brpc/issues/2214
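**Describe the bug**
Our project uses brpc 0.9.7. The code was written by a former colleague, and when I run it after compiling I hit a double free. The log is below. What is odd is that "Succeeded to remove the node" is printed twice, while "Starts to remove timeout node" is printed only once.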
```
E0329 19:17:15.666198 1758 /projectcode/replication/log_replicator.cc:641] Starts to remove timeout node=10.99.122.213:6211
E0329 19:17:15.666200 1758 /projectcode/replication/log_replicator.cc:641] Succeeded to remove the node in replication group=10.99.122.213:6211
E0329 19:17:15.666203 1758 /projectcode/replication/log_replicator.cc:641] Succeeded to remove the node in replication group=^P3
$_^?^@^@2.213:6211
free(): double free detected in tcache 2
```
The crash stack is below. It looks like a std::string is destroyed twice, but I cannot tell whether it is removed_ip or the temporary returned by r->options_.peer_id.ToString().
```
(gdb) bt
#0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1 0x00005558e5e77e89 in write_core (sig=sig@entry=6) at /projectcode/stacktrace.cc:306
#2 0x00005558e512660d in handle_fatal_signal (sig=6) at /projectcode/signal.cc:169
#3 <signal handler called>
#4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5 0x00007ff92e801537 in __GI_abort () at abort.c:79
#6 0x00007ff92e85a768 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ff92e9783a5 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#7 0x00007ff92e861a5a in malloc_printerr (str=str@entry=0x7ff92e97a6f0 "free(): double free detected in tcache 2") at malloc.c:5347
#8 0x00007ff92e863055 in _int_free (av=0x7ff27c000020, p=0x7ff27c032400, have_lock=0) at malloc.c:4201
#9 0x00005558e4bf63e8 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string (this=0x7ff2b07f8ea0, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/basic_string.h:657
#10 xxxmodule::LogReplicator::RemoveTimeoutNode (args=<optimized out>) at /projectcode/replication/log_replicator.cc:620
#11 0x00007ff92fa9088f in bthread::TaskGroup::task_runner (skip_remained=<optimized out>) at /projectcode/deps/brpc-0.9.7/src/bthread/task_group.cpp:295
#12 0x00007ff92fa77821 in bthread_make_fcontext () from /home/test/deps/lib/libbrpc.so
#13 0x0000000000000000 in ?? ()
```
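The two most important functions are below. When the current node has not received an RPC reply from some node for a period of time, it calls RemoveTimeoutNode via bthread_start_background to remove that node.
The line std::string removed_ip = r->options_.peer_id.ToString() below is line 620.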
```
void* LogReplicator::RemoveTimeoutNode(void* args) {
    LogReplicator* r = (LogReplicator*)args;
    std::string removed_ip = r->options_.peer_id.ToString();
    LOG(INFO) << "Starts to remove timeout node=" << removed_ip;
    std::string conf_str(opt_replication_local_address);
    int ret = MyNode::Instance()->RemovePeer(conf_str, removed_ip, true);
    if (ret != 0) {
        LOG(ERROR) << "Failed to remove the node in replication group=" << removed_ip;
    } else {
        LOG(ERROR) << "Succeeded to remove the node in replication group=" << removed_ip;
    }
}
void LogReplicator::OnHeartbeatReturned(ReplicatorId id, brpc::Controller* cntl,
                                        AppendLogsRequest* request,
                                        AppendLogsResponse* response,
                                        int64_t rpc_send_time) {
    // Take ownership so the RPC objects are released on every return path.
    std::unique_ptr<brpc::Controller> cntl_guard(cntl);
    std::unique_ptr<AppendLogsRequest> req_guard(request);
    std::unique_ptr<AppendLogsResponse> res_guard(response);
    LogReplicator *r = nullptr;
    bthread_id_t dummy_id = { id };
    const long start_time_us = butil::gettimeofday_us();
    if (bthread_id_lock(dummy_id, (void**)&r) != 0) {
        return;
    }
    uint timeout = rpc_timeout_seconds;
    // Start the removal bthread at most once per replicator after the timeout elapses.
    if (!r->will_be_removed_ &&
        ((butil::monotonic_time_ms() - r->last_rpc_send_timestamp_) >=
         (1000 * timeout))) {
        r->will_be_removed_ = true;
        bthread_t tid;
        bthread_attr_t attr = BTHREAD_ATTR_NORMAL;
        if (bthread_start_background(&tid, &attr, RemoveTimeoutNode, r) != 0) {
            LOG(ERROR) << "Fail to remove timeout node.";
        }
    }
    if (cntl->Failed()) {
        LOG_IF(WARNING, (++r->consecutive_error_times_) % 10 == 0)
            << " Fail to issue RPC to " << r->options_.peer_id
            << " consecutive_error_times=" << r->consecutive_error_times_ << ", "
            << cntl->ErrorText();
        r->StartHeartbeatTimer(start_time_us);
        CHECK_EQ(0, bthread_id_unlock(dummy_id)) << "Fail to unlock " << dummy_id;
        return;
    }
    r->consecutive_error_times_ = 0;
    r->UpdateLastRPCSendTimestamp(rpc_send_time);
    r->StartHeartbeatTimer(start_time_us);
    if (request->has_recv_start_num()) {
        r->heartbeat_stage_ = 1;
    } else if (request->has_empty_dummy()) {
        r->slave_info.applied_num = response->applied_num();
        r->slave_info.lazy_applied_num = response->lazy_applied_num();
        global_num_manager->RecordReplicaInfo(response->applied_num(),
                                              response->lazy_applied_num(),
                                              r->index_);
    } else if (response->has_applied_num()) {
        if (unlikely(r->slave_preparing_)) {
            r->slave_preparing_ = false;
            assert(r->thd_);
            r->thd_->mdl_context.release_transactional_locks();
            LOG(INFO) << "Replicator=" << r->id_ << "@" << r->options_.peer_id
                      << " success to release mdl lock";
        }
        r->slave_info.applied_num = response->applied_num();
        r->slave_info.lazy_applied_num = response->lazy_applied_num();
        global_num_manager->RecordReplicaInfo(response->applied_num(),
                                              response->lazy_applied_num(),
                                              r->index_);
        // r->Processnum(&r->slave_info);
    }
    CHECK_EQ(0, bthread_id_unlock(dummy_id)) << "Fail to unlock " << dummy_id;
}
```
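One detail visible in the code as posted: RemoveTimeoutNode is declared to return void*, but no path through it returns a value. Flowing off the end of a non-void function is undefined behavior in C++, and GCC 8 and later assume when optimizing that control never reaches that point. This would be consistent with the -O0 versus -O1/-O2 and gcc 6.3 versus 10.2 observations, though it is not confirmed as the cause here. The following is a minimal sketch of the well-formed shape, not the project's code; ReplicatorSketch and RemoveTimeoutNodeSketch are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for LogReplicator; only the fields used here are modeled.
struct ReplicatorSketch {
    std::string peer;
    bool removed = false;
};

// Same signature shape as a bthread/pthread entry point: takes void*, returns void*.
// Every control path ends in a return statement, so the function never flows
// off its end (which would be undefined behavior for a non-void function).
void* RemoveTimeoutNodeSketch(void* args) {
    ReplicatorSketch* r = static_cast<ReplicatorSketch*>(args);
    std::string removed_ip = r->peer;   // local copy, destroyed once at return
    r->removed = !removed_ip.empty();   // placeholder for the real removal call
    return nullptr;                     // explicit result for the void* return type
}
```

GCC reports functions with a missing return through -Wreturn-type, and -Werror=return-type turns that warning into a build error.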
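I can confirm that MyNode::Instance()->RemovePeer above does not modify removed_ip or conf_str. RemovePeer takes its arguments by reference; changing it to pass by value still produces the double free.
PeerId's ToString method looks like this: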
```
struct PeerId {
    ...
    ...
    std::string ToString() const {
        return std::string(butil::endpoint2str(addr).c_str());
    }
};
butil::endpoint2str is unmodified; it is the code from the brpc library.
```
EndPointStr endpoint2str(const EndPoint& point) {
    EndPointStr str;
    if (ExtendedEndPoint::is_extended(point)) {
        ExtendedEndPoint* eep = ExtendedEndPoint::address(point);
        if (eep) {
            eep->to(&str);
        } else {
            str._buf[0] = '\0';
        }
        return str;
    }
    if (inet_ntop(AF_INET, &point.ip, str._buf, INET_ADDRSTRLEN) == NULL) {
        return endpoint2str(EndPoint(IP_NONE, 0));
    }
    char* buf = str._buf + strlen(str._buf);
    *buf++ = ':';
    snprintf(buf, 16, "%d", point.port);
    return str;
}
```
Definition of EndPointStr:
```
struct EndPointStr {
    const char* c_str() const { return _buf; }
    char _buf[sizeof("unix:") + sizeof(sockaddr_un::sun_path)];
};
```
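To make the ownership in this chain explicit, here is a minimal sketch of the same pattern: a fixed-buffer struct returned by value, then copied into a std::string before the temporary goes away. FixedStr, FormatEndpoint and ToStringSketch are hypothetical simplified analogues, not brpc code, and the 64-byte buffer size is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical simplified analogue of EndPointStr: an inline buffer, no heap memory.
struct FixedStr {
    const char* c_str() const { return _buf; }
    char _buf[64];
};

// Returns the buffer by value, as endpoint2str does.
FixedStr FormatEndpoint(const char* ip, int port) {
    FixedStr s;
    std::snprintf(s._buf, sizeof(s._buf), "%s:%d", ip, port);
    return s;
}

// Mirrors the shape of PeerId::ToString(): the characters are copied into a
// new std::string while the temporary FixedStr is still alive.
std::string ToStringSketch(const char* ip, int port) {
    return std::string(FormatEndpoint(ip, port).c_str());
}
```

In this pattern the only heap allocation belongs to the returned std::string, so each caller ends up with its own independently owned copy.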
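**To Reproduce**
The code above reproduces the problem every time, but there is too much code to provide a standalone reproduction.
A few points are worth noting:
(1) Building in release mode always triggers the problem; building in debug mode never does.
(2) Building with -O0 never triggers it; building with -O1 or -O2 always does.
(3) Building on Debian 10 or 11 (gcc 10.2.1-6) always triggers it; building on Debian 9 (gcc 6.3.0-18) never does.
(4) I tried adding some gcc/g++ flags and found that building with -fPIE never triggers it, which is strange.
(5) I upgraded brpc to 1.2 and the problem is still there.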
**Expected behavior**
**Versions**
OS: Debian11
Compiler: gcc/g++ 10.2.1-6
brpc: 0.9.7
protobuf: 3.6.1
**Additional context/screenshots**
AddressSanitizer output for the double free:
```
==2290197==ERROR: AddressSanitizer: attempting double-free on 0x6030004f5ac0 in thread T62:
    #0 0x7f5b8f3d5017 in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:160
    #1 0x5633a04a5d9a in xxxmodule::LogReplicator::RemoveTimeoutNode(void*) /projectcode/log_replicator.cc:620
    #2 0x5633a36cb7c6 in bthread::TaskGroup::task_runner(long) /projectcode/brpc-0.9.7/src/bthread/task_group.cpp:295
    #3 0x5633a383ed50 in bthread_make_fcontext (/usr/local/mybinary/bin/xxxd+0x71d1d50)
0x6030004f5ac0 is located 0 bytes inside of 18-byte region [0x6030004f5ac0,0x6030004f5ad2)
freed by thread T62 here:
    #0 0x7f5b8f3d5017 in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:160
previously allocated by thread T27 here:
    #0 0x7f5b8f3d4647 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
Thread T62 created by T0 here:
    #0 0x7f5b8f37e2a2 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:214
    #1 0x5633a36c040b in bthread::TaskControl::add_workers(int) /projectcode/brpc-0.9.7/src/bthread/task_control.cpp:199
    #2 0x5633a36b27bc in bthread_setconcurrency /projectcode/brpc-0.9.7/src/bthread/bthread.cpp:310
    #3 0x5633a379f8e2 in brpc::Server::StartInternal(butil::EndPoint const&, brpc::PortRange const&, brpc::ServerOptions const*) /projectcode/brpc-0.9.7/src/brpc/server.cpp:920
    #4 0x5633a37a0c8c in brpc::Server::Start(butil::EndPoint const&, brpc::ServerOptions const*) /projectcode/brpc-0.9.7/src/brpc/server.cpp:1083
    #5 0x5633a37a0e27 in brpc::Server::Start(int, brpc::ServerOptions const*) /projectcode/brpc-0.9.7/src/brpc/server.cpp:1102
    #6 0x5633a04d0cec in xxxmodule::MyNode::Start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /projectcode/my_node.cc:54
    #7 0x5633a0385a08 in xyz
Thread T27 created by T0 here:
    #0 0x7f5b8f37e2a2 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:214
    #1 0x5633a36c040b in bthread::TaskControl::add_workers(int) /projectcode/brpc-0.9.7/src/bthread/task_control.cpp:199
    #2 0x5633a36b27bc in bthread_setconcurrency /projectcode/brpc-0.9.7/src/bthread/bthread.cpp:310
    #3 0x5633a379f8e2 in brpc::Server::StartInternal(butil::EndPoint const&, brpc::PortRange const&, brpc::ServerOptions const*) /projectcode/brpc-0.9.7/src/brpc/server.cpp:920
    #4 0x5633a37a0c8c in brpc::Server::Start(butil::EndPoint const&, brpc::ServerOptions const*) /projectcode/brpc-0.9.7/src/brpc/server.cpp:1083
    #5 0x5633a37a0e27 in brpc::Server::Start(int, brpc::ServerOptions const*) /projectcode/brpc-0.9.7/src/brpc/server.cpp:1102
    #6 0x5633a04d0cec in xxxmodule::MyNode::Start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /projectcode/my_node.cc:54
    #7 0x5633a0385a08 in xyz
SUMMARY: AddressSanitizer: double-free ../../../../src/libsanitizer/asan/asan_new_delete.cpp:160 in operator delete(void*)
==2290197==ABORTING
```
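My current suspicion is that something is wrong in brpc's bthread at this point: it looks as if multiple bthreads execute LogReplicator::RemoveTimeoutNode at the same time, yet the log output suggests that only the second half of the function (including the destructors) ran again, which is very strange.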
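One way to check the multiple-bthreads hypothesis would be to count how often the function is actually entered and log that count next to the "Starts to remove" message. The sketch below only illustrates the idea; g_entries and CountedEntrySketch are hypothetical names that do not exist in the project or in brpc.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical diagnostic counter; not part of the original code.
static std::atomic<int> g_entries{0};

// Stand-in for a bthread entry function. It bumps the counter on every real
// entry and reports the running total through the int that args points to.
void* CountedEntrySketch(void* args) {
    int seen = g_entries.fetch_add(1, std::memory_order_relaxed) + 1;
    *static_cast<int*>(args) = seen;
    return nullptr;
}
```

If such a counter stayed at 1 while the tail of the function still appeared to run twice, concurrent execution would be an unlikely explanation.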