Gmymymy opened a new pull request, #3306:
URL: https://github.com/apache/brpc/pull/3306
## Problem
When a bvar object (`Reducer`, `IntRecorder`, `Percentile`) is destroyed
while another thread is concurrently writing to it via `operator<<`, the
following crash occurs:
```
terminate called after throwing an instance of 'std::bad_weak_ptr'
what(): bad_weak_ptr
Aborted (core dumped)
```
This has been observed in high-concurrency RDMA performance testing
(see #3288).
## Root Cause
`AgentCombiner` inherits from `std::enable_shared_from_this`.
In `get_or_create_tls_agent()`, when an agent's `combiner` weak_ptr has
expired (the previous combiner was destroyed and the slot was reused), the
method calls `this->shared_from_this()` to re-bind the agent to the
current combiner.
`shared_from_this()` throws `std::bad_weak_ptr` if the object is not
currently managed by any `shared_ptr`. This can happen in a race where:
1. The `AgentCombiner` is the last object keeping a bvar alive
2. Another thread releases the last `shared_ptr` to it (e.g., the owning
`Reducer` goes out of scope or is destroyed during program shutdown)
3. A third thread is simultaneously inside `get_or_create_tls_agent()`,
past the `agent->combiner.expired()` check, and calls `shared_from_this()`
on the now-unmanaged object → `bad_weak_ptr` → `terminate()`
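
The exception itself can be reproduced deterministically, without the race. The following is a minimal sketch, assuming C++17 (where `shared_from_this()` on an unowned object is specified to throw). `Combiner` is a hypothetical stand-in, not brpc's `AgentCombiner`, and the helper names are illustrative only.

```cpp
#include <memory>

// Hypothetical stand-in for AgentCombiner; not the real brpc class.
struct Combiner : std::enable_shared_from_this<Combiner> {};

// Returns true if shared_from_this() threw std::bad_weak_ptr.
bool rebind_throws(Combiner& c) {
    try {
        (void)c.shared_from_this();
        return false;
    } catch (const std::bad_weak_ptr&) {
        return true;
    }
}

// Object owned by a live shared_ptr: shared_from_this() succeeds.
bool managed_case_throws() {
    auto owner = std::make_shared<Combiner>();
    return rebind_throws(*owner);
}

// Object with no owning shared_ptr (here: on the stack), which is the state
// the racing thread observes once the last owner has been released.
bool unmanaged_case_throws() {
    Combiner c;
    return rebind_throws(c);
}
```

If the exception is left uncaught, as in the original code path, it propagates out of the writing thread and ends in `terminate()`.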
## Fix
Wrap `shared_from_this()` in a `try/catch(std::bad_weak_ptr)`. When caught,
return `NULL` and silently skip the recording. This is safe: the metric is
being torn down, so dropping a write in flight is acceptable and far
preferable to crashing.
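
The shape of the fix can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual patch: `Agent`, `Combiner`, `rebind`, and `record` are hypothetical names, and the real `get_or_create_tls_agent()` also handles thread-local lookup and allocation, which are omitted here. C++17 is assumed.

```cpp
#include <memory>

struct Combiner;

// Hypothetical per-thread agent holding a weak reference to its combiner.
struct Agent {
    std::weak_ptr<Combiner> combiner;
};

// Hypothetical stand-in for AgentCombiner.
struct Combiner : std::enable_shared_from_this<Combiner> {
    // Re-binds `agent` to this combiner if its previous binding has expired.
    // Returns nullptr when no shared_ptr owns this combiner any more.
    Agent* rebind(Agent* agent) {
        if (agent->combiner.expired()) {
            try {
                agent->combiner = shared_from_this();
            } catch (const std::bad_weak_ptr&) {
                return nullptr;  // combiner is being torn down
            }
        }
        return agent;
    }
};

// Hypothetical caller mirroring operator<<: a null agent means the write
// is dropped silently instead of aborting. Returns true if recorded.
bool record(Combiner& c, Agent& agent, int value, int& sink) {
    Agent* a = c.rebind(&agent);
    if (a == nullptr) {
        return false;
    }
    sink += value;
    return true;
}
```

With a combiner owned by a `shared_ptr`, `record` binds the agent and applies the value. With an unowned combiner, it returns `false` and leaves `sink` unchanged.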
Also remove the `LOG(FATAL)` call that each of the three `operator<<` callers
makes when `get_or_create_tls_agent()` returns `NULL`:
- For **allocation failure**, `get_or_create_tls_agent()` already calls
`LOG(FATAL)` internally (which aborts); the outer `LOG(FATAL)` was
unreachable in that case.
- For the new **combiner-expired** path, calling `LOG(FATAL)` would
incorrectly abort the process for a benign race during teardown.
## Affected files
| File | Change |
|------|--------|
| `src/bvar/detail/combiner.h` | catch `bad_weak_ptr` in `get_or_create_tls_agent()` |
| `src/bvar/reducer.h` | remove outer `LOG(FATAL)` from `operator<<` |
| `src/bvar/recorder.h` | remove outer `LOG(FATAL)` from `operator<<` |
| `src/bvar/detail/percentile.cpp` | remove outer `LOG(FATAL)` from `operator<<` |
Fixes #3288
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]