bneradt opened a new issue, #12776:
URL: https://github.com/apache/trafficserver/issues/12776
## Problem
CentOS 7 ASF CI runs are failing with segmentation faults during regression
test execution. The crashes occur frequently when running
`traffic_server -K -R 3`.
## Investigation
Diagnostic logging indicates that the crash is caused by a use-after-free bug
in the `DbgCtl` reference counting system. Despite careful design to handle
static initialization/destruction order issues, the reference counting
mechanism does not appear to prevent a use-after-free of the `std::map`
registry, at least in ATS as built with the older toolchain on CentOS 7.
**The rest of this report was generated by Cursor. I include it in case it is
helpful.** I am not convinced that its root cause, in which one thread deletes
the tag registry while other threads are still using it, is correct. It is
clear, however, that simply not freeing the `std::map` registry makes the
test stable.
## Root Cause
### The Reference Counting Issue
The `DbgCtl` class uses a shared registry with reference counting:
- Each `DbgCtl` object increments `registry_reference_count` on construction.
- Each `DbgCtl` destructor calls `_rm_reference()`, which decrements the
count.
- When `ref_count` reaches 0, the registry (including the `std::map` of all
tags) is deleted.
**The fatal flaw:** The reference count tracks how many `DbgCtl` objects
currently exist, but it cannot account for whether those objects will be
accessed before they're destroyed.
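The scheme described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the ATS source: the class name `CountedCtl`, the `TagMap` alias, and the helper functions `live_count()` and `registry_alive()` are hypothetical, and the counter is left unsynchronized for brevity.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified model of the scheme described above.
// Class and member names are illustrative; this is not the ATS source.
class CountedCtl {
public:
  using TagMap = std::map<std::string, bool>;

  explicit CountedCtl(const std::string &tag) {
    // The first live object creates the shared registry.
    if (ref_count++ == 0) {
      registry = new TagMap;
    }
    // Point at the map entry for this tag, inserting it if needed.
    _ptr = &*registry->emplace(tag, false).first;
  }

  ~CountedCtl() {
    // The last live object deletes the registry, whenever that happens.
    if (--ref_count == 0) {
      delete registry;
      registry = nullptr;
    }
  }

  CountedCtl(const CountedCtl &) = delete;
  CountedCtl &operator=(const CountedCtl &) = delete;

  bool on() const { return _ptr->second; }

  static int live_count() { return ref_count; }
  static bool registry_alive() { return registry != nullptr; }

private:
  const TagMap::value_type *_ptr = nullptr;

  // Not synchronized: a real implementation would need atomics or a lock.
  static int ref_count;
  static TagMap *registry;
};

int CountedCtl::ref_count = 0;
CountedCtl::TagMap *CountedCtl::registry = nullptr;
```

In this model the map is freed the moment the count returns to zero, whether or not the process is exiting. Any pointer into the map that is still reachable after that point dangles.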
### How the Crash Occurs
1. **Multiple Threads/Contexts**: The application has DbgCtl objects across
multiple threads. Some are static/global, some are thread-local, and some may
be in function scopes.
2. **Thread Exit During Execution**: When a thread exits mid-execution (not
just at program exit), all its DbgCtl objects destruct:
```
Thread A exits → its DbgCtl objects destruct → _rm_reference() called repeatedly
ref_count: 17 → 16 → 15 → ... → 1 → 0
```
3. **Registry Deleted Prematurely**: When the last DbgCtl from Thread A
destructs, `ref_count` hits 0, and the registry (with its `std::map` containing
all 229 tag entries) is deleted.
4. **Dangling Pointers in Other Threads**: Thread B is still running and has
DbgCtl objects with `_ptr` pointing into the now-deleted registry:
```cpp
bool DbgCtl::on() const {
  // ...
  if (!_ptr->second) { // ← Accessing freed memory!
    return false;
  }
  // ...
}
```
5. **Crash**: Thread B calls `.on()` or `.tag()` on its DbgCtl → accesses
freed memory → segmentation fault.
### Evidence from Diagnostics
From the diagnostic logs during crash reproduction:
```
Registry 1 (main traffic_server process):
- Created with ref_count starting at 0
- Grew to ref_count = 351 (351 DbgCtl objects created)
- Contains 229 unique tags in the registry map
During test execution (NOT at program exit):
ref_count: 17 → 16 → ... → 2 → 1 → 0
DEBUG: ~Registry() - deleting registry at 0xc6b930 with 229 entries
DEBUG: ~Registry() - finished, registry deleted
[Tests continue running...]
[CRASH - traffic_crashlog invoked]
```
The crash happens **during active test execution**, not during static
destruction at program exit. This indicates that:
- Some threads/scopes had their DbgCtl objects destruct
- The registry was deleted when their ref_count contributions were removed
- Other threads were still running with dangling `_ptr` pointers
### Why Reference Counting Cannot Solve This
The reference counting is working **exactly as designed**, but the design
cannot handle this scenario:
- **What it tracks**: Number of currently existing DbgCtl objects
- **What it cannot track**: Whether those objects will be accessed before
destruction
- **The gap**: When Thread A's DbgCtls destruct and ref_count hits 0, Thread
B's DbgCtls still exist but their `_ptr` members now point to freed memory
No amount of careful reference counting can solve this because:
1. C++ provides no way to know if a pointer will be dereferenced in the
future
2. Thread exit order is non-deterministic
3. Static destruction order across compilation units is undefined
4. Some DbgCtl objects may be in shared libraries/plugins with different
lifetimes
## Proposed Fix
Use the Leaky Singleton pattern for the `std::map` registry: allocate it once,
never free it, and let the OS reclaim the memory on process shutdown. It has
been verified that this addresses the crash.
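A minimal sketch of the Leaky Singleton pattern follows, assuming a registry that maps tag names to an enabled flag. The names `Registry`, `registry()`, and `register_tag()` are hypothetical and are not taken from the ATS source. The mutex is an added assumption to make concurrent registration safe in the sketch.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical names for illustration; not the ATS implementation.
using TagMap = std::map<std::string, bool>;

struct Registry {
  std::mutex mtx; // guards insertions into tags
  TagMap tags;
};

Registry &registry() {
  // Allocated on first use and intentionally never deleted, so no
  // destructor can run while another thread still holds a pointer
  // into the map. Initialization of a function-local static is
  // thread-safe since C++11.
  static Registry *const instance = new Registry;
  return *instance;
}

const TagMap::value_type *register_tag(const std::string &tag) {
  Registry &r = registry();
  std::lock_guard<std::mutex> lock(r.mtx);
  // std::map never relocates existing nodes on insert, so the returned
  // pointer stays valid for the rest of the process lifetime.
  return &*r.tags.emplace(tag, false).first;
}
```

Because the registry is never destroyed, pointers returned by `register_tag()` cannot dangle regardless of thread exit order or static destruction order. The cost is one allocation that leak checkers may report at exit.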