bneradt opened a new issue, #12776:
URL: https://github.com/apache/trafficserver/issues/12776

   ## Problem
   
   CentOS 7 ASF CI runs are failing with segmentation faults during regression 
test execution. The crashes occur frequently when running 
`traffic_server -K -R 3`.
   
   ## Investigation
   
   Diagnostic logging identified the cause of the crash as a use-after-free in 
the `DbgCtl` reference counting system. Although the design takes care to 
handle static initialization and destruction order issues, the reference 
counting does not appear to prevent a use-after-free of the registry's 
`std::map`, at least in the ATS build compiled on the older CentOS 7 system.
   
   **The rest of this issue is from Cursor. I include it in case it is 
helpful.** I am not entirely convinced that its root cause, in which one thread 
deleting the tag registry affects the other threads, is correct. It is clear, 
however, that simply not freeing the `std::map` registry makes the test 
stable.
   
   ## Root Cause
   
   ### The Reference Counting Issue
   
   The `DbgCtl` class uses a shared registry with reference counting:
   - Each `DbgCtl` object increments `registry_reference_count` on construction.
   - Each `DbgCtl` destructor calls `_rm_reference()`, which decrements the 
count.
   - When the count (`ref_count` in the logs below) reaches 0, the registry 
(including the `std::map` of all tags) is deleted.
   
   **The fatal flaw:** The reference count tracks how many `DbgCtl` objects 
currently exist, but it cannot account for whether those objects will be 
accessed before they're destroyed.
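   
   The scheme in the bullets above can be sketched roughly as follows. This is 
a minimal single-threaded illustration, not the ATS implementation: 
`TagHandle`, `Registry`, `live_count()`, and `registry_exists()` are 
hypothetical names, and the locking a real shared registry would need is 
omitted.
   
   ```cpp
   #include <cassert>
   #include <map>
   #include <string>
   #include <utility>

   // Hypothetical stand-in for a reference-counted tag registry (not ATS code).
   class TagHandle {
   public:
     explicit TagHandle(const std::string &tag) {
       Registry *&reg = registry();
       if (reg == nullptr) {
         reg = new Registry; // first handle creates the shared registry
       }
       ++reg->ref_count;
       // std::map nodes are stable, so this pointer stays valid until the
       // element is erased or the map itself is destroyed.
       _ptr = &*reg->tags.emplace(tag, false).first;
     }

     TagHandle(const TagHandle &)            = delete;
     TagHandle &operator=(const TagHandle &) = delete;

     ~TagHandle() {
       Registry *&reg = registry();
       if (--reg->ref_count == 0) {
         delete reg; // last handle gone: the map and all tag entries are freed
         reg = nullptr;
       }
     }

     bool on() const { return _ptr->second; }
     const std::string &tag() const { return _ptr->first; }

     static int live_count() {
       Registry *reg = registry();
       return reg ? reg->ref_count : 0;
     }
     static bool registry_exists() { return registry() != nullptr; }

   private:
     struct Registry {
       std::map<std::string, bool> tags;
       int ref_count = 0;
     };

     // Shared registry pointer; null while no handle exists.
     static Registry *&registry() {
       static Registry *instance = nullptr;
       return instance;
     }

     const std::pair<const std::string, bool> *_ptr = nullptr;
   };
   ```
   
   In this shape every handle's `_ptr` points into the shared map, so a handle 
is only safe to use while the registry it points into is still alive.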
   
   ### How the Crash Occurs
   
   1. **Multiple Threads/Contexts**: The application has DbgCtl objects across 
multiple threads. Some are static/global, some are thread-local, and some may 
be in function scopes.
   
   2. **Thread Exit During Execution**: When a thread exits mid-execution (not 
just at program exit), all its DbgCtl objects destruct:
      ```
      Thread A exits → its DbgCtl objects destruct → _rm_reference() called repeatedly
      ref_count: 17 → 16 → 15 → ... → 1 → 0
      ```
   
   3. **Registry Deleted Prematurely**: When the last DbgCtl from Thread A 
destructs, `ref_count` hits 0, and the registry (with its `std::map` containing 
all 229 tag entries) is deleted.
   
   4. **Dangling Pointers in Other Threads**: Thread B is still running and has 
DbgCtl objects with `_ptr` pointing into the now-deleted registry:
   
   ```cpp
      bool DbgCtl::on() const {
        // ...
        if (!_ptr->second) {  // ← Accessing freed memory!
          return false;
        }
        // ...
      }
   ```                                       
   
   5. **Crash**: Thread B calls `.on()` or `.tag()` on its DbgCtl → accesses 
freed memory → segmentation fault.
   
   ### Evidence from Diagnostics
   
   From the diagnostic logs during crash reproduction:
   
   ```
   Registry 1 (main traffic_server process):
     - Created with ref_count starting at 0
     - Grew to ref_count = 351 (351 DbgCtl objects created)
     - Contains 229 unique tags in the registry map
   
   During test execution (NOT at program exit):
     ref_count: 17 → 16 → ... → 2 → 1 → 0
     DEBUG: ~Registry() - deleting registry at 0xc6b930 with 229 entries
     DEBUG: ~Registry() - finished, registry deleted
   
     [Tests continue running...]
   
     [CRASH - traffic_crashlog invoked]
   ```
   
   The crash happens **during active test execution**, not during static 
destruction at program exit. This proves that:
   - Some threads/scopes had their DbgCtl objects destruct
   - The registry was deleted when their ref_count contributions were removed
   - Other threads were still running with dangling `_ptr` pointers
   
   ### Why Reference Counting Cannot Solve This
   
   The reference counting is working **exactly as designed**, but the design 
cannot handle this scenario:
   
   - **What it tracks**: Number of currently existing DbgCtl objects
   - **What it cannot track**: Whether those objects will be accessed before 
destruction
   - **The gap**: When Thread A's DbgCtls destruct and ref_count hits 0, Thread 
B's DbgCtls still exist but their `_ptr` members now point to freed memory
   
   No amount of careful reference counting can solve this because:
   1. C++ provides no way to know if a pointer will be dereferenced in the 
future
   2. Thread exit order is non-deterministic
   3. Static initialization order across translation units is unspecified, and 
destruction runs in the reverse of that order
   4. Some DbgCtl objects may be in shared libraries/plugins with different 
lifetimes
   
   ## Proposed Fix                              
   
   Use the Leaky Singleton pattern for the `std::map` registry and simply let 
the OS reclaim the memory on process shutdown. It has been verified that this 
addresses the crash.
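   
   A leaky singleton can be sketched as below. This is only an illustration of 
the pattern under assumed names (`TagMap`, `leaky_tag_registry()`, and 
`register_tag()` are hypothetical, not ATS functions): the map is allocated on 
first use and deliberately never deleted, so pointers into it cannot dangle 
for the lifetime of the process.
   
   ```cpp
   #include <cassert>
   #include <map>
   #include <string>

   using TagMap = std::map<std::string, bool>;

   // Leaky singleton: allocated on first call and intentionally never freed.
   // Initialization of the function-local static is thread-safe since C++11;
   // mutating the map from several threads would still need a lock, which
   // this sketch omits.
   TagMap &leaky_tag_registry() {
     static TagMap *const instance = new TagMap; // no matching delete, by design
     return *instance;
   }

   // Returns a stable pointer to the entry for `tag`, inserting it (disabled)
   // if it is not present yet.
   const TagMap::value_type *register_tag(const std::string &tag) {
     return &*leaky_tag_registry().emplace(tag, false).first;
   }
   ```
   
   Because no destructor ever runs for the map, static destruction order and 
thread exit order no longer matter for it; the trade-off is that leak checkers 
may need a suppression for this one allocation.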

