lengyuexuexuan opened a new pull request, #2368:
URL: https://github.com/apache/incubator-pegasus/pull/2368

   ### What problem does this PR solve? <!--add issue link with summary if 
exists-->
   #2211 
   
   ### What is changed and how does it work?
   1. Background   
       During duplication of a table, if the commands `dup remove/pause` are 
executed or `a balance operation` is performed at the same time, there is a 
chance that a node may core dump with signal `ID 11`. The core dump locations 
vary, but they all have one thing in common: they occur during memory 
allocation or deallocation.
   2. Analysis 
       Based on extensive testing, the following conclusions can be drawn:
       a. The issue only reproduces when there is write traffic. The difference 
between having and not having write traffic is: It adds the ship and 
load_private_log tasks.  
       b. The core dump occurs during the execution of `cancel_all()`.
       c. The issue occurs with low probability (approximately 1 in 100).
      
       Through analysis using ASAN (AddressSanitizer): 
   
[dup_remove_asan.txt](https://github.com/user-attachments/files/19956043/dup_remove_asan.txt)
       Based on ASAN analysis, the following conclusions can be drawn:
       a. The memory corruption occurs during the ship process. The mutations 
obtained from replaying the plog are passed to ship, leading to the issue.
       b. `_load_mutations` is captured by a lambda expression and then passed 
to a `std::function`. Since `std::move` is used, the lifetime of 
`_load_mutations` is tied to that of the `std::function`.
       c. The `cancel_all()` function is executed in the default thread pool. 
At this point, the following function is called. When the `std::function` is 
set to nullptr, it will release the memory it 
manages.https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.h#L341
       d. However, each task executes `exec_internal()` in its own thread pool, 
and eventually calls `release_ref()`, which results in delete this. 
https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.cpp#L224
   
   3. Conclusion  
       **1. Both `task.cancel()` and `task.exec_internal()` destruct the 
std::function member inside the task object. These two operations are executed 
in different threads, and there is no mechanism in place to prevent race 
conditions between them. As a result, it is possible for both threads to 
attempt to destruct the same std::function, which can lead to a double deletion 
of the memory associated with _load_mutations. This ultimately causes memory 
corruption.
       2. `_duplications` is accessed without proper synchronization in certain 
functions under multi-threaded scenarios, potentially causing race conditions.**
   
   4. Solution  
      1.  Lock the `_cb` callback to ensure that only one thread executes its 
destructor.
      2. Add locking to functions that access `_duplications` without 
synchronization to prevent concurrent access conflicts.
     
   ##### Tests <!-- At least one of them must be included. -->
   - Manual test 
   **The changes have been production-validated at Xiaomi, running stably on 
more than 30 clusters for over six months, confirming that they resolve the 
concurrency issues described above.**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to