lengyuexuexuan opened a new pull request, #2368:
URL: https://github.com/apache/incubator-pegasus/pull/2368
### What problem does this PR solve? <!--add issue link with summary if
exists-->
#2211
### What is changed and how does it work?
1. Background
During duplication of a table, if the commands `dup remove/pause` are
executed or `a balance operation` is performed at the same time, there is a
chance that a node may core dump with signal `ID 11`. The core dump locations
vary, but they all have one thing in common: they occur during memory
allocation or deallocation.
2. Analysis
Based on extensive testing, the following conclusions can be drawn:
a. The issue only reproduces when there is write traffic. The difference
between having and not having write traffic is: It adds the ship and
load_private_log tasks.
b. The core dump occurs during the execution of `cancel_all()`.
c. The issue occurs with low probability (approximately 1 in 100).
Through analysis using ASAN (AddressSanitizer):
[dup_remove_asan.txt](https://github.com/user-attachments/files/19956043/dup_remove_asan.txt)
Based on ASAN analysis, the following conclusions can be drawn:
a. The memory corruption occurs during the ship process. The mutations
obtained from replaying the plog are passed to ship, leading to the issue.
b. `_load_mutations` is captured by a lambda expression and then passed
to a `std::function`. Since `std::move` is used, the lifetime of
`_load_mutations` is tied to that of the `std::function`.
c. The `cancel_all()` function is executed in the default thread pool.
At this point, the following function is called. When the `std::function` is
set to nullptr, it will release the memory it
manages.https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.h#L341
d. However, each task executes `exec_internal()` in its own thread pool,
and eventually calls `release_ref()`, which results in delete this.
https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.cpp#L224
3. Conclusion
**1. Both `task.cancel()` and `task.exec_internal()` destruct the
std::function member inside the task object. These two operations are executed
in different threads, and there is no mechanism in place to prevent race
conditions between them. As a result, it is possible for both threads to
attempt to destruct the same std::function, which can lead to a double deletion
of the memory associated with _load_mutations. This ultimately causes memory
corruption.
2. `_duplications` is accessed without proper synchronization in certain
functions under multi-threaded scenarios, potentially causing race conditions.**
4. Solution
1. Lock the `_cb` callback to ensure that only one thread executes its
destructor.
2. Add locking to functions that access `_duplications` without
synchronization to prevent concurrent access conflicts.
##### Tests <!-- At least one of them must be included. -->
- Manual test
**The changes have been production-validated at Xiaomi, running stably on
more than 30 clusters for over six months, confirming that they resolve the
concurrency issues described above.**
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]