vchag commented on PR #64040: URL: https://github.com/apache/doris/pull/64040#issuecomment-4651161334
> Thank you for your contribution to Apache Doris. Don't know what should be done next? See [How to process your PR](https://cwiki.apache.org/confluence/display/DORIS/How+to+process+your+PR).
>
> Please clearly describe your PR:
>
> 1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.

BE nodes crash with a segmentation fault (SIGSEGV) under sustained high-throughput ingestion. The crash occurs inside bvar::SamplerCollector::run() and is caused by a race condition in brpc 1.4.0's AgentCombiner: when a thread exits while SamplerCollector is iterating the agent list, the collector dereferences already-freed memory. At high EPS, the 28 global bvar::Adder<int64_t> instances in metadata_adder.h are updated tens of thousands of times per second across many worker threads, which makes the race reliably reproducible. Any single BE exceeding ~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.

The fix (a backport of https://github.com/apache/brpc/pull/2949) replaces the raw back-pointer from Agent to AgentCombiner with a weak_ptr, and makes the owning classes hold the combiner via shared_ptr. The agent destructor now calls combiner.lock(); if the combiner is already destroyed, lock() returns null and the destructor safely no-ops, eliminating the use-after-free.

> 2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.

Previously, each Agent held a raw back-pointer to its AgentCombiner, so the agent destructor could touch a combiner that had already been freed. Now the Agent holds a weak_ptr and the owning classes hold the combiner via shared_ptr, so the destructor only uses the combiner when combiner.lock() returns a live object. The change was made to eliminate the use-after-free described in item 1.

> 3. What features were added. Why was this function added?

No feature or functionality was added. Consider this a third-party library update.

> 4. Which code was refactored and why was this part of the code refactored?

A new third-party patch was added to backport a change made to the brpc library that addresses the bug described above.

> 5. Which functions were optimized and what is the difference before and after the optimization?

These changes ensure that Doris continues to function as expected under a high ingestion rate (400-500).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]

---------------------------------------------------------------------
For additional commands, e-mail: [email protected]
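
For reviewers, the ownership change described in the comment above can be sketched roughly as follows. This is a minimal, single-threaded illustration of the weak_ptr back-reference pattern only, not the brpc patch itself: the class names `Combiner` and `Agent`, their members, and the helper `agent_outlives_combiner()` are hypothetical stand-ins, and the real AgentCombiner has additional locking and per-thread storage that is not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the combiner: tracks the value slots of live agents.
class Combiner {
public:
    void add(std::int64_t* slot) {
        std::lock_guard<std::mutex> guard(mu_);
        slots_.push_back(slot);
    }
    void remove(std::int64_t* slot) {
        std::lock_guard<std::mutex> guard(mu_);
        slots_.erase(std::remove(slots_.begin(), slots_.end(), slot), slots_.end());
    }
    std::int64_t sum() const {
        std::lock_guard<std::mutex> guard(mu_);
        std::int64_t total = 0;
        for (const std::int64_t* slot : slots_) total += *slot;
        return total;
    }
    std::size_t live_agents() const {
        std::lock_guard<std::mutex> guard(mu_);
        return slots_.size();
    }

private:
    mutable std::mutex mu_;
    std::vector<std::int64_t*> slots_;
};

// Hypothetical per-thread agent. It holds a weak_ptr to the combiner instead of
// a raw pointer, so its destructor can detect that the combiner is already gone.
class Agent {
public:
    explicit Agent(const std::shared_ptr<Combiner>& combiner) : combiner_(combiner) {
        assert(combiner != nullptr);
        combiner->add(&value_);
    }
    Agent(const Agent&) = delete;
    Agent& operator=(const Agent&) = delete;

    ~Agent() {
        // lock() yields an empty shared_ptr once the combiner has been destroyed,
        // in which case there is nothing to unregister from.
        if (std::shared_ptr<Combiner> combiner = combiner_.lock()) {
            combiner->remove(&value_);
        }
    }

    void add(std::int64_t delta) { value_ += delta; }
    bool combiner_alive() const { return !combiner_.expired(); }

private:
    std::weak_ptr<Combiner> combiner_;
    std::int64_t value_ = 0;
};

// Destroys the combiner before the agent; with a raw back-pointer the agent's
// destructor would touch freed memory, here it simply does nothing.
bool agent_outlives_combiner() {
    auto combiner = std::make_shared<Combiner>();
    auto agent = std::make_unique<Agent>(combiner);
    agent->add(5);
    combiner.reset();                        // owner releases the combiner first
    bool expired = !agent->combiner_alive();
    agent.reset();                           // destructor sees an expired weak_ptr
    return expired;
}
```

The point of the sketch is the destructor: because the owner holds the combiner through a shared_ptr and the agent only observes it through a weak_ptr, the order in which the two objects are destroyed no longer matters.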
