vchag commented on PR #64040:
URL: https://github.com/apache/doris/pull/64040#issuecomment-4651161334

   > Thank you for your contribution to Apache Doris. Don't know what should be 
done next? See [How to process your 
PR](https://cwiki.apache.org/confluence/display/DORIS/How+to+process+your+PR).
   > 
   > Please clearly describe your PR:
   > 
   > 1. What problem was fixed (it's best to include specific error reporting 
information). How it was fixed.
   BE nodes crash with a segmentation fault (SIGSEGV) under sustained 
high-throughput ingestion. The crash occurs inside 
bvar::SamplerCollector::run() and is caused by a race condition in brpc 1.4.0's 
AgentCombiner: when a thread exits while SamplerCollector is iterating the 
agent list, it dereferences already-freed memory.
   
   At high EPS, the 28 global bvar::Adder<int64_t> instances in 
metadata_adder.h are updated tens of thousands of times per second across many 
worker threads, making this race reliably reproducible. Any single BE exceeding 
~15–20K EPS is at risk, and multiple BEs typically crash within 30 minutes.
   
   The fix (backport of https://github.com/apache/brpc/pull/2949) replaces the
   raw back-pointer from Agent to AgentCombiner with a weak_ptr, and makes the
   owning classes hold the combiner via shared_ptr. The agent destructor now
   calls combiner.lock(): if the combiner is already destroyed, lock() returns
   null and the destructor safely no-ops, eliminating the use-after-free.
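
   The lifetime pattern described above can be sketched roughly as follows. This is a minimal illustration only, not the actual brpc code or the backported patch: the classes `Combiner`, `Agent`, and `Adder` and the helper functions are hypothetical, simplified stand-ins, and the sketch covers only the weak_ptr/shared_ptr ownership change, not the SamplerCollector iteration.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical, simplified stand-in for a combiner; NOT the real bvar class.
class Combiner {
public:
    void commit(long value) {
        std::lock_guard<std::mutex> guard(mutex_);
        total_ += value;
    }
    long total() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return total_;
    }
private:
    mutable std::mutex mutex_;
    long total_ = 0;
};

// Hypothetical per-thread agent. It keeps a weak_ptr instead of a raw
// back-pointer, so it can detect that the combiner is already gone.
class Agent {
public:
    explicit Agent(const std::shared_ptr<Combiner>& combiner)
        : combiner_(combiner) {}
    ~Agent() {
        // lock() returns an empty shared_ptr if the combiner was destroyed,
        // in which case the destructor does nothing.
        if (std::shared_ptr<Combiner> c = combiner_.lock()) {
            c->commit(local_);
        }
    }
    void add(long v) { local_ += v; }
private:
    std::weak_ptr<Combiner> combiner_;
    long local_ = 0;
};

// Hypothetical owning class: holds the combiner via shared_ptr.
class Adder {
public:
    Adder() : combiner_(std::make_shared<Combiner>()) {}
    std::unique_ptr<Agent> make_agent() const {
        return std::make_unique<Agent>(combiner_);
    }
    long value() const { return combiner_->total(); }
private:
    std::shared_ptr<Combiner> combiner_;
};

// Agent is destroyed while the owner is alive: its value is flushed.
long agent_exits_first() {
    Adder adder;
    {
        auto agent = adder.make_agent();
        agent->add(5);
    }
    return adder.value();
}

// Owner is destroyed first: the agent destructor must not touch freed memory.
bool owner_destroyed_first() {
    auto adder = std::make_unique<Adder>();
    auto agent = adder->make_agent();
    agent->add(7);
    adder.reset();  // last shared_ptr released, combiner destroyed
    agent.reset();  // lock() yields an empty pointer, destructor no-ops
    return true;
}
```

   With a raw back-pointer, the second case would dereference a destroyed combiner; with the weak_ptr, the agent simply skips the flush.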
   
   
   
   > 2. Which behaviors were modified. What was the previous behavior, what is 
it now, why was it modified, and what possible impacts might there be.
   
   Previously, each Agent held a raw back-pointer to its AgentCombiner. Now the
   Agent holds a weak_ptr, and the owning classes hold the combiner via
   shared_ptr (backport of https://github.com/apache/brpc/pull/2949). The agent
   destructor calls combiner.lock() and safely no-ops when the combiner has
   already been destroyed, which eliminates the use-after-free described in
   item 1.
   
   
   > 3. What features were added. Why was this function added?
   No features or functionality were added. Consider this a third-party library update.
   
   > 4. Which code was refactored and why was this part of the code refactored?
   No code was refactored. A new third-party patch was added to backport a
   change made to the brpc library that addresses the bug described above.
   
   > 5. Which functions were optimized and what is the difference before and 
after the optimization?
   No functions were optimized. These changes ensure that Doris continues to
   function as expected under a high ingestion rate (400-500).
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

