Hi Doris community,
We're an engineering team at Cylake running Doris on a high-throughput security analytics platform. At sustained 400–500K EPS (group_commit sync_mode, DUPLICATE KEY table, 20 BEs), we've hit a reproducible BE crash traced to a known race in brpc 1.4.0. It's reproducible enough to be a recurring issue for us.

Issue:

```
Segmentation fault (core dumped)
0# doris::signal::(anonymous namespace)::FailureSignalHandler at be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler in /usr/lib/jvm/java/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java/lib/server/libjvm.so
3# 0x00007F9881299520 in /lib/x86_64-linux-gnu/libc.so.6
4# bvar::Reducer<long, bvar::detail::AddTo<long>, bvar::detail::MinusFrom<long>>::SeriesSampler::take_sample() at thirdparty/installed/include/bvar/reducer.h:79
5# bvar::detail::SamplerCollector::run() in /opt/apache-doris/be/lib/doris_be
```

This is reproducible on any BE exceeding ~15–20K EPS, with multiple BEs typically crashing within 30 minutes.

Root cause:

metadata_adder.h declares 28 named global bvar::Adder<int64_t> instances, each touched on every metadata-object construction and destruction (rowsets, segments, column readers, etc.). At high EPS these fire tens of thousands of times per second across many worker threads. The underlying AgentCombiner in brpc 1.4.0 has a race: when a thread exits while SamplerCollector is concurrently iterating the Agent list, it dereferences freed memory, causing a SIGSEGV.

The fix already exists upstream. brpc PR #2949 (02703638aa9ac68b68350d025afc10e0b20d8371, merged April 17, 2025; https://github.com/apache/brpc/pull/2949) converts _combiner to a shared_ptr, eliminating the race. Doris 4.0.4 pins brpc 1.4.0 (thirdparty/vars.sh:208-209), which predates this fix.

Our ask:

We'd like to propose bumping the brpc pin in thirdparty/vars.sh to a release containing PR #2949.
A version bump feels cleaner than carrying a Doris-side patch; it avoids drift from upstream and picks up other fixes landed since 1.4.0, though we recognize there are tradeoffs either way. We'd be glad to contribute the PR and welcome the community's feedback on the preferred path forward, including any thoughts on a target brpc version.

Thanks for building a great system. We are looking forward to the discussion.

Best Regards,
Venkata Kaushik Chaganti
Engineering, Cylake
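
P.S. For anyone less familiar with this class of bug, here is a minimal C++ sketch of the general idea behind a shared_ptr-based fix: a per-thread agent and a background sampler each hold their own shared reference to the combiner, so it cannot be freed while either still uses it. This is not brpc's actual code. `Combiner`, `Agent`, and `run_demo` are hypothetical names made up for illustration, and the real AgentCombiner is considerably more involved.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-variable combiner (NOT brpc's real class).
struct Combiner {
    std::atomic<int64_t> total{0};
};

// Hypothetical per-thread agent. It shares ownership of the combiner, so the
// combiner stays alive for as long as any agent still refers to it.
struct Agent {
    std::shared_ptr<Combiner> combiner;
    int64_t local = 0;

    explicit Agent(std::shared_ptr<Combiner> c) : combiner(std::move(c)) {}

    // When the owning thread finishes, fold the local count into the combiner.
    ~Agent() { combiner->total.fetch_add(local, std::memory_order_relaxed); }
};

// Starts worker threads that count locally and exit, plus a sampler thread
// that keeps reading the combiner. Returns the combined total at the end.
int64_t run_demo(int num_threads, int increments_per_thread) {
    assert(num_threads >= 0 && increments_per_thread >= 0);

    auto combiner = std::make_shared<Combiner>();
    std::shared_ptr<Combiner> sampler_ref = combiner;  // sampler's own reference

    std::atomic<bool> stop{false};
    std::thread sampler([sampler_ref, &stop] {
        while (!stop.load(std::memory_order_acquire)) {
            // Safe: this thread's shared_ptr copy keeps the combiner alive.
            (void)sampler_ref->total.load(std::memory_order_relaxed);
            std::this_thread::yield();
        }
    });

    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([combiner, increments_per_thread] {
            Agent agent(combiner);
            for (int j = 0; j < increments_per_thread; ++j) ++agent.local;
        });  // agent is destroyed here, while the sampler may still be reading
    }

    // The original owner drops its reference while other threads are running.
    // With a raw pointer this is where a use-after-free could begin.
    combiner.reset();

    for (auto& w : workers) w.join();
    stop.store(true, std::memory_order_release);
    sampler.join();
    return sampler_ref->total.load(std::memory_order_relaxed);
}
```

Calling `run_demo(8, 1000)` should return 8000 regardless of how the threads interleave, because every party that touches the combiner owns a reference to it.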
