Hi Doris community,

We're an engineering team at Cylake running Doris on a high-throughput
security analytics platform. At a sustained 400–500K events per second (EPS)
with group_commit sync_mode, a DUPLICATE KEY table, and 20 BEs, we keep
hitting a reproducible BE crash that we traced to a known race in brpc 1.4.0.

Issue:

```
Segmentation fault (core dumped)
0# doris::signal::(anonymous namespace)::FailureSignalHandler at be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler in /usr/lib/jvm/java/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java/lib/server/libjvm.so
3# 0x00007F9881299520 in /lib/x86_64-linux-gnu/libc.so.6
4# bvar::Reducer<long, bvar::detail::AddTo<long>, bvar::detail::MinusFrom<long>>::SeriesSampler::take_sample() at thirdparty/installed/include/bvar/reducer.h:79
5# bvar::detail::SamplerCollector::run() in /opt/apache-doris/be/lib/doris_be
```

This is reproducible on any BE exceeding ~15–20K EPS, with multiple BEs
typically crashing within 30 minutes.

Root cause:
metadata_adder.h declares 28 named global bvar::Adder<int64_t> instances,
which are updated on every metadata-object construction and destruction
(rowsets, segments, column readers, etc.). At high EPS these updates happen
tens of thousands of times per second across many worker threads. The
underlying AgentCombiner in brpc 1.4.0 has a race: when a thread exits while
SamplerCollector is concurrently iterating the Agent list, the sampler
dereferences freed memory, causing the SIGSEGV shown above.

The fix already exists upstream.
brpc PR #2949 (02703638aa9ac68b68350d025afc10e0b20d8371, merged April 17,
2025; https://github.com/apache/brpc/pull/2949) converts _combiner to a
shared_ptr, eliminating the race. Doris 4.0.4 pins brpc 1.4.0
(thirdparty/vars.sh:208-209), which predates this fix.
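
To illustrate why shared ownership removes this class of bug, here is a
minimal, generic sketch. It is not the brpc patch and does not use brpc's
types: Combiner, Sampler, and sample_after_owner_release are hypothetical
names invented for this example. It only shows that when the sampler holds
its own shared_ptr, the object stays valid after the original owner and a
worker thread have both released their references.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for the state behind a bvar-style counter.
// This is NOT brpc's AgentCombiner; it only models "something the sampler reads".
class Combiner {
public:
    void add(int64_t v) { total_.fetch_add(v, std::memory_order_relaxed); }
    int64_t combine() const { return total_.load(std::memory_order_relaxed); }

private:
    std::atomic<int64_t> total_{0};
};

// Hypothetical sampler that co-owns the combiner through a shared_ptr.
// With a raw pointer here, take_sample() could read freed memory once
// every other owner has gone away.
class Sampler {
public:
    explicit Sampler(std::shared_ptr<Combiner> c) : combiner_(std::move(c)) {}
    int64_t take_sample() const { return combiner_->combine(); }

private:
    std::shared_ptr<Combiner> combiner_;
};

// The owner and a worker thread both drop their references; the sampler's
// reference keeps the combiner alive, so the final sample is still valid.
int64_t sample_after_owner_release() {
    auto owner = std::make_shared<Combiner>();
    Sampler sampler(owner);

    // The worker captures its own copy of the shared_ptr.
    std::thread worker([c = owner] {
        for (int i = 0; i < 1000; ++i) {
            c->add(1);
        }
    });

    owner.reset();   // the original owner releases its reference early
    worker.join();   // the worker has exited and released its copy

    return sampler.take_sample();  // safe: the sampler still holds a reference
}
```

Calling sample_after_owner_release() returns 1000 in this sketch, because
the worker's 1000 increments are all visible after join() and the combiner
is only destroyed when the sampler itself goes out of scope.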

Our ask:
We'd like to propose bumping the brpc pin in thirdparty/vars.sh to a
release containing PR #2949. A version bump feels cleaner than carrying a
Doris-side patch; it avoids drift from upstream and picks up other fixes
landed since 1.4.0, though we recognize there are tradeoffs either way.
We'd be glad to contribute the PR and welcome the community's feedback on
the preferred path forward, including any thoughts on a target brpc version.


Thanks for building a great system. We are looking forward to the
discussion.


Best Regards,
Venkata Kaushik Chaganti
Engineering, Cylake
