ojalberts-itc commented on issue #64708:
URL: https://github.com/apache/doris/issues/64708#issuecomment-4778070098
## Update: the wedge is **not OS-related**, reproduced on Ubuntu 22.04 (Doris 4.0.6)
To exclude the operating system as a factor, we reproduced this wedge on a
**second, different OS**.
We previously reported it on **Amazon Linux 2023**; we have now reproduced
the *same* wedge, with
the *same* in-process signature, on **Ubuntu 22.04 LTS**, using **identical
infrastructure** — same
Terraform, instance types, topology, security groups (incl. the cause-#1
BE→BE `:8040` self-ingress
rule), `be.conf`/`fe.conf`, replication factor, and Doris build. **The only
variable changed is the
host OS.** The wedge is present on both. **The OS is not the cause** —
consistent with an upstream
brpc/Doris load-stream defect.
This separates **existence** (4.0.6 exhibits the wedge — now on two OSes)
from **trigger** (an
accumulated cluster under multi-replica load); the trigger evidence is
unchanged from the prior
4.0.6/4.1.2 reports.
### Version
`doris-4.0.6-rc02-1663f25c16f` — official x64 (AVX2) GA tarball. Build path
in the stacks is
`/home/zcp/repo_center/doris_release/doris/be/src/...`
(`ldb-toolchain-v0.26`, gcc 15).
### Environment (identical to the Amazon Linux run except the OS)
- 3 FE + 4 BE, AWS `r8i.2xlarge` (8 vCPU / 64 GiB), coupled
(storage-compute) mode, replication = **3**.
- `experimental_enable_single_replica_insert` = **false the whole time** (no
toggle — see confounds).
- **OS: Ubuntu 22.04.5 LTS, kernel 6.8.0-1057-aws** (vs Amazon Linux 2023,
kernel 6.1, in the prior run).
### Trigger / preconditions (repro-survivable)
One precondition up front, so that a passing fresh-cluster smoke test is not read as disproving this:
native ingest and a
2.8B-row federated read **passed at deploy** (03:21–03:22 UTC), i.e. the
`:8060` load path worked
when the cluster was new. The wedge then appeared **~20 min later**, after
accumulated repl=3
cross-BE activity (native ingest + internal `audit_log` stream-loads + the
test's reads/writes) —
**before** any heavy bulk build. It is **persistent**: still wedged **~6
hours after onset** (a
repl=3 write canary wedged **4/6** at +6 h; ~6,000 `[E1008]` lines on
`:8060` in the first 25 min,
still ~212 per 15 min hours later). It is *intermittent at the cluster
level* (a given repl=3 write
wedges only when its tablet replica set includes a wedged BE) but
**persistent on the affected BEs**,
and it **spreads** BE→BE over time. This matches the known "accumulated
cluster" precondition;
existence on Ubuntu is the new fact here.
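
The cluster-level intermittency can be illustrated with a little counting. This is a minimal sketch, not a measurement: it assumes (hypothetically) that a tablet's three replicas land on any 3-of-4 subset of the BEs with equal likelihood, and that a write wedges exactly when its replica set contains a wedged BE. The function name and the BE labels are ours, not Doris identifiers.

```python
from itertools import combinations

def wedged_write_fraction(backends, wedged, replication=3):
    """Fraction of possible replica sets that contain at least one wedged BE.

    Assumption: every `replication`-sized subset of `backends` is an equally
    likely replica set. Real tablet placement may not be uniform.
    """
    replica_sets = list(combinations(sorted(backends), replication))
    hit = sum(1 for rs in replica_sets if wedged & set(rs))
    return hit / len(replica_sets)

backends = {"be1", "be2", "be3", "be4"}  # placeholder labels for the 4 BEs
print(wedged_write_fraction(backends, {"be1"}))         # 0.75
print(wedged_write_fraction(backends, {"be1", "be2"}))  # 1.0
```

Under those assumptions a single wedged BE already touches three of the four possible replica sets, which is in the same range as the 4/6 canary result above, and a second wedged BE leaves no unaffected replica set.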
### The parked write thread — same `LoadStreamStub::open`, now on Ubuntu
Doris's own `be.WARNING` prints the stack at the failure point:
```text
open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>: [E1008]Reached timeout=60000ms @10.0.0.133:8060
0# doris::LoadStreamStub::open(...)  be/src/vec/sink/load_stream_stub.cpp:208
1# doris::LoadStreamStubs::open(...)  be/src/vec/sink/load_stream_stub.cpp:574
2# doris::VTabletWriterV2::_open_streams_to_backend(...)  be/src/vec/sink/writer/vtablet_writer_v2.cpp:317
3# doris::VTabletWriterV2::_open_streams()  be/src/vec/sink/writer/vtablet_writer_v2.cpp:298
4# doris::VTabletWriterV2::open(...) → AsyncResultWriter::process_block → ThreadPool::dispatch_thread
```
### `[E1008]` on `:8060` across all 4 BEs (parks growing to 538s / 900s)
The broken load-stream socket is never revived; the cancel propagates as
`failed to write enough replicas`:
```text
load_stream_stub.cpp:369 LoadStreamStub ... src_id=...884, dst_id=...881 is cancelled because of [INTERNAL_ERROR]Failed to connect to backend ...881: [E110]... Socket{...addr=10.0.0.43:8060...}: Connection timed out
load_stream_stub.cpp:591 open stream failed: ...[E112]Not connected to 10.0.0.43:8060 yet, server_id=217
vtablet_writer_v2.cpp:321 failed to open stream to backend ...881
brpc_closure.h:128 RPC meet failed: [E1008]Reached timeout=537999ms @10.0.0.133:8060
brpc_closure.h:128 RPC meet failed: [E1008]Reached timeout=900000ms @10.0.0.53:8060
fragment_mgr.cpp:1177/1198 brpc stub: 10.0.0.118:8060 check failed [E1008] → remove brpc stub from cache
stream_load_executor.cpp:107 ... node=10.0.0.133:8060, open failed, failed to open tablet writer, error=RPC call is timed out, [E1008]Reached timeout=60000ms @10.0.0.133:8060 ... elapse(s)=64
```
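
For anyone repeating the per-peer count, here is a rough sketch of how the `[E1008]` lines can be tallied from `be.WARNING`. It assumes only the `[E1008]Reached timeout=<N>ms @<ip>:<port>` shape visible in the excerpt above; the helper name is ours, and the sample text is cut down from that excerpt.

```python
import re
from collections import Counter

# Matches the "[E1008]Reached timeout=<N>ms @<ip>:<port>" fragment quoted above.
# \s* tolerates a line wrap between the timeout and the peer address.
E1008 = re.compile(
    r"\[E1008\]Reached timeout=(\d+)ms\s*@(\d+\.\d+\.\d+\.\d+):(\d+)"
)

def tally_e1008(text):
    """Return (hits per peer, longest timeout in ms per peer)."""
    counts = Counter()
    longest = {}
    for m in E1008.finditer(text):
        peer = f"{m.group(2)}:{m.group(3)}"
        timeout_ms = int(m.group(1))
        counts[peer] += 1
        longest[peer] = max(longest.get(peer, 0), timeout_ms)
    return counts, longest

sample = """\
brpc_closure.h:128 RPC meet failed: [E1008]Reached timeout=537999ms @10.0.0.133:8060
brpc_closure.h:128 RPC meet failed: [E1008]Reached timeout=900000ms @10.0.0.53:8060
error=RPC call is timed out, [E1008]Reached timeout=60000ms @10.0.0.133:8060
"""
counts, longest = tally_e1008(sample)
print(dict(counts))  # {'10.0.0.133:8060': 2, '10.0.0.53:8060': 1}
print(longest)       # {'10.0.0.133:8060': 537999, '10.0.0.53:8060': 900000}
```

Running the same function over successive time windows of the log would give a persistence rate like the one quoted earlier.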
### Workers parked (not saturated), zero-clone, all `Alive=true`
brpc `:8060` `/vars` at the wedge: `bthread_worker_usage` **≈ 1.1** (of 256
workers),
`bthread_count` 5, `load_stream_count` 1–3 — threads blocked on the RPC, not
starved. FE
`SHOW PROC '/cluster_balance/{running,pending}_tablets'` was **empty**
(zero-clone — isolated from
the cause-#1 tablet-clone path). `SHOW BACKENDS` showed all 4 BEs
`Alive=true` throughout (the 9050
heartbeat is a separate threadpool). Only a **simultaneous full-BE restart**
clears it; a single-BE
restart does not.
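
To make the "parked, not saturated" reading concrete, here is a small sketch that turns a `/vars` dump into a worker-utilisation figure. It assumes the plain-text `name : value` layout of the brpc `/vars` page; the helper names are ours, and the sample values are the ones reported above.

```python
def parse_vars(text):
    """Parse 'name : value' lines (assumed /vars text layout) into a dict of strings."""
    out = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:  # skip lines that do not look like 'name : value'
            out[name.strip()] = value.strip()
    return out

def worker_utilisation(vars_, total_workers):
    """Busy bthread workers as a fraction of the configured worker count."""
    return float(vars_["bthread_worker_usage"]) / total_workers

sample = "bthread_worker_usage : 1.1\nbthread_count : 5\nload_stream_count : 3\n"
v = parse_vars(sample)
print(round(worker_utilisation(v, 256), 4))  # 0.0043
```

Roughly 0.4% of the workers busy is what we mean by threads blocked on the RPC rather than starved.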
### What we ruled out on Ubuntu, with positive evidence (captured while wedged)
Each environmental cause was excluded by a direct test on the wedged Ubuntu cluster,
so "it's your network/OS" is answered up front:
- **Raw L4 to `:8060` is OPEN** — `cat >/dev/tcp/<peer>/8060` succeeds to
every peer; 11–14
**ESTABLISHED** `:8060` sockets per BE in `ss`. The stall is in the brpc
application layer, not TCP.
- **No host firewall** — `ufw` inactive; `nftables` ruleset empty;
`iptables` default-`ACCEPT` only.
- **conntrack not exhausted** — `nf_conntrack_count` 0 / max 1048576.
- **No Nitro/ENA throttle** — `bw_in/out_allowance_exceeded`,
`pps_allowance_exceeded`,
`conntrack_allowance_exceeded`, `linklocal_allowance_exceeded` all **0**
on all 4 BEs.
- **No SELinux** (Ubuntu); AppArmor enabled with **no Doris profile** loaded.
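
The raw L4 check in the first bullet can be repeated without a shell. This is a minimal Python equivalent of the `/dev/tcp` probe, offered as a sketch: the function names and the 3-second timeout are our choices, and a successful TCP connect says nothing about the brpc layer above it.

```python
import socket

def l4_open(host, port, timeout=3.0):
    """Return True if a plain TCP connect to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

def probe_peers(peers, port=8060):
    """Map each peer address to the result of a TCP connect on `port`."""
    return {peer: l4_open(peer, port) for peer in peers}
```

Calling `probe_peers([...])` from each BE with the other BEs' addresses should return `True` for every peer if the result above holds.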
### Confounds (disclosed) and why OS is excluded
- **No mitigation-toggle confound** here: `single_replica_insert` was
`false` **from t=0**, never
toggled (the original 4.0.6 occurrence had a toggle ~2 min prior; this run
removes that variable).
- **OS is the only delta** between this run and the Amazon Linux run
(identical Terraform/config/build).
The wedge reproduces in two environments that differ only in the OS, so the OS is
not the cause.
- **`enable_brpc_connection_check=true` does not fix it** (it was on
throughout): it detects the
broken stub and evicts it from cache (`remove brpc stub from cache`), but
the socket never revives.
### What we expected
A BE-to-BE load-stream `brpc` socket that goes `Broken`/times out should be
detected and
**re-established** so writes recover without operator intervention; instead
it is never revived and
only a full-fleet BE restart clears it.
### Questions
1. With the OS now excluded by a second-OS reproduction (Ubuntu 22.04 +
Amazon Linux 2023, identical
infra), does this confirm the failure is in the brpc load-stream layer
rather than the environment?
2. Is this the brpc **#1168** class (socket `Broken`, never revived), and is
there a fixing PR or a
Doris version that resolves it?
3. Should `enable_brpc_connection_check` be able to recover this socket, or
is eviction-without-revive
the expected (insufficient) behaviour here?
## Willing to submit a PR
No, but **happy to test a patch** or run any targeted capture you want on
the live wedged cluster.
## What we captured / can provide
Attached `64708-ubuntu-4.0.6-evidence.tar.gz`
(sha256 `03896c90b393d8577f007ebffe69bb82b0831fb9ba1697d6a9dd000c7fab2950`):
per-BE `be.WARNING`
load-stream failures + brpc `:8060` `/vars`, the parked-stack backtrace, FE
cluster-state
(`Alive=true` + empty `cluster_balance`), the environment-exclusion probes,
and the persistence
timeline. Full per-BE `gdb 'thread apply all bt'` dumps, `/vars`, `/rpcz`,
and complete `be.WARNING`
are available on request.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]