ojalberts-itc commented on issue #64708: URL: https://github.com/apache/doris/issues/64708#issuecomment-4772890573
## Update — retested on 4.1.2; could not reproduce on a fresh install with the SG correct from t=0 (on either 4.1.2 *or* 4.0.6) **TL;DR.** We upgraded to **Apache Doris 4.1.2** and re-ran our reproduction. 4.1.2 did **not** wedge. But when we ran the **identical** test against a **fresh 4.0.6** cluster — the exact build this issue was filed against — it **also did not wedge**. The decisive difference from our original incident is not the Doris version: it is that both new clusters were built **from clean infrastructure with our cause-#1 fix (the BE→BE `webserver_port`/8040 self-ingress) present from t=0**. We can no longer reproduce the brpc load-stream wedge on a freshly-provisioned, correctly-configured cluster. We therefore **cannot yet attribute the clean 4.1.2 result to an upstream fix** — and we want to be honest about that rather than report a false "fixed". ### What we did - Re-deployed the same coupled-mode topology (3 FE + 4 BE, 8 vCPU / 64 GiB nodes, repl=3, local gp3 BE storage) on **fresh instances with fresh empty data volumes**, first on **4.1.2** (`apache-doris-4.1.2-bin-x64.tar.gz`, build tag `doris-4.1.2-rc01`), then a **control** on **4.0.6** (`doris-4.0.6-rc02-…`, the build this issue was filed against). - Held everything but the Doris version constant: same infrastructure, same SG rules (incl. the cause-#1 8040 self-ingress **present from first boot**), `enable_brpc_connection_check=true` in `be.conf`, and **`experimental_enable_single_replica_insert` left unset** so the cause-#3 multi-replica path is fully exercised. ### How we tested (identical protocol on both versions) The workload is the heaviest multi-replica writer we have: build 4 native UNIQUE-KEY merge-on-write tables (`replication_num=3`, `DISTRIBUTED BY HASH(...) BUCKETS 16`) by `INSERT … SELECT` from an Iceberg (S3 Tables) external catalog plus `UPDATE … FROM` maintenance passes — the same shape of load that originally wedged us. 1. **Sustained build** — ~26 min of continuous repl=3 writes, ~222M rows across the 4 tables. 2. **Idle-watch** — a repl=3 write probe every ~40 s for 16 min after the build, covering the 8–11 min post-load window where our original wedge fired. 3. **Rapid sequential churn** — 60 fast repl=3 multi-replica `INSERT`s (1–3 s each), the original "fast small load" shape. 4. **Concurrent churn** — 4 parallel writers (~100 concurrent repl=3 loads), the concurrent shape that wedged us once before. Per-BE instrumentation each run (the same signals from the original capture): | Signal | Original wedge (4.0.6) | This retest — 4.1.2 | This retest — 4.0.6 control | | --- | --- | --- | --- | | `open_load_stream` handler entries during window | **0** (handler never entered) | ~5,100 cluster-wide | ~4,000 cluster-wide | | `[E1008]Reached timeout … @<be>:8060` (incl. loopback) | present | **0** | **0** | | brpc `:8060` "Broken" / never-revived socket | present | **0** | **0** | | `SHOW BACKENDS` Alive | true (misleading) | true | true | | Write path | wedged (full BE restart to recover) | healthy throughout | healthy throughout | On 4.1.2 the `open_load_stream` handler was entered ~5,100 times with zero brpc timeouts and zero broken sockets — a clean inversion of the original signature (where the handler was entered **zero** times during the stall and `[E1008]` loopback timeouts dominated). The 4.0.6 control behaved the same. ### Why we think it didn't reproduce In our original incident the reliable wedges all occurred on clusters carrying **accumulated degradation history**: the cause-#1 8040 self-ingress was *missing* at first, which drove an unbounded clone-repair storm, and we had also cycled single-BE restarts. The brpc load-stream socket went "Broken" against that degraded backdrop. On a **fresh** cluster with the 8040 rule in place from t=0, the clone storm never happens and we have not been able to drive the BE into the "Broken socket" state — on **either** 4.0.6 or 4.1.2. This suggests cause #3, as we observed it, may not be reachable independently of the cause-#1 degradation, rather than being a pure function of the steady-state multi-replica write workload. ### Honest limitations - **Scale.** This retest was at POC scale (one reporting period, 4 of our domains, ~222M output rows, ~26 min). Our production load is ~50× larger and runs for hours. Because the wedge is an *accumulation* failure, we cannot rule out that it only manifests under the full-scale, many-hour sustained load — which neither cluster has yet seen. - **Non-determinism.** Even originally on 4.0.6 the wedge was intermittent (one fresh-cluster build ran clean back then too), so a single clean pass is not proof of a fix. ### Where this leaves the issue We are **not** claiming 4.1.2 fixes the underlying brpc behaviour — we could not exercise the defect to test that claim. What we can say is: a fresh, correctly-configured cluster did not reproduce the wedge on 4.0.6 or 4.1.2 under heavy, idle, rapid, and concurrent multi-replica load. If it would help triage, our next steps would be (a) deliberately recreating the degraded precondition (8040 ingress absent → clone storm → single-BE restart) to drive a 4.0.6 wedge, then re-testing 4.1.2 under those same conditions, and/or (b) a full production-scale sustained load. Happy to run either and report back, and to pull fresh `pstack` / brpc `/vars` / `/rpcz` dumps if you want them from any of these runs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
