ojalberts-itc commented on issue #64708:
URL: https://github.com/apache/doris/issues/64708#issuecomment-4772890573

   ## Update — retested on 4.1.2; could not reproduce on a fresh install with 
the SG correct from t=0 (on either 4.1.2 *or* 4.0.6)
   
   **TL;DR.** We upgraded to **Apache Doris 4.1.2** and re-ran our 
reproduction. 4.1.2 did **not** wedge. But when we ran the **identical** test 
against a **fresh 4.0.6** cluster — the exact build this issue was filed 
against — it **also did not wedge**. The decisive difference from our original 
incident is not the Doris version: it is that both new clusters were built 
**from clean infrastructure with our cause-#1 fix (the BE→BE 
`webserver_port`/8040 self-ingress) present from t=0**. We can no longer 
reproduce the brpc load-stream wedge on a freshly-provisioned, 
correctly-configured cluster. We therefore **cannot yet attribute the clean 
4.1.2 result to an upstream fix** — and we want to be honest about that rather 
than report a false "fixed".
   
   ### What we did
   
   - Re-deployed the same coupled-mode topology (3 FE + 4 BE, 8 vCPU / 64 GiB 
nodes, repl=3, local gp3 BE storage) on **fresh instances with fresh empty data 
volumes**, first on **4.1.2** (`apache-doris-4.1.2-bin-x64.tar.gz`, build tag 
`doris-4.1.2-rc01`), then a **control** on **4.0.6** (`doris-4.0.6-rc02-…`, the 
build this issue was filed against).
   - Held everything but the Doris version constant: same infrastructure, same 
SG rules (incl. the cause-#1 8040 self-ingress **present from first boot**), 
`enable_brpc_connection_check=true` in `be.conf`, and 
**`experimental_enable_single_replica_insert` left unset** so the cause-#3 
multi-replica path is fully exercised.
   
   ### How we tested (identical protocol on both versions)
   
   The workload is the heaviest multi-replica writer we have: build 4 native 
UNIQUE-KEY merge-on-write tables (`replication_num=3`, `DISTRIBUTED BY 
HASH(...) BUCKETS 16`) by `INSERT … SELECT` from an Iceberg (S3 Tables) 
external catalog plus `UPDATE … FROM` maintenance passes — the same shape of 
load that originally wedged us.
   
   1. **Sustained build** — ~26 min of continuous repl=3 writes, ~222M rows 
across the 4 tables.
   2. **Idle-watch** — a repl=3 write probe every ~40 s for 16 min after the 
build, covering the 8–11 min post-load window where our original wedge fired.
   3. **Rapid sequential churn** — 60 fast repl=3 multi-replica `INSERT`s (1–3 
s each), the original "fast small load" shape.
   4. **Concurrent churn** — 4 parallel writers (~100 concurrent repl=3 loads), 
the concurrent shape that wedged us once before.
   
   Per-BE instrumentation each run (the same signals from the original capture):
   
   | Signal | Original wedge (4.0.6) | This retest — 4.1.2 | This retest — 
4.0.6 control |
   | --- | --- | --- | --- |
   | `open_load_stream` handler entries during window | **0** (handler never 
entered) | ~5,100 cluster-wide | ~4,000 cluster-wide |
   | `[E1008]Reached timeout … @<be>:8060` (incl. loopback) | present | **0** | 
**0** |
   | brpc `:8060` "Broken" / never-revived socket | present | **0** | **0** |
   | `SHOW BACKENDS` Alive | true (misleading) | true | true |
   | Write path | wedged (full BE restart to recover) | healthy throughout | 
healthy throughout |
   
   On 4.1.2 the `open_load_stream` handler was entered ~5,100 times with zero 
brpc timeouts and zero broken sockets — a clean inversion of the original 
signature (where the handler was entered **zero** times during the stall and 
`[E1008]` loopback timeouts dominated). The 4.0.6 control behaved the same.
   
   ### Why we think it didn't reproduce
   
   In our original incident the reliable wedges all occurred on clusters 
carrying **accumulated degradation history**: the cause-#1 8040 self-ingress 
was *missing* at first, which drove an unbounded clone-repair storm, and we had 
also cycled single-BE restarts. The brpc load-stream socket went "Broken" 
against that degraded backdrop. On a **fresh** cluster with the 8040 rule in 
place from t=0, the clone storm never happens and we have not been able to 
drive the BE into the "Broken socket" state — on **either** 4.0.6 or 4.1.2. 
This suggests cause #3, as we observed it, may not be reachable independently 
of the cause-#1 degradation, rather than being a pure function of the 
steady-state multi-replica write workload.
   
   ### Honest limitations
   
   - **Scale.** This retest was at POC scale (one reporting period, 4 of our 
domains, ~222M output rows, ~26 min). Our production load is ~50× larger and 
runs for hours. Because the wedge is an *accumulation* failure, we cannot rule 
out that it only manifests under the full-scale, many-hour sustained load — 
which neither cluster has yet seen.
   - **Non-determinism.** Even originally on 4.0.6 the wedge was intermittent 
(one fresh-cluster build ran clean back then too), so a single clean pass is 
not proof of a fix.
   
   ### Where this leaves the issue
   
   We are **not** claiming 4.1.2 fixes the underlying brpc behaviour — we could 
not exercise the defect to test that claim. What we can say is: a fresh, 
correctly-configured cluster did not reproduce the wedge on 4.0.6 or 4.1.2 
under heavy, idle, rapid, and concurrent multi-replica load. If it would help 
triage, our next steps would be (a) deliberately recreating the degraded 
precondition (8040 ingress absent → clone storm → single-BE restart) to drive a 
4.0.6 wedge, then re-testing 4.1.2 under those same conditions, and/or (b) a 
full production-scale sustained load. Happy to run either and report back, and 
to pull fresh `pstack` / brpc `/vars` / `/rpcz` dumps if you want them from any 
of these runs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to