ojalberts-itc commented on issue #64708:
URL: https://github.com/apache/doris/issues/64708#issuecomment-4777922226
## Update: the wedge **reproduces on Doris 4.1.2** (`doris-4.1.2-rc01`)
Following the previous comment (where we promised a 4.1.2 test): we redeployed the cluster
**fresh on 4.1.2** (`doris-4.1.2-rc01-aec169d2025`, official x64 GA tarball; 3 FE + 4 BE, repl=3;
the cause-#1 BE→BE `:8040` self-ingress SG rule present from t=0; empty data volumes) and ran a
heavy multi-replica write workload with `experimental_enable_single_replica_insert` **OFF**.
**It wedged, with a signature identical to 4.0.6.** The workload (a ~222M-row repl=3 build + churn)
ran clean *during* the load, then the write path wedged at **idle, ~10 minutes after the last heavy
write**. We captured all 4 BEs live before any restart.
### The parked write thread: same `LoadStreamStub::open` as before, now on 4.1.2
Doris's own `be.WARNING` printed the stack at the failure point:
```text
open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>: [E1008]Reached timeout=60000ms @10.0.0.227:8060
0# doris::LoadStreamStub::open(...)
be/src/exec/sink/load_stream_stub.cpp:208
1# doris::LoadStreamStubs::open(...)
be/src/exec/sink/load_stream_stub.cpp:574
2# doris::VTabletWriterV2::_open_streams_to_backend(...)
be/src/exec/sink/writer/vtablet_writer_v2.cpp:317
3# doris::VTabletWriterV2::_open_streams()
be/src/exec/sink/writer/vtablet_writer_v2.cpp:298
```
(Build path `/home/zcp/repo_center/doris_release/doris/be/src/...` confirms
the official 4.1.2 GA build.)
### `[E1008]` on `:8060` across all 4 BEs
`be.WARNING` E1008/broken-socket counts at the wedge were **23 / 34 / 30 / 24** across the four
BEs. The `enable_brpc_connection_check` path detects the broken socket and evicts the stub from
the cache, but the connection never recovers:
```text
fragment_mgr.cpp:953 brpc stub: 10.0.0.227:8060 check failed: [E1008]Reached timeout=10000ms @10.0.0.227:8060
fragment_mgr.cpp:974 remove brpc stub from cache: 10.0.0.227:8060, error: [E1008]...
brpc_client_cache.h:326 open brpc connection to 10.0.0.137:8060 failed: [E1008]Reached timeout=2000ms @10.0.0.137:8060
vtablet_writer.cpp:715 failed to open tablet writer may caused by timeout
...
```
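
For anyone wanting to compare their own logs against the counts quoted above, a tally along these lines may help. This is a minimal sketch, not the script we used: it assumes each log record sits on one physical line, it only counts full `[E1008]Reached timeout=...` records (so it will not necessarily match an "E1008/broken-socket" count), and the function name `tally_e1008` is hypothetical.

```python
import re
from collections import Counter

# Matches the brpc timeout marker and the peer endpoint that follows it,
# e.g. "[E1008]Reached timeout=10000ms @10.0.0.227:8060".
E1008_RE = re.compile(
    r"\[E1008\]Reached timeout=(\d+)ms @(\d{1,3}(?:\.\d{1,3}){3}:\d+)"
)


def tally_e1008(lines):
    """Count full E1008 timeout records per peer endpoint (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for _timeout_ms, endpoint in E1008_RE.findall(line):
            counts[endpoint] += 1
    return counts


# Sample records copied from the excerpt above; the truncated
# "[E1008]..." record is deliberately not counted.
SAMPLE_LINES = [
    "fragment_mgr.cpp:953 brpc stub: 10.0.0.227:8060 check failed: "
    "[E1008]Reached timeout=10000ms @10.0.0.227:8060",
    "fragment_mgr.cpp:974 remove brpc stub from cache: 10.0.0.227:8060, "
    "error: [E1008]...",
    "brpc_client_cache.h:326 open brpc connection to 10.0.0.137:8060 failed: "
    "[E1008]Reached timeout=2000ms @10.0.0.137:8060",
]

for endpoint, n in tally_e1008(SAMPLE_LINES).most_common():
    print(endpoint, n)
```

To run it over a real log, pass an open file object for each BE's `be.WARNING` to `tally_e1008` instead of `SAMPLE_LINES`.
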
Raw TCP to `:8060` stays open throughout; the stall is in the brpc
application layer.
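
The TCP-versus-application-layer distinction can be checked with a plain connect probe. The sketch below is only an illustration of that check (the helper name `tcp_port_open` is hypothetical): it completes a TCP handshake and closes the socket, without speaking brpc, so a `True` result while `[E1008]` timeouts continue points at the layer above TCP.

```python
import socket


def tcp_port_open(host, port, timeout_s=2.0):
    """Return True if a TCP handshake to host:port completes within timeout_s.

    Hypothetical helper: it only tests the transport layer and says nothing
    about whether the brpc server behind the port is answering RPCs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, `tcp_port_open("10.0.0.227", 8060)` run from another BE during the wedge would be expected to return `True` if the situation matches what we saw.
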
### Workers parked, not saturated — and zero clone activity
brpc `:8060` `/vars` at the wedge: `bthread_worker_usage` **0.46–1.76**, `bthread_count` 5, with
`rpc_server_8060_connection_count` 58–59. The write threads are blocked on the RPC, not starved.
FE `SHOW PROC '/cluster_balance/{running,pending}_tablets'` was **empty**: a zero-clone wedge,
purely the brpc load-stream socket going Broken. `SHOW BACKENDS` showed all 4 BEs `Alive=true`
throughout (the 9050 heartbeat is a separate threadpool). Both repl=3 **and** repl=1 writes hung;
only a **simultaneous full-BE restart** cleared it.
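
To pull the same three counters out of a saved `/vars` dump, something like the following sketch could be used. It assumes the dump is the plain-text form with one `name : value` pair per line (as when the page is fetched with `curl`); the helper name `parse_brpc_vars` is hypothetical, and the sample values are illustrative, taken from the ranges quoted above rather than from a single real capture.

```python
def parse_brpc_vars(text):
    """Parse a plain-text /vars dump into {name: value-as-string}.

    Hypothetical helper; assumes one "name : value" pair per line and
    silently skips lines that do not contain the " : " separator.
    """
    result = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            result[name.strip()] = value.strip()
    return result


# Illustrative sample built from the figures quoted in this comment.
SAMPLE_VARS = """\
bthread_count : 5
bthread_worker_usage : 0.46
rpc_server_8060_connection_count : 58
"""

WATCHED = (
    "bthread_worker_usage",
    "bthread_count",
    "rpc_server_8060_connection_count",
)

snapshot = parse_brpc_vars(SAMPLE_VARS)
for name in WATCHED:
    print(name, snapshot.get(name, "<missing>"))
```

Values are kept as strings because `/vars` entries are not all numeric; convert the ones you need with `float()` or `int()`.
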
### Preconditions: why a fresh-cluster smoke test will NOT reproduce it
This is the important part if you try to reproduce. **A fresh cluster running a single heavy load
does not wedge.** In our run, the identical 85M-row repl=3 load on a freshly-bootstrapped 4.1.2
cluster completed in **73s, clean**. The same SQL, same `single_replica_insert=OFF`, on a cluster
that had **accumulated history** then **failed** (125s, then 365s on retry). The discriminating
precondition is the cluster's accumulated/degraded state; heavy multi-replica load is the proximate
trigger *on top of* that. To reproduce, expect to need sustained accumulation (a full multi-table
build plus a burst of concurrent repl=3 loads), not one load on a clean cluster.
We had two occurrences, with **complementary** confounds that cancel out:
1. **First (fresh bootstrap, no restart):** a full ~222M-row build + churn ran clean, then the
   write path wedged at idle ~10 min later. This was on a cluster that had **never** been
   `systemctl restart`-ed, so it is **not** explained by the known clone-storm-after-hard-restart
   path. (It did have `single_replica_insert` toggled OFF→ON ~2 min before the wedge, so this
   occurrence alone couldn't rule out the toggle.)
2. **Replay (no toggle):** we re-ran with `single_replica_insert` **OFF the whole time, no toggle**;
   the heavy load wedged again (the `[E1008]` / `failed to write enough replicas 1/3` /
   `failed to open streams to any BE` above). So the **toggle is not necessary**.
Union of the two: the wedge appears **without the toggle** and **without a hard-restart-degraded
cluster**, so neither is a necessary condition. The common factor is a cluster with accumulated
state under sustained multi-replica load, on 4.1.2.
Net: **the brpc BE write-wedge is present on Doris 4.1.2; it is not fixed.** The in-process
signature (the `LoadStreamStub::open` stack, `[E1008]` on `:8060` across all BEs, zero-clone,
all-`Alive=true`, full-restart-only recovery) is identical to the 4.0.6 reports, i.e. the brpc #1168
class. Full per-BE `be.WARNING` excerpts, the brpc `/vars`, and FE cluster-state for **both**
occurrences are attached as `64708-4.1.2-evidence.tar.gz`
(sha256 `e52e4c76c2f25dd87d8aab2434dc332530e20ea196eff0c85f67196d7855c2fd`). Original Question 1
stands: is this a known brpc 1.4.0 load-stream socket defect, and is there a fixing PR or a
version that resolves it?
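
To check a downloaded copy of the evidence archive against the sha256 quoted above, a short script like this sketch is enough. The helper name `sha256_of_file` is hypothetical and the local path is whatever you saved the attachment as.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the return value of `sha256_of_file("64708-4.1.2-evidence.tar.gz")` with the digest given in this comment; any mismatch means the download is not the file we attached.
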
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]