ojalberts-itc commented on issue #64708:
URL: https://github.com/apache/doris/issues/64708#issuecomment-4778070098

   ## Update: the wedge is **not OS-related**; reproduced on Ubuntu 22.04 (Doris 4.0.6)
   
   To exclude the operating system as a factor, we reproduced this wedge on a **second, different OS**.
   We previously reported it on **Amazon Linux 2023**; we have now reproduced the *same* wedge, with
   the *same* in-process signature, on **Ubuntu 22.04 LTS**, using **identical infrastructure**: same
   Terraform, instance types, topology, security groups (incl. the cause-#1 BE→BE `:8040` self-ingress
   rule), `be.conf`/`fe.conf`, replication factor, and Doris build. **The only variable changed is the
   host OS.** The wedge is present on both, so **the OS is not the cause**. This is consistent with an
   upstream brpc/Doris load-stream defect.
   
   This separates **existence** (4.0.6 exhibits the wedge, now on two OSes) from **trigger** (an
   accumulated cluster under multi-replica load); the trigger evidence is unchanged from the prior
   4.0.6/4.1.2 reports.
   
   ### Version
   
   `doris-4.0.6-rc02-1663f25c16f`: official x64 (AVX2) GA tarball. Build path in the stacks is
   `/home/zcp/repo_center/doris_release/doris/be/src/...` (`ldb-toolchain-v0.26`, gcc 15).
   
   ### Environment (identical to the Amazon Linux run except the OS)
   
   - 3 FE + 4 BE, AWS `r8i.2xlarge` (8 vCPU / 64 GiB), coupled (storage-compute) mode, replication = **3**.
   - `experimental_enable_single_replica_insert` = **false the whole time** (no toggle; see confounds).
   - **OS: Ubuntu 22.04.5 LTS, kernel 6.8.0-1057-aws** (vs Amazon Linux 2023, kernel 6.1, in the prior run).
   
   ### Trigger / preconditions (repro-survivable)
   
   Honest precondition, so a fresh-cluster smoke test does not "disprove" this: native ingest and a
   2.8B-row federated read **passed at deploy** (03:21–03:22 UTC), i.e. the `:8060` load path worked
   when the cluster was new. The wedge then appeared **~20 min later**, after accumulated repl=3
   cross-BE activity (native ingest + internal `audit_log` stream-loads + the test's reads/writes) and
   **before** any heavy bulk build. It is **persistent**: still wedged **~6 hours after onset** (a
   repl=3 write canary wedged **4/6** at +6 h; ~6,000 `[E1008]` lines on `:8060` in the first 25 min,
   still ~212 per 15 min hours later). It is *intermittent at the cluster level* (a given repl=3 write
   wedges only when its tablet replica set includes a wedged BE) but **persistent on the affected BEs**,
   and it **spreads** BE→BE over time. This matches the known "accumulated cluster" precondition;
   existence on Ubuntu is the new fact here.
   
   ### The parked write thread — same `LoadStreamStub::open`, now on Ubuntu
   
   Doris's own `be.WARNING` prints the stack at the failure point:
   
   ```text
   open stream failed: [INTERNAL_ERROR]Failed to connect to backend <id>: [E1008]Reached timeout=60000ms @10.0.0.133:8060
     0#  doris::LoadStreamStub::open(...)                       be/src/vec/sink/load_stream_stub.cpp:208
     1#  doris::LoadStreamStubs::open(...)                      be/src/vec/sink/load_stream_stub.cpp:574
     2#  doris::VTabletWriterV2::_open_streams_to_backend(...)  be/src/vec/sink/writer/vtablet_writer_v2.cpp:317
     3#  doris::VTabletWriterV2::_open_streams()                be/src/vec/sink/writer/vtablet_writer_v2.cpp:298
     4#  doris::VTabletWriterV2::open(...)  →  AsyncResultWriter::process_block  →  ThreadPool::dispatch_thread
   ```
   
   ### `[E1008]` on `:8060` across all 4 BEs (parks growing to 538s / 900s)
   
   The broken load-stream socket is never revived; the cancel propagates as
   `failed to write enough replicas`:
   
   ```text
   load_stream_stub.cpp:369  LoadStreamStub ... src_id=...884, dst_id=...881 is cancelled because of
      [INTERNAL_ERROR]Failed to connect to backend ...881: [E110]... Socket{...addr=10.0.0.43:8060...}: Connection timed out
   load_stream_stub.cpp:591  open stream failed: ...[E112]Not connected to 10.0.0.43:8060 yet, server_id=217
   vtablet_writer_v2.cpp:321 failed to open stream to backend ...881
   brpc_closure.h:128        RPC meet failed: [E1008]Reached timeout=537999ms @10.0.0.133:8060
   brpc_closure.h:128        RPC meet failed: [E1008]Reached timeout=900000ms @10.0.0.53:8060
   fragment_mgr.cpp:1177/1198 brpc stub: 10.0.0.118:8060 check failed [E1008] → remove brpc stub from cache
   stream_load_executor.cpp:107 ... node=10.0.0.133:8060, open failed, failed to open tablet writer,
      error=RPC call is timed out, [E1008]Reached timeout=60000ms @10.0.0.133:8060 ... elapse(s)=64
   ```
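
To see how the `[E1008]` timeouts distribute across peers, lines like those in the excerpt can be tallied with a short script. This is a minimal sketch, assuming the `[E1008]Reached timeout=<ms>ms @<ip>:<port>` shape shown in the excerpt holds for every line of interest; `tally_e1008` and the sample list are illustrative names, not part of Doris or brpc.

```python
import re
from collections import defaultdict

# Pattern copied from the log excerpts in this comment; other brpc error
# shapes (E110, E112) are deliberately ignored in this sketch.
E1008 = re.compile(r"\[E1008\]Reached timeout=(\d+)ms @(\d+\.\d+\.\d+\.\d+):(\d+)")


def tally_e1008(lines):
    """Return {peer: (count, max_timeout_ms)} for E1008 timeouts."""
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        for timeout_ms, ip, port in E1008.findall(line):
            entry = stats[f"{ip}:{port}"]
            entry[0] += 1
            entry[1] = max(entry[1], int(timeout_ms))
    return {peer: (count, worst) for peer, (count, worst) in stats.items()}


# Sample lines taken from the excerpt; in practice, iterate over be.WARNING.
sample = [
    "brpc_closure.h:128 RPC meet failed: [E1008]Reached timeout=537999ms @10.0.0.133:8060",
    "brpc_closure.h:128 RPC meet failed: [E1008]Reached timeout=900000ms @10.0.0.53:8060",
    "stream_load_executor.cpp:107 ... [E1008]Reached timeout=60000ms @10.0.0.133:8060 ... elapse(s)=64",
]

for peer, (count, worst) in sorted(tally_e1008(sample).items()):
    print(peer, count, worst)
```

Running the same tally per BE and per time window gives a quick view of whether the set of unreachable peers is growing.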
   
   ### Workers parked (not saturated), zero-clone, all `Alive=true`
   
   brpc `:8060` `/vars` at the wedge: `bthread_worker_usage` **≈ 1.1** (of 256 workers),
   `bthread_count` 5, `load_stream_count` 1–3, so threads are blocked on the RPC, not starved. FE
   `SHOW PROC '/cluster_balance/{running,pending}_tablets'` was **empty** (zero-clone, so isolated from
   the cause-#1 tablet-clone path). `SHOW BACKENDS` showed all 4 BEs `Alive=true` throughout (the 9050
   heartbeat is a separate threadpool). Only a **simultaneous full-BE restart** clears it; a single-BE
   restart does not.
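
The "parked, not saturated" reading can be checked mechanically from a saved `/vars` dump. This is a sketch, assuming the plain-text `/vars` output consists of `name : value` lines (worth verifying against your build); `parse_vars`, `workers_look_parked`, and the 50% threshold are illustrative choices, not brpc or Doris settings.

```python
def parse_vars(text):
    """Parse 'name : value' lines into a dict, keeping numeric values only."""
    out = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            out[name.strip()] = float(value.strip())
        except ValueError:
            pass  # non-numeric vars (strings, JSON blobs) are skipped
    return out


def workers_look_parked(vars_map, total_workers=256, busy_fraction=0.5):
    """True when worker usage is well below capacity.

    total_workers=256 is the figure quoted in this comment; busy_fraction
    is an arbitrary illustrative threshold.
    """
    usage = vars_map.get("bthread_worker_usage")
    if usage is None:
        raise KeyError("bthread_worker_usage missing from /vars dump")
    return usage < total_workers * busy_fraction


# Values as reported at the wedge; the dump layout itself is an assumption.
sample = """bthread_worker_usage : 1.1
bthread_count : 5
load_stream_count : 3
"""

snapshot = parse_vars(sample)
print(snapshot, workers_look_parked(snapshot))
```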
   
   ### What we ruled out on Ubuntu, with positive evidence (captured while wedged)
   
   Each environmental cause was excluded by direct test on the wedged Ubuntu cluster, so "it's your
   network/OS" is answered up front:
   
   - **Raw L4 to `:8060` is OPEN**: `cat >/dev/tcp/<peer>/8060` succeeds to every peer; 11–14
     **ESTABLISHED** `:8060` sockets per BE in `ss`. The stall is in the brpc application layer, not TCP.
   - **No host firewall**: `ufw` inactive; `nftables` ruleset empty; `iptables` default-`ACCEPT` only.
   - **conntrack not exhausted**: `nf_conntrack_count` 0 / max 1048576.
   - **No Nitro/ENA throttle**: `bw_in/out_allowance_exceeded`, `pps_allowance_exceeded`,
     `conntrack_allowance_exceeded`, `linklocal_allowance_exceeded` all **0** on all 4 BEs.
   - **No SELinux** (Ubuntu); AppArmor enabled with **no Doris profile** loaded.
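
The L4 and ENA checks in this list are easy to script so they can be rerun on every BE while the cluster is wedged. In this sketch, `tcp_open` performs a plain TCP connect (the same thing the `/dev/tcp` probe exercises), and `nonzero_allowance_counters` scans saved `ethtool -S <iface>` output, assumed here to be `name: value` lines, for any `*_allowance_exceeded` counter above zero. Both function names and the sample text are illustrative.

```python
import socket


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def nonzero_allowance_counters(ethtool_stats):
    """Return {counter: value} for *_allowance_exceeded counters > 0."""
    hits = {}
    for line in ethtool_stats.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name.endswith("_allowance_exceeded"):
            count = int(value.strip())
            if count > 0:
                hits[name] = count
    return hits


# Counter names as listed above; the surrounding layout is an assumption.
sample = """NIC statistics:
     bw_in_allowance_exceeded: 0
     bw_out_allowance_exceeded: 0
     pps_allowance_exceeded: 0
     conntrack_allowance_exceeded: 0
     linklocal_allowance_exceeded: 0
"""

# An empty dict means no ENA allowance was exceeded in this snapshot.
print(nonzero_allowance_counters(sample))
```

A call such as `tcp_open("10.0.0.133", 8060)` returning `True` while writes to that BE still time out with `[E1008]` would point at the application layer rather than TCP, which is the distinction drawn in the first bullet.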
   
   ### Confounds (disclosed) and why OS is excluded
   
   - **No mitigation-toggle confound** here: `single_replica_insert` was `false` **from t=0**, never
     toggled (the original 4.0.6 occurrence had a toggle ~2 min prior; this run removes that variable).
   - **OS is the only delta** between this run and the Amazon Linux run (identical Terraform/config/build).
     Two reproductions whose environments differ only in the OS both show the wedge, so the OS is not
     the cause.
   - **`enable_brpc_connection_check=true` does not fix it** (it was on throughout): it detects the
     broken stub and evicts it from cache (`remove brpc stub from cache`), but the socket never revives.
   
   ### What we expected
   
   A BE-to-BE load-stream `brpc` socket that goes `Broken`/times out should be detected and
   **re-established** so writes recover without operator intervention; instead it is never revived and
   only a full-fleet BE restart clears it.
   
   ### Questions
   
   1. With the OS now excluded by a second-OS reproduction (Ubuntu 22.04 + Amazon Linux 2023, identical
      infra), does this confirm the failure is in the brpc load-stream layer rather than the environment?
   2. Is this the brpc **#1168** class (socket `Broken`, never revived), and is there a fixing PR or a
      Doris version that resolves it?
   3. Should `enable_brpc_connection_check` be able to recover this socket, or is eviction-without-revive
      the expected (insufficient) behaviour here?
   
   ## Willing to submit a PR
   
   No, but **happy to test a patch** or run any targeted capture you want on the live wedged cluster.
   
   ## What we captured / can provide
   
   Attached `64708-ubuntu-4.0.6-evidence.tar.gz`
   (sha256 `03896c90b393d8577f007ebffe69bb82b0831fb9ba1697d6a9dd000c7fab2950`): per-BE `be.WARNING`
   load-stream failures + brpc `:8060` `/vars`, the parked-stack backtrace, FE cluster-state
   (`Alive=true` + empty `cluster_balance`), the environment-exclusion probes, and the persistence
   timeline. Full per-BE `gdb 'thread apply all bt'` dumps, `/vars`, `/rpcz`, and complete `be.WARNING`
   are available on request.
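
To confirm the attachment arrived intact, its SHA-256 can be compared against the digest quoted here. This is a small sketch using only the standard library; `sha256_of` is an illustrative helper, and the path assumes the tarball sits in the current directory.

```python
import hashlib
import os

# Digest and filename as quoted in this comment.
EXPECTED = "03896c90b393d8577f007ebffe69bb82b0831fb9ba1697d6a9dd000c7fab2950"
TARBALL = "64708-ubuntu-4.0.6-evidence.tar.gz"


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Only runs the comparison when the tarball is actually present.
if os.path.exists(TARBALL):
    print("match" if sha256_of(TARBALL) == EXPECTED else "MISMATCH")
```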

