[I] [Feature] Add Power-of-Two-Choices Peak-EWMA load balancer (p2c) — O(1) tail-latency-aware policy (brpc)

via GitHub Fri, 12 Jun 2026 21:35:37 -0700


rajvarun77 opened a new issue, #3340:
URL: https://github.com/apache/brpc/issues/3340


   **Is your feature request related to a problem?**
   
   brpc has several client-side load balancers (`rr`, `wrr`, `random`, `wr`, 
`la`, the consistent-hash family, `_dynpart`), but only one of them is 
latency-aware: `la` (locality-aware). `la` maintains a doubly-buffered **weight 
tree over all servers** with weights driven by inverse average latency, and 
samples a single point on the cumulative-weight line.
   
   brpc currently has **no Power-of-Two-Choices (P2C) sampling load balancer** 
— neither Envoy-style least-request (P2C over active-request-biased weights) 
nor Finagle-style Peak-EWMA. P2C is the single most widely deployed 
tail-latency balancer in modern RPC stacks and service meshes (Finagle/Twitter 
`p2cPeakEwma`, linkerd, Envoy `LEAST_REQUEST`), and it is absent from brpc.
   
   This matters in two regimes that `rr`/`random` ignore and where `la`'s 
averaging window lags:
   - a **single slow/degraded backend** (GC pause, noisy neighbor, cold cache) 
— should be shed within one sample, not after an averaging window;
   - a **heterogeneous fleet** (mixed CPU generations/capacities) — selection 
should continuously bias toward faster, less-loaded nodes.
   
   **Describe the solution you'd like**
   
   Add a new client-side load balancer, registered as **`p2c`** (alias 
`p2c_ewma`): **Power-of-Two-Choices with Peak-EWMA latency scoring**.
   
   Algorithm (per request):
   1. Sample **two distinct random backends** from the server list 
(`fast_rand_less_than`, as `rr`/`random` already do).
   2. Route to the one with the **lower load score**:
      - `score = ewma_us * (inflight + 1) / max(weight, 1)`
      - `ewma_us` is a **peak-sensitive EWMA of round-trip latency** per 
backend: an upward latency spike sets the score to the spike value immediately; 
downward recovery decays with a configurable time constant `TAU` (Finagle 
default decay ≈ 10s). This is what sheds a stalled backend within one sample.
      - `inflight` is the backend's outstanding-request count (incremented at 
select, decremented at completion) — the Envoy `LEAST_REQUEST` `(active+1)` 
term.
      - dividing by the static naming-service `weight` makes it **weighted 
P2C** (degrades to unweighted when weights are equal).
   3. On RPC completion, update the chosen backend's EWMA and decrement its 
in-flight; failures/timeouts inflate the latency (`max(measured, timeout)`) so 
a failing node is avoided — mirroring `la`'s error handling.
   
   Properties: **O(1) selection** (two score evaluations regardless of fleet 
size, vs `la`'s O(log N) tree walk); spike-reactive via the peak term; 
herding-resistant via two-sample comparison.
   
   **Interface feasibility — no brpc core change required.** Every hook 
P2C-EWMA needs already exists and is exercised by `la`:
   - `src/brpc/load_balancer.h` — `SelectOut::need_feedback` 
(load_balancer.h:50), `virtual void Feedback(const CallInfo&)` 
(load_balancer.h:100), and `CallInfo { begin_time_us; server_id; error_code; 
controller; }` (load_balancer.h:53-61) giving per-call latency, which backend, 
success/failure, and timeout.
   - `src/brpc/controller.cpp:873-876` — the RPC completion path actually 
invokes the hook:
     ```cpp
     if (need_feedback && c->_lb) {
         const LoadBalancer::CallInfo info =
             { begin_time_us, peer_id, error_code, c };
         c->_lb->Feedback(info);
     }
     ```
     and `controller.cpp:1120` propagates `sel_out.need_feedback` into the call.
   - Membership lives in `DoublyBufferedData<Servers, TLS>` (lock-free reads, 
copy-on-write modify) exactly as `round_robin_load_balancer.cpp`. Mutable 
per-node load state (`inflight`, `ewma_us`, `stamp_us`) is heap-allocated once 
per backend and referenced by stable pointer — the same technique 
`locality_aware_load_balancer` already uses for its per-node atomics.
   - brpc's `Socket` has no general app-level in-flight counter, so — like `la` 
— the LB owns its in-flight counters. Fully feasible.
   
   **Files to add / modify**
   - Add `src/brpc/policy/p2c_ewma_load_balancer.{h,cpp}`.
   - Modify `src/brpc/global.cpp`: include the header, add a 
`P2CEwmaLoadBalancer` member to `GlobalExtensions`, and register it next to the 
existing balancers — `LoadBalancerExtension()->RegisterOrDie("p2c", 
&g_ext->p2c_ewma_lb);`.
   - Add `test/brpc_p2c_ewma_load_balancer_unittest.cpp` (modeled on the 
`la`/`rr` LB unit tests), registered in all three build systems 
(`test/CMakeLists.txt`, `test/BUILD.bazel`, `test/Makefile`).
   - Add a `p2c` row to `docs/cn/lb.md` / `docs/en/lb.md`.
   
   **Performance / testing plan (multi-protocol, multi-config)**
   
   A cross-protocol benchmark matrix to prove the gain is real on actual 
brpc-supported backends, not just a synthetic echo server.
   - **Protocols:** baidu_std (protobuf RPC, controllable-sleep echo server — 
clean baseline), **Memcached** (`MemcacheRequest` — tiny ops, very high QPS, 
exposes selection overhead/herding), **Redis** (`RedisRequest` — 
single-threaded backend, high load sensitivity), **MySQL** (variable query cost 
— biggest tail surface), HTTP. Integration backends spawn-or-skip using the 
existing `brpc_redis_unittest.cpp` `system("which ...")` precedent, with one 
node acting as the **degraded** backend.
   - **Configs:** fleet size N ∈ {4, 16, 64, 256}; homogeneous vs heterogeneous 
latency; degradation injected (none / 1 node +50ms / 10% of fleet +20ms / 
transient spike-then-recover); offered load (low / knee / saturating); equal vs 
1:2:4 naming-service weights; pooled vs single connection.
   - **Functional tests (must pass first):** add/remove/batch, single-server, 
`ExcludedServers` exclusion, feedback-driven score shift, in-flight inc/dec 
across select→feedback (incl. error/timeout path), weighted 1:2:4 split, and 
concurrency under TSan/ASan.
   - **Metrics:** p50/p90/p99/p999/max latency; throughput at fixed SLO and 
saturating; per-node traffic share over time (divert latency after a spike); 
request-count variance (fairness/herding); selection CPU cost per op as N grows 
(P2C O(1) vs `la` O(log N)).
   - **Headline result to validate:** tail-latency reduction and 
throughput-at-SLO uplift of `p2c` over `rr`/`random`/`la` on the heterogeneous 
and degraded configs, with selection cost no worse than `rr`.
   
   **Describe alternatives you've considered**
   - **Extend `la`** — different algorithm (global weight tree, averaging 
window); does not give O(1) sampling or peak spike-reactivity. Keeping both 
gives users a real choice.
   - **Envoy-style least-request only** (`weight / (active+1)`) — a strict 
subset; P2C-EWMA folds the `(inflight+1)` active-request term *and* the 
peak-latency term into one score, so it covers least-request as the 
equal-latency special case.
   - **WRR + ORCA load reports (gRPC gRFC A58)** — requires a server-reported 
load backchannel brpc does not have; out of scope for a client-only change.
   
   **Additional context**
   
   References (primary sources):
   - Envoy — Supported load balancers (LEAST_REQUEST = P2C, `weight = lb_weight 
/ (active+1)^bias`, "resistance to herding"): 
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/load_balancers.html
   - Envoy — LEAST_REQUEST can pick the busier host (P2C rationale): 
https://github.com/envoyproxy/envoy/issues/14859
   - Finagle — Aperture Load Balancers (`p2cPeakEwma`; Peak-EWMA = moving 
average over RTT highly sensitive to peaks, weighted by outstanding requests): 
https://twitter.github.io/finagle/guide/ApertureLoadBalancers.html
   - Finagle — `Balancers` API (`p2cPeakEwma` factory): 
https://twitter.github.io/finagle/docs/com/twitter/finagle/loadbalancer/Balancers$.html
   - linkerd — "Beyond Round Robin: Load Balancing for Latency" (Peak-EWMA; 
p99.9 wins over RR/least-loaded): 
https://linkerd.io/2016/03/16/beyond-round-robin-load-balancing-for-latency/
   - gRPC — gRFC A58 Client-Side Weighted Round Robin (ORCA load reports): 
https://github.com/grpc/proposal/blob/master/A58-client-side-weighted-round-robin-lb-policy.md
   
   I'm happy to implement this end-to-end (balancer + unit tests in all three 
build systems + docs + the benchmark matrix). Would the maintainers be open to 
a `p2c` policy, and is a GitHub issue the right place to scope it, or would you 
prefer a `[DISCUSS]` thread on `[email protected]` first?
   
   
   cc @zyearn (load-balancer policy owner / author of `la`) — would value your 
read on whether a `p2c` policy fits brpc.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] [Feature] Add Power-of-Two-Choices Peak-EWMA load balancer (p2c) — O(1) tail-latency-aware policy (brpc)

Reply via email to