rajvarun77 opened a new pull request, #3339:
URL: https://github.com/apache/brpc/pull/3339

   ### Problem
   
   `brpc_channel_unittest` bundles dozens of timing-sensitive `TEST_F` cases
   (backup request, retry/backoff, timeouts, connection failure) into a single
   test binary. Each test performs real-time waits (server-side `sleep_us`,
   backup-request timers, connection retries). gtest runs them serially in one
   process, so the binary's wall time is the **sum** of all those waits.
   
   On contended CI runners (GitHub-hosted `ubuntu-22.04`, ~4 shared vCPU with
   hypervisor steal), that cumulative time exceeds Bazel's **default per-test
   300s limit** (`size = "medium"`), so the binary intermittently fails with
   `TIMEOUT` even though every assertion would pass given enough time.
   
   ### Evidence (reproduced on GitHub Actions)
   
   Measured on `ubuntu-22.04`, `--nocache_test_results`:
   
   | Configuration | Result |
   | --- | --- |
   | current (`size=medium`, 300s), 5 runs under load | **TIMEOUT in 4/5 @ 300.0s** |
   | `size=large` (900s), single run | **PASSED in 91.7s** |
   | `size=large` (900s), 20 serialized no-cache runs | **20/20 PASSED, slowest 114.0s** |
   
   The nominal run is ~92–114s, but under parallel-job contention the same
   binary balloons past 300s, a ~3× slowdown that crosses the medium ceiling.
   Raising the limit to `large` (900s) gives ~8× nominal headroom and absorbs
   the spike.
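
The headroom figures can be checked with a few lines of arithmetic. This is only a sketch: the runtimes are copied from the table above, the 300s and 900s limits are Bazel's defaults for `medium` and `large`, and the `headroom` helper is a name made up for this illustration.

```python
# Figures copied from the evidence table; nothing here is newly measured.
MEDIUM_LIMIT_S = 300.0  # Bazel default timeout for size = "medium"
LARGE_LIMIT_S = 900.0   # Bazel default timeout for size = "large"
FASTEST_RUN_S = 91.7    # single uncontended run
SLOWEST_RUN_S = 114.0   # slowest of the 20 serialized runs


def headroom(limit_s, runtime_s):
    """Return how many times the runtime fits into the timeout (hypothetical helper)."""
    return limit_s / runtime_s


for label, limit_s in (("medium", MEDIUM_LIMIT_S), ("large", LARGE_LIMIT_S)):
    low = headroom(limit_s, SLOWEST_RUN_S)
    high = headroom(limit_s, FASTEST_RUN_S)
    print(f"{label}: {low:.1f}x to {high:.1f}x nominal")
```

With `medium`, a slowdown of roughly 2.6× to 3.3× is already enough to hit the limit, which matches the observed timeouts under load; `large` leaves roughly 7.9× to 9.8×.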
   
   Bench runs (throwaway branch, not part of this PR):
   - baseline `TIMEOUT 4/5` + rejected `shard_count=4` experiment `FAILED 20/20`:
     https://github.com/rajvarun77/brpc/actions/runs/27396621709
   - `size=large` validation (20/20 serialized + 91.7s timing):
     https://github.com/rajvarun77/brpc/actions/runs/27453397271
   
   ### Fix
   
   Add an optional `per_test_size` override to the `generate_unittests` macro
   and set `brpc_channel_unittest` to `size = "large"`. **No test source changes.**
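
For readers who have not opened the diff, the shape of such an override can be sketched as follows. This is an illustration, not the actual macro: it is written in plain Python so it runs standalone, and the signature of `generate_unittests`, the `cc_test` stand-in, the `DECLARED` list, and the file name `other_unittest.cpp` are all assumptions made for the example. In a real `.bzl` file the stand-in would be the project's actual test rule call.

```python
DEFAULT_SIZE = "medium"  # size used when no override is given (assumed default)
DECLARED = []            # records what the stand-in rule was called with


def cc_test(name, srcs, size):
    """Stand-in for the real Bazel test rule; it only records its arguments."""
    DECLARED.append({"name": name, "srcs": srcs, "size": size})


def generate_unittests(srcs, per_test_size=None):
    """Declare one test target per source file (hypothetical signature).

    per_test_size maps a target name to a Bazel size string; targets that
    are not listed keep DEFAULT_SIZE.
    """
    overrides = per_test_size or {}
    for src in srcs:
        target = src.rsplit(".", 1)[0]  # "foo_unittest.cpp" -> "foo_unittest"
        cc_test(name=target, srcs=[src], size=overrides.get(target, DEFAULT_SIZE))


generate_unittests(
    srcs=["brpc_channel_unittest.cpp", "other_unittest.cpp"],
    per_test_size={"brpc_channel_unittest": "large"},
)
for target in DECLARED:
    print(target["name"], target["size"])
```

Keeping the override optional means every other target keeps its current size and timeout.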
   
   ### Why not shard it?
   
   Sharding (`shard_count`) was tried first and **rejected**: it fails
   deterministically (20/20). The `TEST_F` cases in `brpc_channel_unittest` share
   fixed loopback endpoints and global state, so running shards as parallel
   processes makes a "connection should be refused" test
   (`ChannelTest.connection_failed_selective`) observe **another shard's live
   server** on the same port and see a successful connection instead of
   `ECONNREFUSED`. The tests are not shard-safe; raising the size limit is the
   only safe lever without rewriting the suite for isolation.
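
The collision can be reproduced in isolation. The following is a standalone sketch, not code from the brpc test suite: `try_connect` and `free_port` are hypothetical helpers, and the port is picked at run time instead of being one of the fixed endpoints the tests use. It assumes Linux-style loopback behaviour, where connecting to a port with no listener fails with `ECONNREFUSED`.

```python
import errno
import socket


def try_connect(port):
    """Return 0 if a TCP connect to 127.0.0.1:port succeeds, else the errno."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        client.settimeout(2.0)
        return client.connect_ex(("127.0.0.1", port))


def free_port():
    """Ask the kernel for a currently unused loopback port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


port = free_port()

# Serial run: nothing listens on the port, so the connect is refused.
alone = try_connect(port)

# Sharded run: "another shard" has a live server on the same port.
other_shard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
other_shard.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
other_shard.bind(("127.0.0.1", port))
other_shard.listen(1)
contended = try_connect(port)
other_shard.close()

print("serial: ", errno.errorcode.get(alone, alone))
print("sharded:", "connected" if contended == 0 else errno.errorcode.get(contended, contended))
```

A test that asserts on the refused connection passes in the first case and fails in the second, although the code under test behaves identically in both.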
   
   
   ---
   cc @chenBright for review.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

