He-Pin opened a new pull request, #3017:
URL: https://github.com/apache/pekko/pull/3017
## Motivation
`MixedProtocolClusterSpec` "be allowed to join a cluster with a node using
the pekko protocol (udp)" still fails on virtualized JDK 25 runs after #2997
reordered `shutdownAll` to stop joining nodes first:
```
[WARN] CoordinatedShutdown(pekko://MixedProtocolClusterSpec) Coordinated
shutdown phase [actor-system-terminate] timed out after 30000 milliseconds
java.lang.RuntimeException: Failed to stop [MixedProtocolClusterSpec] within
[1 minute]
... StreamSupervisor ... remote-6-0-unnamed ActorGraphInterpreter
```
The `"within [1 minute]"` outer await is a `30s` base dilated by
`pekko.test.timefactor` — i.e. this lane runs at **tf=2** (`30s × 2 = 60s`).
The JDK 25 nightly runs at **tf=4** → `120s` and passes.
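
For reference, the dilation arithmetic above can be sketched in plain Scala. This is only an illustration: `dilated` here is a hypothetical stand-alone helper that takes the time factor as a parameter, whereas the testkit reads `pekko.test.timefactor` from the actor system's configuration.

```scala
import scala.concurrent.duration._

object TimeFactorSketch {
  // Hypothetical helper: scale a base duration by a time factor,
  // rounding to the nearest nanosecond.
  def dilated(base: FiniteDuration, timeFactor: Double): FiniteDuration =
    Duration.fromNanos((base.toNanos * timeFactor + 0.5).toLong)

  // The outer await base before this change.
  val base: FiniteDuration = 30.seconds

  def main(args: Array[String]): Unit = {
    // The failing lane (tf=2) versus the passing nightly (tf=4).
    println(s"tf=2 -> ${dilated(base, 2.0).toSeconds}s")
    println(s"tf=4 -> ${dilated(base, 4.0).toSeconds}s")
  }
}
```
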
The `actor-system-terminate` phase only calls `system.finalTerminate()` and
**recovers** on its own (non-dilated) phase timeout while termination keeps
draining in the background
([`CoordinatedShutdown.scala:264-269`](https://github.com/apache/pekko/blob/main/actor/src/main/scala/org/apache/pekko/actor/CoordinatedShutdown.scala#L264-L269)).
So the inner-phase WARN is **non-binding noise** —
`ClusterTestUtil.shutdownAll`'s dilated await on `whenTerminated` is the real
pass/fail deadline. The aeron-udp transport is the slowest to drain (embedded
media driver + stacked Aeron liveness timeouts), so `60s` was simply too tight
at tf=2.
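
The reasoning above can be modelled as a toy decision function. This is a deliberately simplified sketch, not the `CoordinatedShutdown` implementation: `Outcome`, `outcome` and the `90s` drain time are hypothetical, chosen only to show that the phase WARN and the pass/fail result are governed by two independent deadlines.

```scala
import scala.concurrent.duration._

object ShutdownDeadlineSketch {
  // phaseWarned: the inner phase timed out and logged a WARN (then recovered).
  // testPassed:  termination finished before the outer await gave up.
  final case class Outcome(phaseWarned: Boolean, testPassed: Boolean)

  // Simplified model: both timers start when termination is requested.
  def outcome(
      drain: FiniteDuration,        // time the system actually needs to terminate
      phaseTimeout: FiniteDuration, // non-dilated actor-system-terminate timeout
      outerAwait: FiniteDuration    // dilated await on whenTerminated
  ): Outcome =
    Outcome(phaseWarned = drain > phaseTimeout, testPassed = drain <= outerAwait)

  // Assumed 90s drain on a slow lane (illustrative value, not a measurement).
  val tf2Lane: Outcome = outcome(90.seconds, 30.seconds, 60.seconds)
  val tf4Nightly: Outcome = outcome(90.seconds, 30.seconds, 120.seconds)
  // Healthy shutdown: well under a second.
  val healthy: Outcome = outcome(500.millis, 30.seconds, 60.seconds)

  def main(args: Array[String]): Unit = {
    println(s"tf=2 lane:    $tf2Lane")
    println(s"tf=4 nightly: $tf4Nightly")
    println(s"healthy:      $healthy")
  }
}
```

In this model the WARN fires in both slow scenarios, but only the outer await decides whether the spec fails.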
## Modification
- **`ClusterTestUtil.shutdownAll`**: raise the outer await base `30s → 60s`
(the binding, timefactor-dilated deadline), so a tf=2 lane gets ~`120s` — the
same headroom the tf=4 nightly already passes with. Added a comment explaining
why this await, not the inner phase, governs pass/fail.
- **`MixedProtocolClusterSpec` baseConfig**: raise the (non-dilated,
non-binding) `actor-system-terminate` phase timeout `30s → 60s` to suppress the
spurious WARN on the slow path and align it with the new await base.
This is a follow-up to #2997 — that PR pulled the shutdown-*ordering* lever;
this one fixes the binding *outer-await* deadline.
## Result
aeron-udp cluster systems get enough wall-clock time to terminate cleanly on
lower-timefactor virtualized lanes, so the outer `Failed to stop` await no
longer trips. Healthy shutdowns still complete in well under a second, so local
and normal CI runs are unaffected. **Test-only change**: no production
behaviour or binary-compatibility impact.
## Tests
- `sbt "cluster/Test/compile"` — success (cluster test-classes compiled)
- `scalafmt 3.10.7` on both changed files — no reformatting needed
- `git diff --check` — clean
- aeron-udp shutdown timing is timefactor/environment dependent and does not
reproduce on local runs (shutdown completes in <1s), so the timeout widening
was verified by compile and format checks only.
## References
- Follow-up to #2997 (reverse-order cluster shutdown)
- `nightly-builds.yml` `MixedProtocolClusterSpec` (udp) shutdown timeout