He-Pin opened a new pull request, #2986: URL: https://github.com/apache/pekko/pull/2986
### Motivation

`GraphInterpreter`'s chase loops dominate hot-path CPU in steady state. JMH stack profiling on `InterpreterBenchmark` (`numberOfIds=10`) attributes ~50% of stream-related samples to the two `while` loops at `execute:449` and `execute:460`, after deep JIT inlining of `processPush` / `onPush` / `push` / `grab` / `chasePush`.

Every chase iteration ends with `afterStageHasRun(activeStage)`, which in steady state always reads `shutdownCounter(activeStage.stageId)` and the per-stage finalized flag, only to discover that the stage has not just completed and skip the body. That is a per-event array load + null check + branch on the hottest path with no work to do, which the JIT cannot fold away because `shutdownCounter` is a mutable shared array.

### Modification

- Add `pendingFinalization: Boolean` on the interpreter, set when a stage's `shutdownCounter` decrements to 0 in `completeConnection`, or transitions to 0 when `KeepGoing` is cleared in `setKeepGoing`.
- Gate the three hot-path `afterStageHasRun(activeStage)` call sites in `execute()` (post normal-dispatch and the two chase loops) on the flag, resetting it to `false` before invoking `afterStageHasRun` so cascaded completions during finalization re-arm the flag correctly.
- The lower-frequency `afterStageHasRun` callers in `init()` and `runAsyncInput` are intentionally left untouched: they run once per stage / per async event and are not on the hot path.

The semantic invariant is preserved: any path that decrements `shutdownCounter` to 0 sets the flag, so any state where `isStageCompleted(activeStage)` could newly return true is guaranteed to be observed by the next gated call.
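For readers who want the gating idea in isolation, here is a minimal sketch in Scala. It is a toy model, not the actual `GraphInterpreter` code: the class `ToyInterpreter`, the `processEvent` method, the `finalizationChecks` counter and all signatures are hypothetical. Only the names `shutdownCounter`, `pendingFinalization`, `completeConnection` and `afterStageHasRun` are borrowed from the description above. The model also assumes a stage only ever closes its own connections and leaves out `KeepGoing`.

```scala
// Toy model only: NOT Pekko's GraphInterpreter. Names mirror the PR text,
// but the structure and signatures are assumptions made for illustration.
final class ToyInterpreter(connectionsPerStage: Array[Int]) {
  // Open connections remaining per stage; 0 means the stage can be finalized.
  private val shutdownCounter: Array[Int] = connectionsPerStage.clone()
  private val finalized: Array[Boolean] = new Array[Boolean](connectionsPerStage.length)

  // Armed whenever some counter reaches 0; cleared by the gated call site.
  private var pendingFinalization: Boolean = false

  // Instrumentation for this example: how often the slow check actually ran.
  var finalizationChecks: Int = 0

  def isFinalized(stageId: Int): Boolean = finalized(stageId)

  // In this model, closing a connection is the only way a counter reaches 0.
  private def completeConnection(stageId: Int): Unit =
    if (shutdownCounter(stageId) > 0) {
      shutdownCounter(stageId) -= 1
      if (shutdownCounter(stageId) == 0) pendingFinalization = true
    }

  // The check that used to run after every event: array load plus branch.
  private def afterStageHasRun(stageId: Int): Unit = {
    finalizationChecks += 1
    if (shutdownCounter(stageId) == 0 && !finalized(stageId)) finalized(stageId) = true
  }

  // One hot-path event: run the (simulated) handler, then finalize only if armed.
  def processEvent(stageId: Int, closesConnection: Boolean): Unit = {
    if (closesConnection) completeConnection(stageId)
    if (pendingFinalization) {
      pendingFinalization = false // reset first so cascaded completions can re-arm it
      afterStageHasRun(stageId)
    }
  }
}
```

In this sketch, a stage with two connections that processes many ordinary events never enters `afterStageHasRun`; the check runs once, on the event that closes the last connection. The steady-state cost is a single read of a boolean field instead of an array load and comparison.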
### Result

JMH on `InterpreterBenchmark` (JDK 25, G1, single thread, `-i 5 -wi 3 -f 1 -t 1`):

```
numberOfIds   baseline (ops/ms)   with patch (ops/ms)   delta
1             45238 ± 3143        50952 ± 4784          +12.6%
5             10526 ± 151         11242 ± 288           +6.8%  (CIs disjoint)
10            5350 ± 193          5927 ± 173            +10.8% (CIs disjoint)
```

`numberOfIds=5` and `=10` show non-overlapping 99.9% confidence intervals vs the same-tree baseline, so the gain is not noise. Allocation rate stays at ~0.6 B/op (0 GC counts in the measurement window), so there is no GC impact.

### Tests

- `sbt 'stream/compile'`
- `sbt 'stream/mimaReportBinaryIssues'`: clean
- `sbt 'stream-tests/testOnly *fusing*'`: 159 tests, all passed (covers `GraphInterpreterSpec`, `GraphInterpreterPortsSpec`, completion / cancel / fail paths)
- `sbt 'stream-tests/testOnly *Flow*Spec'`: 1208 tests, all passed
- `sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 .*InterpreterBenchmark.*'`: numbers above

### References

Refs #2985. This PR relies on the `InterpreterBenchmark` correctness fix in that PR to obtain trustworthy JMH numbers. This branch contains the two commits from #2985 plus the optimization commit; if #2985 lands first, this branch will be rebased to drop those.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
