He-Pin opened a new pull request, #2986:
URL: https://github.com/apache/pekko/pull/2986

   ### Motivation
   `GraphInterpreter`'s chase loops dominate hot-path CPU in steady state. JMH 
stack profiling on `InterpreterBenchmark` (`numberOfIds=10`) attributes ~50% of 
stream-related samples to the two `while` loops at `execute:449` and 
`execute:460`, after deep JIT inlining of `processPush` / `onPush` / `push` / 
`grab` / `chasePush`.
   
   Every chase iteration ends with `afterStageHasRun(activeStage)`, which in 
steady state always reads `shutdownCounter(activeStage.stageId)` and the 
per-stage finalized flag, only to discover the stage has not just completed and 
skip the body. That is a per-event array load + null check + branch on the 
hottest path with no work to do, which the JIT cannot fold away because 
`shutdownCounter` is a mutable shared array.
   
   ### Modification
   - Add `pendingFinalization: Boolean` on the interpreter, set when a stage's 
`shutdownCounter` decrements to 0 in `completeConnection`, or transitions to 0 
when `KeepGoing` is cleared in `setKeepGoing`.
   - Gate the three hot-path `afterStageHasRun(activeStage)` call sites in 
`execute()` (post normal-dispatch and the two chase loops) on the flag, 
resetting it to `false` before invoking `afterStageHasRun` so cascaded 
completions during finalization re-arm the flag correctly.
   - The lower-frequency `afterStageHasRun` callers in `init()` and 
`runAsyncInput` are intentionally left untouched — they run once per stage / 
per async event and are not on the hot path.
   
   The semantic invariant is preserved: any path that decrements 
`shutdownCounter` to 0 sets the flag, so any state where 
`isStageCompleted(activeStage)` could newly return true is guaranteed to be 
observed by the next gated call.
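
   The gating described above can be illustrated with a toy model. The sketch below is not Pekko's `GraphInterpreter`: `ToyInterpreter`, `processEvent`, `finalizedCount`, and `afterStageHasRunCalls` are hypothetical names introduced for illustration, and it simplifies by assuming an event can only complete a connection of the stage it runs on. It only shows the shape of the idea: the flag is set where a counter reaches 0, and the hot path checks the flag, clears it, and only then calls `afterStageHasRun`.

```scala
// Toy model only: names mirror the PR description, but this is NOT Pekko's GraphInterpreter.
final class ToyInterpreter(connectionsPerStage: Array[Int]) {
  // Open connections per stage; 0 means the stage is ready to be finalized.
  private val shutdownCounter: Array[Int] = connectionsPerStage.clone()
  private val finalized: Array[Boolean] = new Array[Boolean](connectionsPerStage.length)

  // Armed whenever some counter reaches 0; cleared just before finalization runs.
  private var pendingFinalization: Boolean = false

  // Hypothetical counters, present only so the behaviour can be observed.
  var finalizedCount: Int = 0
  var afterStageHasRunCalls: Int = 0

  // Closing a connection is the only place the counter can reach 0 in this model.
  def completeConnection(stageId: Int): Unit =
    if (shutdownCounter(stageId) > 0) {
      shutdownCounter(stageId) -= 1
      if (shutdownCounter(stageId) == 0) pendingFinalization = true
    }

  // The check that previously ran after every event: array load plus branch.
  private def afterStageHasRun(stageId: Int): Unit = {
    afterStageHasRunCalls += 1
    if (shutdownCounter(stageId) == 0 && !finalized(stageId)) {
      finalized(stageId) = true
      finalizedCount += 1
    }
  }

  // One hot-path event. The flag is reset BEFORE the call so that a completion
  // triggered during finalization would arm it again.
  def processEvent(stageId: Int, completes: Boolean): Unit = {
    if (completes) completeConnection(stageId)
    if (pendingFinalization) {
      pendingFinalization = false
      afterStageHasRun(stageId)
    }
  }
}
```

In this model a steady-state event (`completes = false`) pays for one boolean field read and never touches `shutdownCounter`; the array is read only on the event that arms the flag.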
   
   ### Result
   JMH on `InterpreterBenchmark` (JDK 25, G1, single thread, `-i 5 -wi 3 -f 1 
-t 1`):
   
   ```
   numberOfIds   baseline (ops/ms)    with patch (ops/ms)    delta
   1             45238 ± 3143         50952 ± 4784           +12.6%
   5             10526 ±  151         11242 ±  288            +6.8%   (CIs disjoint)
   10             5350 ±  193          5927 ±  173           +10.8%   (CIs disjoint)
   ```
   
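   `numberOfIds=5` and `numberOfIds=10` show non-overlapping 99.9% confidence 
intervals against the same-tree baseline, so the gain there is unlikely to be 
noise. For `numberOfIds=1` the intervals overlap (baseline upper bound 48381 vs. 
patched lower bound 46168), so the +12.6% at that size is suggestive only. 
Allocation rate stays at ~0.6 B/op with 0 GC counts in the measurement window, 
so there is no GC impact.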
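
   The interval comparison can be reproduced from the table alone. The following is a small sketch, assuming each `score ± error` entry is read as the interval `[score - error, score + error]`; `Score`, `Row`, and `CiCheck` are hypothetical helper names, not part of JMH or Pekko.

```scala
// Hypothetical helpers for re-checking the table; not part of JMH or Pekko.
final case class Score(mean: Double, error: Double) {
  def low: Double = mean - error
  def high: Double = mean + error
}

final case class Row(numberOfIds: Int, baseline: Score, patched: Score)

object CiCheck {
  // Figures copied from the result table (ops/ms).
  val rows: Seq[Row] = Seq(
    Row(1, Score(45238, 3143), Score(50952, 4784)),
    Row(5, Score(10526, 151), Score(11242, 288)),
    Row(10, Score(5350, 193), Score(5927, 173))
  )

  // True when the two intervals share no point.
  def disjoint(a: Score, b: Score): Boolean = a.high < b.low || b.high < a.low

  // Relative change of the patched mean over the baseline mean, in percent.
  def deltaPercent(r: Row): Double = (r.patched.mean / r.baseline.mean - 1.0) * 100.0

  // One summary line per benchmark size.
  def report: Seq[String] = rows.map { r =>
    f"numberOfIds=${r.numberOfIds}%d delta=${deltaPercent(r)}%+.1f%% disjoint=${disjoint(r.baseline, r.patched)}"
  }
}
```

Calling `CiCheck.report.foreach(println)` prints one line per row; only the `numberOfIds=1` row has overlapping intervals.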
   
   ### Tests
   - `sbt 'stream/compile'`
   - `sbt 'stream/mimaReportBinaryIssues'` — clean
   - `sbt 'stream-tests/testOnly *fusing*'` — 159 tests, all passed (covers 
`GraphInterpreterSpec`, `GraphInterpreterPortsSpec`, completion / cancel / fail 
paths)
   - `sbt 'stream-tests/testOnly *Flow*Spec'` — 1208 tests, all passed
   - `sbt 'bench-jmh/Jmh/run -i 5 -wi 3 -f 1 -t 1 .*InterpreterBenchmark.*'` — 
numbers above
   
   ### References
   Refs #2985 — relies on the `InterpreterBenchmark` correctness fix in that PR 
to obtain trustworthy JMH numbers. This branch contains the two commits from 
#2985 plus the optimization commit; if #2985 lands first, this branch will be 
rebased to drop those.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

