He-Pin opened a new pull request, #3007:
URL: https://github.com/apache/pekko/pull/3007

   ### Motivation
   
   JDK 25 nightly stream tests hang for the full test timeout (the recurring 
failures behind #2573 / #2870). A local reproduction — the full `stream-tests` 
run with the nightly `virtualize=on` + `timefactor=4` options on JDK 25 — pins 
it down:
   
   - one `...-pekko.test.stream-dispatcher-CarrierThread-N` consumes **~97% 
CPU** (cpu time ≈ elapsed time) stuck in `AbstractNodeQueue.pollNode`,
   - every other carrier is idle in `ForkJoinPool.awaitWork`,
   - a full virtual-thread dump shows **no producer thread anywhere**.
   
   The spinning consumer is a virtual thread, so the unbounded CPU spin pins 
its carrier permanently; the stream never progresses and the test's 
`futureValue` never completes. Every affected test passes in isolation (~100ms) 
even with the full nightly JVM options — the hang only appears under sustained 
load, because it is a JIT-state-dependent data race.
   
   ### Root cause
   
   PR #1990 (*avoid sun.misc.Unsafe by using VarHandles*) mapped the **producer 
writes** correctly (`Unsafe.putOrderedObject` → `VarHandle.setRelease`) but 
**downgraded every consumer read** from `Unsafe.getObjectVolatile` (a 
volatile/acquire load) to `VarHandle.get` — which has **plain** memory 
semantics even when the field is declared `volatile`.
   
   A plain read is not ordered against the producer's release store, so it 
establishes no happens-before with the published node. Inside the busy-spin 
loops in `peekNode`/`pollNode`:
   
   ```java
   do { next = tail.next(); } while (next == null);
   ```
   
   the JIT may hoist the plain load out of the loop, producing an unbounded 
spin that never observes the linked next node. JDK 25's C2 makes this manifest 
reliably, and virtual-thread carriers turn the transient spin into a permanent 
100% CPU pin.
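
The pairing described above can be sketched in isolation. This is a minimal illustration, not Pekko's code: `AcquireSpinDemo`, its `Node` class, and `awaitNextValue` are hypothetical names. The consumer re-reads the link with `getAcquire` on every iteration, which pairs with the producer's `setRelease`; replacing it with `NEXT.get(tail)` would make the read plain and let the JIT treat it as loop-invariant.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class AcquireSpinDemo {
    // Hypothetical stand-in for a queue node; not Pekko's actual class.
    static final class Node {
        int value;          // plain payload, published by the release store below
        volatile Node next; // volatile field, yet VarHandle.get() would still read it plainly

        static final VarHandle NEXT;
        static {
            try {
                NEXT = MethodHandles.lookup().findVarHandle(Node.class, "next", Node.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
    }

    // Consumer side: spin until the producer has linked a successor.
    static int awaitNextValue(Node tail) {
        Node next;
        // The acquire load is performed on every iteration and cannot be hoisted.
        while ((next = (Node) Node.NEXT.getAcquire(tail)) == null) {
            Thread.onSpinWait(); // hint to the CPU that this is a busy-wait
        }
        // The acquire load observed the release store, so next.value is visible.
        return next.value;
    }

    static int run() {
        Node tail = new Node();
        Thread producer = new Thread(() -> {
            Node n = new Node();
            n.value = 42;                  // plain write to the payload
            Node.NEXT.setRelease(tail, n); // release store publishes the node
        });
        producer.start();
        int seen = awaitNextValue(tail);
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 42
    }
}
```

Whether the plain-read variant actually hangs depends on JIT state, so this sketch shows only the correct form rather than attempting to reproduce the hang.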
   
   A release store paired with an acquire load is the same memory ordering that 
lock-free MPSC queues such as JCTools use (consumer-side `lvNext` / load-acquire 
on the next pointer); the original Unsafe code matched it, and #1990 
inadvertently broke it.
   
   ### Modification
   
   - `Node.next()` and the four `tailHandle` reads (`peekNode`, `pollNode`, 
`isEmpty`, `count`) now use `getAcquire`, restoring the acquire ordering that 
the pre-#1990 `getObjectVolatile` reads provided and pairing with the existing 
`setRelease` writes.
   - Added `Thread.onSpinWait()` to both busy-spin loops as standard spin-wait 
hygiene.
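
To make the shape of the change concrete, here is a minimal multi-producer, single-consumer node queue written in the same style. It is a sketch under stated assumptions, not the actual `AbstractNodeQueue` source: the class name `SketchNodeQueue`, its fields, and the `run` driver are all hypothetical. It shows the two pieces this PR touches: acquire loads on the consumer side and `Thread.onSpinWait()` in the busy-spin loop.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicReference;

public class SketchNodeQueue<T> {
    static final class Node<T> {
        T value;
        volatile Node<T> next;
        Node(T value) { this.value = value; }
    }

    private static final VarHandle NEXT;
    private static final VarHandle TAIL;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            NEXT = l.findVarHandle(Node.class, "next", Node.class);
            TAIL = l.findVarHandle(SketchNodeQueue.class, "tail", Node.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final AtomicReference<Node<T>> head; // producers swap here
    private volatile Node<T> tail;               // only the consumer advances this

    public SketchNodeQueue() {
        Node<T> stub = new Node<>(null);
        head = new AtomicReference<>(stub);
        tail = stub;
    }

    // Producer side: swap the head, then publish the link with a release store.
    public void add(T value) {
        Node<T> n = new Node<>(value);
        Node<T> prev = head.getAndSet(n);
        NEXT.setRelease(prev, n);
    }

    // Consumer side: every read of tail and next is an acquire load.
    @SuppressWarnings("unchecked")
    public T poll() {
        Node<T> t = (Node<T>) TAIL.getAcquire(this);
        Node<T> next = (Node<T>) NEXT.getAcquire(t);
        if (next == null && head.get() != t) {
            // A producer has swapped head but not linked its node yet: spin.
            do {
                Thread.onSpinWait();
                next = (Node<T>) NEXT.getAcquire(t);
            } while (next == null);
        }
        if (next == null) return null; // queue is empty
        T v = next.value;
        next.value = null;             // next becomes the new stub node
        TAIL.setRelease(this, next);
        return v;
    }

    // Hypothetical driver: 4 producers, 1 consumer; returns the sum of all items.
    static long run() {
        SketchNodeQueue<Integer> q = new SketchNodeQueue<>();
        int producers = 4, perProducer = 1000;
        for (int p = 0; p < producers; p++) {
            new Thread(() -> {
                for (int i = 0; i < perProducer; i++) q.add(i);
            }).start();
        }
        long sum = 0;
        int received = 0;
        while (received < producers * perProducer) {
            Integer v = q.poll();
            if (v == null) { Thread.onSpinWait(); continue; }
            sum += v;
            received++;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 1998000
    }
}
```

The window between `head.getAndSet(n)` and `NEXT.setRelease(prev, n)` is why the consumer has to spin at all: the queue is known to be non-empty, but the link is not yet visible. That is the loop in which a plain read could be hoisted.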
   
   ### Performance
   
   This **restores** the pre-#1990 memory semantics rather than adding new cost:
   
   | Arch | plain `get` (since #1990) | `getAcquire` (this PR / original Unsafe `getObjectVolatile`) |
   |------|---------------------------|-------------------------------------------------------------|
   | x86-64 | `MOV` | `MOV` (all x86 loads already have acquire semantics); **zero difference** |
   | AArch64 | `LDR` | `LDAR` (single instruction); **same as pre-#1990** |
   
   Net effect versus the original Unsafe-based design is zero on x86-64 and 
negligible on AArch64; it only removes the broken plain-read micro-optimization 
the VarHandle migration introduced. Method signatures are unchanged → no 
binary-compatibility impact.
   
   ### Result
   
   With the fix, the previously hanging `HubSpec "work with long streams if one 
of the producers is slower"` completes in ~2.7s (was stuck for the full 
timeout), and the full `stream-tests` run proceeds past the point where it 
previously hung (1800+ tests, no hang) under the same nightly `virtualize=on` + 
`timefactor=4` JVM options on JDK 25.
   
   ### Tests
   
   - `sbt actor/compile` succeeds.
   - Local full `stream-tests` run with nightly JVM options (`virtualize=on`, 
`minimum-runnable=8`, `timefactor=4`) on JDK 25.0.1 no longer hangs at HubSpec 
and proceeds normally; the specific previously-hanging test now passes in ~2.7s.
   - Signatures unchanged, so MiMa is unaffected.
   
   ### References
   
   - https://github.com/apache/pekko/issues/2870
   - https://github.com/apache/pekko/issues/2573
   - Regression introduced in #1990


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

