(pekko) 01/01: fix: pair AbstractNodeQueue next read with acquire semantics

hepin Fri, 29 May 2026 02:53:22 -0700

This is an automated email from the ASF dual-hosted git repository.

He-Pin pushed a commit to branch fix/jdk25-nodequeue-acquire-spin
in repository https://gitbox.apache.org/repos/asf/pekko.git


commit 358f53ad427ac9b8cbf92b9caba1ecfd67276595
Author: He-Pin <[email protected]>
AuthorDate: Fri May 29 17:52:18 2026 +0800

    fix: pair AbstractNodeQueue next read with acquire semantics
    
    Motivation:
    JDK 25 nightly stream tests hang for the full test timeout. A local
    reproduction (full stream-tests with the nightly virtualize=on +
    timefactor=4 options) shows one
    `...-pekko.test.stream-dispatcher-CarrierThread-N` consuming ~97% CPU
    (CPU time approximately equal to elapsed time) stuck in
    `AbstractNodeQueue.pollNode`, while every other carrier is idle in
    `ForkJoinPool.awaitWork` and a full virtual-thread dump shows no
    producer thread anywhere. The spinning consumer is a virtual thread,
    so the unbounded CPU spin pins its carrier permanently; the stream
    never progresses and the test's `futureValue` never completes.
    
    Root cause: the MPSC queue publishes the linked node via
    `Node.setNext` = `nextHandle.setRelease(...)` (release store), but the
    consumer reads `Node.next()` = `nextHandle.get(...)`, which is a plain
    load — `VarHandle.get` has plain memory semantics even though the
    field is declared `volatile`. A plain read is not ordered against the
    producer's release store, so it establishes no happens-before with the
    published node, and inside the busy-spin loops in `peekNode`/`pollNode`
    (`do { next = tail.next(); } while (next == null);`) the JIT may hoist
    the plain load out of the loop. The result is an unbounded spin that
    never observes the linked next node. JDK 25's C2 makes this manifest
    reliably, and virtual-thread carriers turn the transient spin into a
    permanent 100% CPU pin.
    
    Modification:
    - `Node.next()` now uses `nextHandle.getAcquire(this)`, pairing with
      the `setRelease` in `setNext`. This establishes the missing
      happens-before and prevents the JIT from hoisting the load out of
      the spin loops.
    - Added `Thread.onSpinWait()` to both busy-spin loops (`peekNode`,
      `pollNode`) as standard spin-wait hygiene.
    
    Method signatures are unchanged, so there is no binary-compatibility
    impact.
    
    Result:
    With the fix, the previously hanging
    `HubSpec "work with long streams if one of the producers is slower"`
    completes in ~2.7s (was stuck for the full timeout) and the full
    stream-tests run proceeds past the point where it previously hung,
    under the same nightly virtualize=on + timefactor=4 JVM options on
    JDK 25.
    
    References: https://github.com/apache/pekko/issues/2870
---
 .../java/org/apache/pekko/dispatch/AbstractNodeQueue.java     | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/actor/src/main/java/org/apache/pekko/dispatch/AbstractNodeQueue.java b/actor/src/main/java/org/apache/pekko/dispatch/AbstractNodeQueue.java
index ac8921e155..b339d1e1f3 100644
--- a/actor/src/main/java/org/apache/pekko/dispatch/AbstractNodeQueue.java
+++ b/actor/src/main/java/org/apache/pekko/dispatch/AbstractNodeQueue.java
@@ -60,6 +60,7 @@ public abstract class AbstractNodeQueue<T> extends AtomicReference<AbstractNodeQ
             // if tail != head this is not going to change until producer makes progress
             // we can avoid reading the head and just spin on next until it shows up
             do {
+                Thread.onSpinWait();
                 next = tail.next();
             } while (next == null);
         }
@@ -168,6 +169,7 @@ public abstract class AbstractNodeQueue<T> extends AtomicReference<AbstractNodeQ
           // if tail != head this is not going to change until producer makes progress
           // we can avoid reading the head and just spin on next until it shows up
           do {
+              Thread.onSpinWait();
               next = tail.next();
           } while (next == null);
       }
@@ -208,7 +210,14 @@ public abstract class AbstractNodeQueue<T> extends AtomicReference<AbstractNodeQ
         }
 
         public final Node<T> next() {
-            return (Node<T>) nextHandle.get(this);
+            // Acquire load to pair with the release store in setNext. A plain read here
+            // (VarHandle.get has plain semantics even though the field is volatile) is not
+            // ordered against the producer's setRelease, so it establishes no happens-before
+            // with the published node and, inside the busy-spin loops in peekNode/pollNode,
+            // can be hoisted out of the loop by the JIT, producing an unbounded spin that
+            // never observes the linked next node. This was observed on JDK 25 where such a
+            // spin pinned a virtual-thread carrier at 100% CPU and stalled the stream.
+            return (Node<T>) nextHandle.getAcquire(this);
         }
 
         protected final void setNext(final Node<T> newNext) {
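
The acquire/release pairing in the patch above can be sketched in isolation as follows. This is a minimal illustration, not the Pekko implementation: the class `AcquireSpinSketch`, its nested `Node`, and the helpers `awaitNext` and `demo` are hypothetical names introduced here, and the queue is reduced to a single link between two nodes. Only the `VarHandle` access modes (`getAcquire`, `setRelease`) and `Thread.onSpinWait()` mirror the change.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class AcquireSpinSketch {

    // Hypothetical stand-in for AbstractNodeQueue.Node, reduced to the link field.
    static final class Node<T> {
        private static final VarHandle NEXT;

        static {
            try {
                NEXT = MethodHandles.lookup().findVarHandle(Node.class, "next", Node.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        final T value;

        @SuppressWarnings("unused") // accessed only through the VarHandle
        private volatile Node<T> next;

        Node(T value) {
            this.value = value;
        }

        // Consumer side: acquire load, pairs with the release store in setNext.
        @SuppressWarnings("unchecked")
        Node<T> next() {
            return (Node<T>) NEXT.getAcquire(this);
        }

        // Producer side: release store that publishes the linked node.
        void setNext(Node<T> newNext) {
            NEXT.setRelease(this, newNext);
        }
    }

    // Same loop shape as in peekNode/pollNode: spin until the producer links a node.
    static <T> Node<T> awaitNext(Node<T> tail) {
        Node<T> next;
        do {
            Thread.onSpinWait(); // hint to the runtime that this is a busy-wait
            next = tail.next();
        } while (next == null);
        return next;
    }

    // Links a second node from another thread and returns the value the spinner saw.
    static String demo() {
        Node<String> tail = new Node<>("first");
        Thread producer = new Thread(() -> tail.setNext(new Node<>("second")));
        producer.start();
        Node<String> linked = awaitNext(tail);
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return linked.value;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "second"
    }
}
```

Because the consumer's load is an acquire, everything the producer wrote before its `setRelease` (here, the final `value` field of the new node) is visible once the loop exits. The sketch does not try to reproduce the JDK 25 hang; it only shows the access-mode pairing in a form that runs on any JDK with `VarHandle` support.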


