On Wed, 10 Jun 2026 17:35:43 GMT, Mat Carter <[email protected]> wrote:

> With -UseLSE the Starvation test on Windows ARM64 times out close to 100% of 
> the time, whereas this only happens on Linux ARM64 on larger machines with 
> many cores. The issue is that C2 outputs an LDR following the CAS in 
> LinkedTransferQueue which can execute before the STLXR, breaking the Dekker 
> protocol.
> 
> Replacing the LDR with LDAR by using getAcquire solves the issue as it won't 
> be reordered before the STLXR.  This does impact the +UseLSE case as the LDAR 
> was not necessary and is slightly more expensive than LDR.  But handling 
> this case would require larger changes to HotSpot.
> 
> Starvation test passes on Windows ARM64 and Linux ARM64, with no regressions 
> on tier1
> 
> ---------
> - [x] I confirm that I make this contribution in accordance with the [OpenJDK 
> Interim AI Policy](https://openjdk.org/legal/ai).

OK, so we have the following `cmpxchg`:

```c++
// UseLSE
mov     result, expected
casal   result, new_val, [addr]
cmp     result, expected

// !UseLSE
prfm    pstl1strm, [addr]
retry:
ldaxr   result, [addr]
cmp     result, expected
b.ne    done
stlxr   rscratch1, new_val, [addr]
cbnz    rscratch1, retry
done:
```

Well, in the `!UseLSE` case an `ldr` after `done:` may move up past the 
`stlxr`: a store-release orders earlier accesses before the store, but it 
doesn't keep later loads behind it. OK, sure, so the load may float. As far as 
I know, the same is true for `casal`, in that its store effect is not 
necessarily performed before the `ldr`.

Alright, so what's up with park and unpark then?


```c++
DualNode::park:
  this.waiter = Thread::currentThread();
  FullFence();
  yield;

DualNode::unpark:
  bool success = CAS(p, m, e); // src, expected, new_value
  if (!success) {
    Thread w = p.waiter;
  }
```

I assume that the failed CAS signals that there is a waiter, but since the read 
of `p.waiter` isn't ordered with the CAS, it may read null, in which case the 
waiting thread is never unparked. Yeah, so this code really needs the 
load-acquire, as this is a Dekker-style thingie.
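
To make the Dekker shape concrete, here is a minimal C++ sketch of the handshake. It is not the JDK code: `Node`, `would_block`, `sees_waiter` and `count_lost_wakeups` are made-up names, a `bool` stands in for the waiter thread, and a full fence is used on both sides because that is the straightforward way to get store-to-load ordering in portable C++ (the patch itself uses `getAcquire` on the Java side instead). With both fences in place, the lost wake-up outcome is ruled out by the C++ memory model; drop either fence and it is allowed again.

```c++
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a queue node; names are illustrative only.
struct Node {
  std::atomic<int>  item{0};        // 0 = not yet matched, 1 = matched
  std::atomic<bool> waiter{false};  // stands in for the `waiter` thread field
};

// Waiting side: publish the waiter, full fence, then re-check the item.
// Returns true if the thread would go on to block.
bool would_block(Node& n) {
  n.waiter.store(true, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return n.item.load(std::memory_order_relaxed) == 0;
}

// Matching side: CAS the item, full fence, then read the waiter.
// Returns true if it observes a waiter that it has to wake.
bool sees_waiter(Node& n) {
  int expected = 0;
  n.item.compare_exchange_strong(expected, 1, std::memory_order_acq_rel);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return n.waiter.load(std::memory_order_relaxed);
}

// Runs the race repeatedly and counts lost wake-ups: the waiter decided to
// block, yet the matcher saw no waiter to wake.
int count_lost_wakeups(int iterations) {
  assert(iterations > 0);
  int lost = 0;
  for (int i = 0; i < iterations; ++i) {
    Node n;
    bool blocks = false, wakes = false;
    std::thread t0([&] { blocks = would_block(n); });
    std::thread t1([&] { wakes = sees_waiter(n); });
    t0.join();
    t1.join();
    if (blocks && !wakes) ++lost;
  }
  return lost;
}
```

Whichever fence comes first in the single total order of `seq_cst` fences, the other thread's later load has to see the earlier thread's write, so at least one side always notices the other.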

I'm pretty sure that this is broken in the `UseLSE` case as well, it's just 
less likely to observe the borked behavior.

The code is generated by:

```c++
void MacroAssembler::cmpxchg(Register addr, Register expected,
                             Register new_val,
                             enum operand_size size,
                             bool acquire, bool release,
                             bool weak,
                             Register result) {
  if (result == noreg)  result = rscratch1;
  BLOCK_COMMENT("cmpxchg {");
  if (UseLSE) {
    mov(result, expected);
    lse_cas(result, new_val, addr, size, acquire, release, /*not_pair*/ true);
    compare_eq(result, expected, size);
#ifdef ASSERT
    // Poison rscratch1 which is written on !UseLSE branch
    mov(rscratch1, 0x1f1f1f1f1f1f1f1f);
#endif
  } else {
    Label retry_load, done;
    prfm(Address(addr), PSTL1STRM);
    bind(retry_load);
    load_exclusive(result, addr, size, acquire);
    compare_eq(result, expected, size);
    br(Assembler::NE, done);
    store_exclusive(rscratch1, new_val, addr, size, release);
    if (weak) {
      cmpw(rscratch1, 0u);  // If the store fails, return NE to our caller.
    } else {
      cbnzw(rscratch1, retry_load);
    }
    bind(done);
  }
  BLOCK_COMMENT("} cmpxchg");
}
```
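
For readers less used to the `weak` flag: in the `!UseLSE` path a weak CAS reports a failed store-exclusive to the caller instead of looping, so the caller owns the retry. A rough analogy in standard C++, assuming nothing about how HotSpot's callers are actually written (`cas_strong` and `cas_weak_retry` are hypothetical helpers):

```c++
#include <atomic>
#include <cassert>

// Strong CAS: fails only if the current value differs from `expected`.
// Returns the value that was observed, like the `result` register above.
int cas_strong(std::atomic<int>& cell, int expected, int new_val) {
  cell.compare_exchange_strong(expected, new_val,
                               std::memory_order_acq_rel,
                               std::memory_order_acquire);
  return expected;  // unchanged on success, the observed value on failure
}

// Weak CAS may fail spuriously, so the retry loop lives in the caller.
// Returns true if the swap happened, false on a genuine mismatch.
bool cas_weak_retry(std::atomic<int>& cell, int expected, int new_val) {
  int observed = expected;
  while (!cell.compare_exchange_weak(observed, new_val,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire)) {
    if (observed != expected) return false;  // value really differs
    assert(observed == expected);            // spurious failure, go again
  }
  return true;
}
```

Neither variant orders a later plain load after the store half of the CAS, so the choice between them has no bearing on the reordering discussed here.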

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31465#issuecomment-4710767471
