Re: Question about CAS via LDREX/STREX on 32-bit arm

Reingruber, Richard Mon, 13 Mar 2023 03:26:11 -0700

> This whole coding lives in MacroAssembler::atomic_cas_bool, but that function
> is not always called. There are plenty of direct usages of ldrex+strex. There
> is even a condition argument for MacroAssembler::cas_for_lock_acquire to
> either do strex directly or to dive down into atomic_cas_bool.

The check of the strex result is not local in these cases.
MacroAssembler::cas_for_lock_acquire checks here: 
https://github.com/openjdk/jdk/blob/25e7ac226a3be9c064c0a65c398a8165596150f7/src/hotspot/cpu/arm/macroAssembler_arm.cpp#L1195
Failure handling is done in the slow_case.
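
As a rough portable analogy (not the HotSpot code), a one-shot LDREX/STREX attempt behaves like C++'s compare_exchange_weak: it may fail even though the value matched, which is acceptable as long as the caller falls back to a path that settles the outcome. The sketch below assumes a hypothetical lock word (0 = unlocked, 1 = locked); LockWord, try_lock_one_shot and lock_slow_case are illustrative names, and the real slow_case does considerably more than a retrying CAS.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical lock word for illustration: 0 = unlocked, 1 = locked.
struct LockWord {
  std::atomic<uint32_t> state{0};
  int slow_path_entries = 0;  // counts how often the fallback ran
};

// One attempt only, like a raw ldrex+strex pair without a retry loop.
// compare_exchange_weak is allowed to fail spuriously, so a "false"
// result does not prove that the lock was held by someone else.
bool try_lock_one_shot(LockWord& l) {
  uint32_t expected = 0;
  return l.state.compare_exchange_weak(expected, 1,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed);
}

// Stand-in for the slow case (an assumption, not the real slow path):
// compare_exchange_strong never fails spuriously, so its answer is final.
bool lock_slow_case(LockWord& l) {
  ++l.slow_path_entries;
  uint32_t expected = 0;
  return l.state.compare_exchange_strong(expected, 1,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
}

// Fast path first; any failure, spurious or real, goes to the slow case.
bool lock(LockWord& l) {
  return try_lock_one_shot(l) || lock_slow_case(l);
}

void example_lock() {
  LockWord l;
  assert(lock(l));   // acquired, via the fast path or the fallback
  assert(!lock(l));  // already held: both paths report failure
}
```

The point of the design is that the non-looping fast path never has to tell a lost reservation apart from a real mismatch; both simply end up in the fallback.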

From: Thomas Stüfe <thomas.stu...@gmail.com>
Date: Monday, 13. March 2023 at 11:07
To: Reingruber, Richard <richard.reingru...@sap.com>
Cc: porters-dev@openjdk.org <porters-dev@openjdk.org>, 
aarch32-port-...@openjdk.org <aarch32-port-...@openjdk.org>
Subject: Re: Question about CAS via LDREX/STREX on 32-bit arm
This whole coding lives in MacroAssembler::atomic_cas_bool, but that function 
is not always called. There are plenty of direct usages of ldrex+strex. There 
is even a condition argument for MacroAssembler::cas_for_lock_acquire to either 
do strex directly or to dive down into atomic_cas_bool.

On Mon, Mar 13, 2023 at 11:03 AM Reingruber, Richard 
<richard.reingru...@sap.com> wrote:
> Yes, I read it the same way. So we repeat the CAS for lost reservation. I'm
> interested in when this could happen and why it would be okay to sometimes
> omit this loop and do the "raw" LDREX-STREX sequence. See my original mail.

Hm, I don't understand. The loop in the sequence A-F is always there. How is it 
omitted?

From: Thomas Stüfe <thomas.stu...@gmail.com>
Date: Monday, 13. March 2023 at 10:54
To: Reingruber, Richard <richard.reingru...@sap.com>
Cc: porters-dev@openjdk.org <porters-dev@openjdk.org>, 
aarch32-port-...@openjdk.org <aarch32-port-...@openjdk.org>
Subject: Re: Question about CAS via LDREX/STREX on 32-bit arm
Hi Richard :)

On Mon, Mar 13, 2023 at 10:02 AM Reingruber, Richard 
<richard.reingru...@sap.com> wrote:
> Hi ARM experts,

Hi Thomas, not at all an ARM expert... :)
but I think I understand the code.

> I am trying to understand how CAS is implemented on arm; in particular, 
> "MacroAssembler::atomic_cas_bool":

> MacroAssembler::atomic_cas_bool

> ```
>     assert_different_registers(tmp_reg, oldval, newval, base);
>     Label loop;
>     bind(loop);
> A   ldrex(tmp_reg, Address(base, offset));
> B   subs(tmp_reg, tmp_reg, oldval);
> C   strex(tmp_reg, newval, Address(base, offset), eq);
> D   cmp(tmp_reg, 1, eq);
> E   b(loop, eq);
> F   cmp(tmp_reg, 0);
>     if (tmpreg == noreg) {
>       pop(tmp_reg);
>     }
> ```

> It uses LDREX and STREX to perform a cas of *(base+offset) from oldval to 
> newval. It does so in a loop. The code distinguishes two failures: STREX 
> failing, and a "semantically failed" CAS.

> Here is what I think this code does:

> A) LDREX: tmp=*(base+offset)
> B) tmp -= oldvalue
>    If *(base+offset) was unchanged, tmp_reg is now 0 and Z is 1
> C) If Z is 1: STREX the new value: *(base+offset)=newval. Otherwise, omit.
>    After this, if the store succeeded, tmp_reg is 0; if the store failed, 
> it's 1.
> D) Here, tmp_reg is: 0 if the store succeeded, 1 if it failed, 1...n if 
> *(base+offset) had been modified before LDREX.
>    We now compare with 1 and ...
> E) ...repeat the loop if tmp_reg was 1

> So we loop until either *(base+offset) has been changed to some other value 
> concurrently before our LDREX, or until our store succeeds.

> I wondered what the loop guards against. And why it would be okay sometimes 
> to omit it.

The loop is needed to try again if the reservation was lost until the STREX 
succeeds or *(base+offset) != oldvalue.

So there are two cases. The loop is left iff

(1) *(base+offset) != oldvalue
(2) the STREX succeeded

First it is important to understand that C, D, E are only executed if at B the 
eq-condition is set to true.
This is based on the "Conditional Execution" feature of ARM: execution of most 
instructions can be made dependent on a condition (see 
https://developer.arm.com/documentation/den0013/d/ARM-Thumb-Unified-Assembly-Language-Instructions/Instruction-set-basics/Conditional-execution?lang=en)

So in case (1) C, D, E are not executed because B indicates that *(base+offset) 
and oldvalue are not-eq and the loop is left.

Ah, thanks, that was my thinking error. I did not realize that CMP was also 
conditional. I assumed the "eq" in the CMP (D) was a condition for the CMP. 
Which makes no sense, as I know now, since CMP just does a sub and only needs 
two arguments. So that meant the full branch CDE was controlled from the 
subtraction result at B.

That also resolves the "1 has a double meaning" question. It hasn't.

In case (2) C, D, E are executed. At C, if the reservation from A still exists, 
tmp_reg will be set to 0 otherwise to 1. At E the branch is taken if D 
indicated tmp_reg == 1 (reservation was lost) otherwise the loop is left.
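
The two exit conditions can be checked with a small software model of steps A to F. This is only a sketch under stated assumptions: Word, model_ldrex, model_strex and atomic_cas_bool_model are hypothetical names, the exclusive monitor is reduced to one flag, and lost reservations are injected through a counter instead of coming from real interference.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of one memory word plus a local exclusive monitor.
struct Word {
  uint32_t value;
  bool reserved;         // set by the load-exclusive, cleared by the store
  int lose_reservation;  // number of upcoming store attempts forced to fail
  explicit Word(uint32_t v) : value(v), reserved(false), lose_reservation(0) {}
};

struct CasResult {
  bool success;    // true iff newval was stored
  int iterations;  // how often the loop body ran
};

uint32_t model_ldrex(Word& w) {
  w.reserved = true;
  return w.value;
}

// Returns 0 on success and 1 on failure, like the STREX status result.
uint32_t model_strex(Word& w, uint32_t newval) {
  if (w.lose_reservation > 0) {  // injected interference
    --w.lose_reservation;
    w.reserved = false;
  }
  bool ok = w.reserved;
  w.reserved = false;  // the store attempt ends the reservation either way
  if (ok) w.value = newval;
  return ok ? 0u : 1u;
}

CasResult atomic_cas_bool_model(Word& w, uint32_t oldval, uint32_t newval) {
  int iterations = 0;
  uint32_t tmp;
  bool z;  // models the Z flag
  do {
    ++iterations;
    tmp = model_ldrex(w);             // A
    tmp -= oldval;                    // B: subs sets Z iff the values match
    z = (tmp == 0);
    if (z) {                          // C, D and E all carry the eq condition
      tmp = model_strex(w, newval);   // C
      z = (tmp == 1);                 // D: cmp tmp_reg, 1
    }
  } while (z);                        // E: branch back only if Z is set
  CasResult r = { tmp == 0, iterations };  // F: cmp tmp_reg, 0
  return r;
}

void example_cas() {
  Word w(6);  // oldval + 1: tmp is 1 after B, but Z is clear, so no retry
  CasResult r = atomic_cas_bool_model(w, 5, 9);
  assert(!r.success && r.iterations == 1);
}
```

In this model a word holding oldval + 1 leaves the loop after one pass, because D is skipped whenever B found a mismatch; only a failed store sets Z at D and triggers another iteration.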

Yes, I read it the same way. So we repeat the CAS for lost reservation. I'm 
interested in when this could happen and why it would be okay to sometimes omit 
this loop and do the "raw" LDREX-STREX sequence. See my original mail.

I suspect it has something to do with context switches. That the kernel does a 
CLREX when we switch, so if we switch between LDREX and STREX, the reservation 
could be lost. But why would it then be okay to ignore this sometimes?

Thanks!

Thomas

Cheers, Richard.

From: porters-dev <porters-dev-r...@openjdk.org> on behalf of Thomas Stüfe 
<thomas.stu...@gmail.com>
Date: Saturday, 11. March 2023 at 11:19
To: porters-dev@openjdk.org <porters-dev@openjdk.org>, 
aarch32-port-...@openjdk.org <aarch32-port-...@openjdk.org>
Subject: Question about CAS via LDREX/STREX on 32-bit arm
Hi ARM experts,

I am trying to understand how CAS is implemented on arm; in particular, 
"MacroAssembler::atomic_cas_bool":

MacroAssembler::atomic_cas_bool

```
    assert_different_registers(tmp_reg, oldval, newval, base);
    Label loop;
    bind(loop);
A   ldrex(tmp_reg, Address(base, offset));
B   subs(tmp_reg, tmp_reg, oldval);
C   strex(tmp_reg, newval, Address(base, offset), eq);
D   cmp(tmp_reg, 1, eq);
E   b(loop, eq);
F   cmp(tmp_reg, 0);
    if (tmpreg == noreg) {
      pop(tmp_reg);
    }
```

It uses LDREX and STREX to perform a cas of *(base+offset) from oldval to 
newval. It does so in a loop. The code distinguishes two failures: STREX 
failing, and a "semantically failed" CAS.
Here is what I think this code does:

A) LDREX: tmp=*(base+offset)
B) tmp -= oldvalue
   If *(base+offset) was unchanged, tmp_reg is now 0 and Z is 1
C) If Z is 1: STREX the new value: *(base+offset)=newval. Otherwise, omit.
   After this, if the store succeeded, tmp_reg is 0; if the store failed, it's 1.
D) Here, tmp_reg is: 0 if the store succeeded, 1 if it failed, 1...n if 
*(base+offset) had been modified before LDREX.
   We now compare with 1 and ...
E) ...repeat the loop if tmp_reg was 1

So we loop until either *(base+offset) has been changed to some other value 
concurrently before our LDREX, or until our store succeeds.

I wondered what the loop guards against. And why it would be okay sometimes to 
omit it.

IIUC, STREX fails if the core lost its exclusive access to the memory 
location since the LDREX. This can be one of three things, right?
- another core slipped in an LDREX+STREX to the same location between our LDREX 
and STREX
- Or we context switched to another thread or process. I assume it does a CLREX 
then, right? Because how could you prevent a sequence like "LDREX(p1) -> switch 
-> LDREX(p2) -> switch back -> STREX(p1)" - if I understand the ARM manual [1] 
correctly, a STREX to a different location than the preceding LDREX is 
undefined.
- Or we had a signal after LDREX and did a second LDREX in the signal handler. 
Does the kernel do a CLREX when invoking a signal handler?

More questions:

- If I got it right, at (D), tmp_reg value "1" has two meanings: either STREX 
failed or some thread increased the value concurrently by 1. We repeat the loop 
either way. Is this just accepted behavior? Increasing by 1 is maybe not that 
rare.

- If I understood this correctly, the loop guards us mainly against context 
switches. Without the loop a context switch would count as a "semantically 
failed" CAS. Why would that be okay? Should we not do this loop always?

- Do we not need an explicit CLREX after the operation? Or does the STREX also 
clear the hardware monitor? Or does it just not matter?

- We have VM_Version::supports_ldrex(). Code seems to me sometimes guarded by 
this (e.g. MacroAssembler::atomic_cas_bool), sometimes code just executes 
ldrex/strex (e.g. the one-shot path of MacroAssembler::cas_for_lock_acquire). 
Am I mistaken? Or is LDREX now generally available? Does ARMv6 mean STREX and 
LDREX are available?

Thanks a lot!

Cheers, Thomas
