Hi all,

Dave and I have been working together to get PostgreSQL functional on
ARM64 with MSVC, and the attached patches accomplish that. Dave is the
author of the first patch, which addresses some build issues and fixes
the spin_delay() semantics; I wrote the second, which fixes the
atomics on this combination.
PostgreSQL compiled with MSVC on the ARM64 architecture, in
particular with optimizations enabled (e.g., /O2), fails
027_stream_regress. After some investigation and analysis of the
generated assembly, Dave Cramer and I identified the root cause as
insufficient memory barrier semantics in both the atomic operations
and the spinlocks on ARM64 when built with MSVC at /O2.
Dave knew I was in the process of setting up a Win11/ARM64/MSVC build
animal and pinged me with this issue. He got me started on the path
to finding it by sending me his workaround:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ static void
WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
* before the data page can be written out. This implements the basic
* WAL rule "write the log before the data".)
*/
+#pragma optimize("",off)
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata,
XLogRecPtr fpw_lsn,
@@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,
return EndPos;
}
-
+#pragma optimize("",on)
/*
This pointed a finger at the atomics, so I started there. We used a few
tools, but worth noting is https://godbolt.org/ where we were able to
quickly see that the MSVC assembly was missing the "dmb" barriers on
this platform. I'm not sure how long this link will be valid, but in
the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn
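For anyone who wants to poke at it, a minimal pair of wrappers along
these lines (the function names are just for illustration) is enough
to compare the generated code in Compiler Explorer: GCC folds the
ordering into the CAS sequence, while MSVC targeting ARM64 emits the
bare exclusive load/store loop with no dmb.

#include <stdbool.h>
#include <stdint.h>

#ifdef _MSC_VER
#include <intrin.h>

/* MSVC/ARM64: compiles without any dmb */
long
cas_msvc(volatile long *p, long expected, long desired)
{
	return _InterlockedCompareExchange(p, desired, expected);
}
#else
/* GCC (and Clang): ordering is built into the operation */
bool
cas_gcc(uint32_t *p, uint32_t *expected, uint32_t desired)
{
	return __atomic_compare_exchange_n(p, expected, desired, false,
									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
#endif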
PROBLEM DESCRIPTION
PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
manifesting as timing-dependent failures in critical sections
protected by spinlocks and atomic variables. The failures are
reproducible when the test suite is compiled with optimization flags
(/O2), particularly in the recovery/027_stream_regress test which
involves WAL replication and standby recovery.
The root cause has two components:
1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64
TECHNICAL ANALYSIS
PART 1: ATOMIC OPERATIONS MEMORY BARRIERS
GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
generates a call to __aarch64_cas4_acq_rel(), which is a library
function that provides explicit acquire-release memory ordering
semantics through either:
* LSE path (modern ARM64): Using CASAL instruction with built-in
memory ordering [1][2]
* Legacy path (older ARM64): Using LDAXR/STLXR instructions with
explicit dmb sy instruction [3]
MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].
PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS
The MSVC spinlock implementation in src/include/storage/s_lock.h had
two issues on ARM64/MSVC:
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
#define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers
The InterlockedCompareExchange intrinsic lacks full memory barrier
semantics on ARM64, identical to the atomic operations issue.
Issue 2: S_UNLOCK() uses only a compiler barrier
_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6]. It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.
For comparison, GCC's __sync_lock_release() emits actual hardware
barriers.
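To make the contrast concrete, here is a sketch of the two release
variants side by side (hypothetical helper names for an ARM64 target,
not the patch itself):

#include <intrin.h>

typedef volatile long slock_t;

/* Old behavior: the compiler will not reorder across this point, but
 * on ARM64 the CPU may still make the store to the lock word visible
 * before the critical section's stores. */
static __forceinline void
s_unlock_compiler_barrier_only(slock_t *lock)
{
	_ReadWriteBarrier();
	*lock = 0;
}

/* Fixed behavior: the dmb orders all prior memory operations before
 * the store that releases the lock. */
static __forceinline void
s_unlock_hardware_barrier(slock_t *lock)
{
	__dmb(_ARM64_BARRIER_SY);
	*lock = 0;
}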
IMPACT ON 027_STREAM_REGRESS
The 027_stream_regress test exercises WAL replication and standby
recovery, which depend heavily on synchronized access to shared
memory protected by spinlocks [7]. Without proper barriers on ARM64:
1. Thread A acquires spinlock (no full barrier emitted)
2. Thread A modifies shared WAL buffer
3. Thread B acquires spinlock before Thread A's writes become visible
4. Thread B reads stale WAL data
5. WAL replication gets corrupted or hangs indefinitely
6. Test times out waiting for standby to catch up
WHY ARM32 AND X86/X64 ARE UNAFFECTED
MSVC's _InterlockedCompareExchange does provide full memory barriers on:
* x86/x64: Memory barriers are implicit in the x86 memory model [8]
* ARM32: MSVC explicitly generates full barriers for ARM32 [5]
Only ARM64 lacks the necessary barriers, making this a platform-specific
issue.
ATTACHED SOLUTION
Add explicit DMB (Data Memory Barrier) instructions before and after
atomic operations and spinlock operations on ARM64 to provide sequential
consistency semantics.
0002: src/include/port/atomics/generic-msvc.h
Added platform-specific DMB macros that expand to
__dmb(_ARM64_BARRIER_SY) on ARM64 (and to no-ops elsewhere); the
resulting pattern is sketched after the list below.
Applied to all six atomic operations:
* pg_atomic_compare_exchange_u32_impl()
* pg_atomic_exchange_u32_impl()
* pg_atomic_fetch_add_u32_impl()
* pg_atomic_compare_exchange_u64_impl()
* pg_atomic_exchange_u64_impl()
* pg_atomic_fetch_add_u64_impl()
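The attached patch is authoritative, but the shape of the change is
roughly the following; PG_DMB is an illustrative name here, and only
the u32 compare-exchange is shown (the other five operations follow
the same pattern):

#if defined(_M_ARM64)
/* Full system barrier on ARM64 */
#define PG_DMB()	__dmb(_ARM64_BARRIER_SY)
#else
/* No-op on x86/x64 and ARM32, where the intrinsic already suffices */
#define PG_DMB()	((void) 0)
#endif

static inline bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
									uint32 *expected, uint32 newval)
{
	bool	ret;
	uint32	current;

	PG_DMB();			/* release half: prior ops complete first */
	current = InterlockedCompareExchange(&ptr->value, newval, *expected);
	PG_DMB();			/* acquire half: later ops wait for the CAS */
	ret = current == *expected;
	*expected = current;
	return ret;
}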
0001: src/include/storage/s_lock.h
Added ARM64-specific spinlock implementation with explicit DMB barriers [9]:
#if defined(_M_ARM64)

#define TAS(lock) tas_msvc_arm64(lock)

static __forceinline int
tas_msvc_arm64(volatile slock_t *lock)
{
	int		result;

	/* Full barrier before the atomic operation */
	__dmb(_ARM64_BARRIER_SY);
	/* Atomic compare-and-swap */
	result = InterlockedCompareExchange(lock, 1, 0);
	/* Full barrier after the atomic operation */
	__dmb(_ARM64_BARRIER_SY);
	return result;
}

#define S_UNLOCK(lock) \
	do { \
		__dmb(_ARM64_BARRIER_SY);	/* Full barrier before release */ \
		(*(lock)) = 0; \
	} while (0)

#else
/* Non-ARM64 MSVC: existing implementation unchanged */
#endif
The spinlock acquire now ensures:
* Before the CAS: all prior memory operations complete before the
  lock is acquired.
* After the CAS: the CAS completes before subsequent operations
  access the protected data.
The spinlock release now ensures:
* Before writing 0: all critical-section operations are visible to
  other threads.
You may ask: why two DMBs in the atomic operations instead of one?
GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:
* LDAXR (Load-Acquire Exclusive) provides half-barrier acquire
semantics [3]
* STLXR (Store-Release Exclusive) provides half-barrier release
semantics [3]
* One final dmb sy upgrades to full sequential consistency
Since _InterlockedCompareExchange provides NO barrier semantics on
ARM64, we must provide both halves ourselves (see the C11 analogy
after this list):
* First DMB acts as a release barrier (ensures prior memory ops
complete before CAS)
* Second DMB acts as an acquire barrier (ensures subsequent memory
ops wait for CAS)
* Together they provide sequential consistency matching GCC's
semantics [3]
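In C11 terms, the construction looks roughly like this; it is only an
analogy (assuming the intrinsic behaves as a relaxed CAS, which is
what we observed), with dmb sy playing the role of a seq_cst fence:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* What GCC's path provides: ordering built into the CAS itself */
static bool
cas_builtin_ordering(_Atomic uint32_t *p, uint32_t *expected,
					 uint32_t desired)
{
	return atomic_compare_exchange_strong_explicit(p, expected, desired,
												   memory_order_seq_cst,
												   memory_order_seq_cst);
}

/* What the patch rebuilds around the MSVC intrinsic: full fences on
 * both sides of an otherwise-unordered CAS */
static bool
cas_fenced(_Atomic uint32_t *p, uint32_t *expected, uint32_t desired)
{
	bool	ok;

	atomic_thread_fence(memory_order_seq_cst);	/* first DMB */
	ok = atomic_compare_exchange_strong_explicit(p, expected, desired,
												 memory_order_relaxed,
												 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* second DMB */
	return ok;
}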
VERIFICATION
The fix has been verified by:
1. Spinlock fix resolves 027_stream_regress timeout: Test now passes
consistently on MSVC ARM64 with /O2 optimization without hanging
2. Assembly code inspection: Confirmed that dmb sy instructions now
appear in the optimized assembly for ARM64 builds
3. Platform compatibility: No regression on x86/x64 or ARM32 (macros
expand to no-ops; original code path unchanged)
WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM
PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without
the memory ordering issues observed on MSVC Windows ARM64. The
difference comes down to how Clang/LLVM and MSVC handle atomic
operations.
CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)
Clang/LLVM supports the GCC-compatible atomic builtins
(__atomic_compare_exchange_n, etc.) even though it is not GCC. The
LLVM backend has an AtomicExpand pass that expands these operations
into sequences with the appropriate memory barriers for the target
architecture.
On ARM64, Clang generates:
* __aarch64_cas4_acq_rel library calls (or the CASAL instruction with
  LSE)
* proper acquire-release semantics built into the instruction sequence
* full dmb sy barriers where needed
This means PostgreSQL's use of __sync_lock_test_and_set and the
__atomic_* builtins works correctly on macOS ARM64 without additional
patches.
Phew... I hope I read all those docs correctly and got that right. Feel
free to let me know if I missed something. Looking forward to your
feedback and review so I can get this new build animal up and running.
best.
-greg
[1] ARM Developer: CAS Instructions
https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-
[2] ARM Developer: Load-Acquire and Store-Release Instructions
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
[3] ARM Developer: Data Memory Barrier (DMB)
https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en
[4] Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions
https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170
[5] Microsoft Learn: ARM Intrinsics - Memory Barriers
https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170
[6] Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier
https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170
[7] PostgreSQL: 027_stream_regress WAL replication testing
https://www.postgresql.org/message-id/[email protected]
[8] Intel Volume 3A: Memory Ordering
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
[9] Microsoft Developer Blog: The AArch64 processor - Barriers
https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968
v1-0001-Address-build-issues-for-ARM64-using-MSVC.patch
Description: Binary data
v1-0002-DMB-barries-fix-memory-ordering-on-MSVC-ARM64.patch
Description: Binary data
