On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas <[email protected]> wrote:
> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > ------------------------------------------------------------
> >
> > * Spin-locks are known to have a significant effect on performance
> > as scalability increases.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> > use of TAS (test-and-set)
> >
> > * TAS is implemented on ARM as load-store, so even if the lock is not
> > free, the store operation will execute to replace the same value.
> > This redundant operation (mainly the store) is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check, which means that
> > if the lock is not free, the check after the load causes the loop to
> > return, thereby saving the costlier store operation. [1]
>
> Can you add some code comments to explain that why CAS is cheaper than
> TAS on ARM?
>
1. As Alexey also pointed out in a follow-up email:

CAS = load value -> check if the value is as expected -> if yes, replace it
(store the new value) -> else exit/break
TAS = load value -> store the new value

This means each TAS always executes two operations, a LOAD and a STORE, and
the STORE is of course the costlier one in terms of latency. CAS optimizes
this with an early check: if the loaded value is not the expected one, the
store is skipped entirely.
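Expressed with the same GCC intrinsics that the micro-benchmark attached
below uses, the two spin-loops reduce to something like this (a minimal
sketch for illustration; the lock_tas/lock_cas names are mine, not the
patch's):

    #include <stdint.h>

    static uint32_t lock_unit = 0;

    /* TAS: the swap, and hence the atomic store, executes on every
     * iteration, even while the lock is held by another thread. */
    static void lock_tas(void)
    {
        while (__sync_lock_test_and_set(&lock_unit, 1))
            ;                   /* spin */
    }

    /* CAS: when the check fails, the store is skipped entirely; only the
     * load and the compare run.  'expected' must be reset on each retry
     * because the intrinsic overwrites it with the observed value on
     * failure. */
    static void lock_cas(void)
    {
        uint32_t expected = 0;
        while (!__atomic_compare_exchange_n(&lock_unit, &expected, 1, false,
                                            __ATOMIC_ACQUIRE,
                                            __ATOMIC_ACQUIRE))
            expected = 0;       /* spin */
    }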
Let's look at some micro-benchmarking. I implemented a simple spin-loop
using both approaches, and as expected, with increasing scalability CAS
continues to outperform TAS by a margin of up to 50%.
parallelism    TAS (s)     CAS (s)
        128    1.34271     1.00131
        256     3.6487     2.53202
        512    11.3725     7.66829
       1024    43.5376    22.6294
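In relative terms, CAS cuts the elapsed time by roughly 25% at 128 threads,
~31% at 256, ~33% at 512, and ~48% at 1024.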
This can also be observed in the perf profile:
TAS:
15.57 │44: ldxr w0, [x19]
83.93 │ stxr w1, w21, [x19]
CAS:
81.29 │58: ↓ b.ne cc
....
9.86 │cc: ldaxr w0, [x22]
8.84 │ cmp w0, #0x0
│ ↑ b.ne 58
│ stxr w1, w20, [x22]
*In TAS, the STORE (stxr) is clearly the dominant cost.*
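(Both variants compile to ARM's exclusive load/store pair, as the listing
shows. The difference is that in the CAS loop the cmp/b.ne after the ldaxr
sends the loop back to retry before the stxr is ever reached when the lock
is busy, while the TAS loop issues the stxr unconditionally after every
ldxr.)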
2. I have added the needed comment in the patch. Updated patch attached.
----------------------
Thanks for taking a look at this, and do let me know if any more info is
needed.
>
> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.
>
> - Heikki
>
--
Regards,
Krunal Bauskar
#include <iostream>
#include <thread>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <zlib.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

const size_t data_block_size = 64 * 1024;
char data_block[64 * 1024];
size_t k_iter = 100;
unsigned long global_crcval = 0;

/* ---------------- spin-lock ---------------- */
/* Build with -DCAS or -DTAS to select the implementation. */
uint32_t lock_unit;

void lock()
{
    while (true) {
        uint32_t expected = 0;
#ifdef CAS
        if (!__atomic_compare_exchange_n(&lock_unit, &expected, 1, false,
                                         __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
#elif TAS
        if (__sync_lock_test_and_set(&lock_unit, 1)) {
#endif
            /* Spin and retry: the lock is held by some other thread. */
            __asm__ __volatile__("" ::: "memory");
            continue;
        }
        break;
    }
}

void unlock()
{
#ifdef CAS
    __atomic_store_n(&lock_unit, 0, __ATOMIC_RELEASE);
#else
    __sync_lock_release(&lock_unit);
#endif
}

/* ---------------- workload ---------------- */
void workload_execute()
{
    /* Each thread repeatedly takes the lock, executes the critical
       section, and unlocks. */
    for (size_t i = 0; i < k_iter; ++i) {
        lock();
        memset(data_block, rand() % 255, data_block_size);
        unsigned long crcval = 0;
        crcval = crc32(crcval, (const unsigned char *) data_block,
                       data_block_size);
        global_crcval += crcval;
        unlock();
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        std::cerr << "usage: <program> <number-of-threads/parallelism>"
                  << std::endl;
        return 1;
    }
    size_t num_of_threads = atol(argv[1]);
    std::thread *handles[num_of_threads];

    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_of_threads; ++i) {
        handles[i] = new std::thread(workload_execute);
    }
    for (size_t i = 0; i < num_of_threads; ++i) {
        handles[i]->join();
    }
    auto finish = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < num_of_threads; ++i) {
        delete handles[i];
    }

    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
}
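For reference, the benchmark above can be built and run along these lines
(assuming the attachment is saved as spin_bench.cpp; the exact optimization
flags are incidental):

    g++ -O2 -DTAS spin_bench.cpp -o spin_bench -lz -lpthread
    ./spin_bench 128

Defining -DCAS instead selects the compare-and-swap path.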
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6fb3..5110de37d4 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,33 @@ tas(volatile slock_t *lock)
* than other widths.
*/
#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC__SYNC_INT32_TAS
+
+#ifdef HAVE_GCC__ATOMIC_INT32_CAS
+/* Reuse the same flag to avoid re-declaring the default tas functions below. */
+#define HAS_TEST_AND_SET
+
+/*
+ * On ARM, compare-and-swap is implemented as load -> check if value =
+ * expected -> if yes: store, if no: break; whereas swap is simply
+ * load -> store.  Swap therefore always executes the store, even
+ * when it merely replaces the value with itself.  The atomic store
+ * is the costlier operation in terms of latency, so the early check
+ * done by compare-and-swap avoids the redundant store whenever the
+ * lock is not free.
+ */
+#define TAS(lock) cas(lock)
+typedef int slock_t;
+
+static __inline__ int
+cas(volatile slock_t *lock)
+{
+ slock_t expected = 0;
+ return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
+ false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
+}
+
+#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE)
+#elif defined(HAVE_GCC__SYNC_INT32_TAS)
#define HAS_TEST_AND_SET
#define TAS(lock) tas(lock)