On 08/09/2011 11:12 PM, Mathieu Desnoyers wrote:
> The reader will see index->waiting set to 1, reset it to 0, and
> observe a gp_futex that is not yet -1 (due to reordering of memory
> accesses), and thus return immediately.
You're right. I set gp_futex before index->waiting exactly for this
reason, but I was confused by the uatomic_ prefix on uatomic_set()
(it does not imply a memory barrier) and didn't add the barrier.
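To make the ordering explicit, the writer has to publish gp_futex before it
marks any reader as waiting; that is what the cmm_smp_wmb() in the attached
patch is for (the reader side pairs it with a cmm_smp_mb(), see the snippet
further down). Roughly, the writer side does:

    /* In update_counter_and_wait(), once wait_loops reaches
     * RCU_QS_ACTIVE_ATTEMPTS and we give up busy waiting: */
    uatomic_set(&gp_futex, -1);
    /*
     * Write futex before write waiting; the reader clears waiting and
     * then reads gp_futex, i.e. the opposite order.
     */
    cmm_smp_wmb();
    cds_list_for_each_entry(index, &registry, node) {
            _CMM_STORE_SHARED(index->waiting, 1);
    }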
> Please note that I am extremely careful when doing changes to the wakeup
> code, and I went as far as to create a CPU-level Promela model to make
> sure the wakeups would never be missed under any allowed memory
> access/instruction execution order.
> So adding complexity to this code path for the sake of accelerating
> synchronize_rcu(), which is already considered a slow path, does not
> appear to be an immediate win.
It is not accelerating synchronize_rcu(). It does two things:
1) by using futexes, it avoids burning CPU when a grace period is long.
It is actually effective even if the grace period is _not_ so long:
100 walks through the thread list take less than a millisecond, and you
do not want to call rcu_quiescent_state() that often;
2) once you're always using futexes, if you have frequent quiescent
states in one thread and rarer quiescent states in another, the
former thread will uselessly call FUTEX_WAKE on each quiescent state,
even though it is already in the next grace period; the reader-side
snippet below shows how the patch avoids this.
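For reference, this is what the reader-side fast path looks like with the
per-reader flag from the attached patch: the flag is only set by the writer
once it actually decides to sleep on the futex, so a reader that passes
quiescent states frequently just sees the flag clear and skips the syscall:

    static inline void wake_up_gp(void)
    {
            /* Set by the writer only when it is about to wait on gp_futex. */
            if (unlikely(_CMM_LOAD_SHARED(rcu_reader.waiting))) {
                    _CMM_STORE_SHARED(rcu_reader.waiting, 0);
                    /* Write waiting before read futex. */
                    cmm_smp_mb();
                    if (uatomic_read(&gp_futex) != -1)
                            return;
                    uatomic_set(&gp_futex, 0);
                    futex_noasync(&gp_futex, FUTEX_WAKE, 1,
                                  NULL, NULL, 0);
            }
    }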
> But I might be open to a simpler fix that just changes all the
> -	if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> to
> +	if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
> Because it does not alter the gp_futex and memory barrier semantics.
> You could do that change to both urcu-qsbr.c and urcu.c. I would be
> interested to know if this alone helps performance?
That doesn't work. The first read-side thread that calls FUTEX_WAKE
will set gp_futex to zero, and you will never get a wakeup from a second
read-side thread.
> If we choose to also alter the gp_futex behavior, we'll definitely have
> to update the Promela model and make sure that it can never lead to
> missed wakeups.
Makes sense. Where are the models? I attach a patch with the correct
memory barrier for completeness.
Paolo
From f3afba5236999ca0cfd4a71829ccdc4140676675 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <[email protected]>
Date: Tue, 9 Aug 2011 14:14:12 +0200
Subject: [PATCH] urcu-qsbr: avoid useless futex wakeups and burning CPU for
long grace periods
I noticed that urcu makes exactly _one_ attempt at using futexes
to avoid busy looping in synchronize_rcu(). The attached patch instead
switches from busy waiting to futexes after RCU_QS_ACTIVE_ATTEMPTS.
To limit the number of system calls, reading threads remember whether
they already had a quiescent state in this grace period; if so, they
have already been removed from the list and can avoid signaling the futex.
Performance measured with rcutorture (nreaders: 10, nupdaters: 1,
duration: 10, median of nine runs):
RCU_QS_ACTIVE_ATTEMPTS == 100, no patch      n_updates = 292
RCU_QS_ACTIVE_ATTEMPTS == 1,   no patch      n_updates = 290
RCU_QS_ACTIVE_ATTEMPTS == 100, with patch    n_updates = 408
RCU_QS_ACTIVE_ATTEMPTS == 1,   with patch    n_updates = 404
(the first two cases are effectively the same: the only difference is
when the futex path kicks in, and over many calls that does not matter).
---
urcu-qsbr.c | 17 ++++++++++++-----
urcu/static/urcu-qsbr.h | 7 ++++++-
2 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/urcu-qsbr.c b/urcu-qsbr.c
index a239be0..8d8a9cf 100644
--- a/urcu-qsbr.c
+++ b/urcu-qsbr.c
@@ -150,26 +150,33 @@ static void update_counter_and_wait(void)
*/
for (;;) {
wait_loops++;
- if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
- uatomic_dec(&gp_futex);
+ if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
+ uatomic_set(&gp_futex, -1);
+ /*
+ * Write futex before write waiting (the other side
+ * reads them in the opposite order).
+ */
+ cmm_smp_wmb();
+ cds_list_for_each_entry(index, &registry, node) {
+ _CMM_STORE_SHARED(index->waiting, 1);
+ }
/* Write futex before read reader_gp */
cmm_smp_mb();
}
-
cds_list_for_each_entry_safe(index, tmp, &registry, node) {
if (!rcu_gp_ongoing(&index->ctr))
cds_list_move(&index->node, &qsreaders);
}
if (cds_list_empty(®istry)) {
- if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
+ if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
/* Read reader_gp before write futex */
cmm_smp_mb();
uatomic_set(&gp_futex, 0);
}
break;
} else {
- if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
+ if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
wait_gp();
} else {
#ifndef HAS_INCOHERENT_CACHES
diff --git a/urcu/static/urcu-qsbr.h b/urcu/static/urcu-qsbr.h
index 5b7adac..da2edf6 100644
--- a/urcu/static/urcu-qsbr.h
+++ b/urcu/static/urcu-qsbr.h
@@ -124,6 +124,7 @@ struct rcu_reader {
unsigned long ctr;
/* Data used for registry */
struct cds_list_head node __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
+ int waiting;
pthread_t tid;
};
@@ -136,7 +137,11 @@ extern int32_t gp_futex;
*/
static inline void wake_up_gp(void)
{
- if (unlikely(uatomic_read(&gp_futex) == -1)) {
+ if (unlikely(_CMM_LOAD_SHARED(rcu_reader.waiting))) {
+ _CMM_STORE_SHARED(rcu_reader.waiting, 0);
+ cmm_smp_mb();
+ if (uatomic_read(&gp_futex) != -1)
+ return;
uatomic_set(&gp_futex, 0);
futex_noasync(&gp_futex, FUTEX_WAKE, 1,
NULL, NULL, 0);
--
1.7.6
_______________________________________________
ltt-dev mailing list
[email protected]
http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev