On 08/09/2011 11:12 PM, Mathieu Desnoyers wrote:
> The reader will see index->waiting set to 1, reset it to 0, and
> observe a gp_futex that is not yet -1 (due to reordering of memory
> accesses), thus returning immediately.

You're right. I set gp_futex before index->waiting exactly for this reason, but I was confused by the uatomic name in uatomic_set and didn't add the memory barrier.

> Please note that I am extremely careful when doing changes to the wakeup
> code, and I went as far as to create a CPU-level promela model to make
> sure the wakeups would never be missed in any allowed memory
> access/instruction execution orders.
>
> So adding complexity to this code path for the sake of accelerating
> synchronize_rcu(), which is already considered a slow path, does not
> appear to be an immediate win.

It is not accelerating synchronize_rcu().  It does two things:

1) by using futexes, it avoids burning CPU when a grace period is long. It is actually effective even if the grace period is _not_ so long: 100 walks through the thread list take less than a millisecond, and you do not want to call rcu_quiescent_state() that often;

2) once futexes are always in use, if one thread has frequent quiescent states and another has rarer ones, the former thread will uselessly call FUTEX_WAKE on every quiescent state, even though it is already in the next grace period.

> But I might be open to a simpler fix that just changes all the
>
> -                     if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
>
> to
>
> +                     if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
>
> Because it does not alter the gp_futex and memory barrier semantics.
> You could do that change to both urcu-qsbr.c and urcu.c. I would be
> interested to know if this alone helps performance.

That doesn't work. The first read-side thread that calls FUTEX_WAKE will set gp_futex to zero, and you will never get a wakeup from a second read-side thread.

> If we choose to also alter the gp_futex behavior, we'll definitely have
> to update the Promela model and make sure that it can never lead to
> missed wakeups.

Makes sense. Where are the models? For completeness, I'm attaching a patch with the correct memory barrier.

Paolo
From f3afba5236999ca0cfd4a71829ccdc4140676675 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <[email protected]>
Date: Tue, 9 Aug 2011 14:14:12 +0200
Subject: [PATCH] urcu-qsbr: avoid useless futex wakeups and burning CPU for
 long grace periods

I noticed that urcu makes exactly _one_ attempt at using futexes
to avoid busy looping on synchronize_rcu.  The attached patch instead
switches from busy waiting to futexes after RCU_QS_ACTIVE_ATTEMPTS.
To limit the amount of system calls, reading threads remember whether
they already had a quiescent state in this grace period; if so they were
already removed from the list, and can avoid signaling the futex.

Performance measured with rcutorture (nreaders: 10, nupdaters: 1,
duration: 10, median of nine runs):

     RCU_QS_ACTIVE_ATTEMPTS == 100, no patch         n_updates = 292
     RCU_QS_ACTIVE_ATTEMPTS == 1, no patch           n_updates = 290
     RCU_QS_ACTIVE_ATTEMPTS == 100, with patch       n_updates = 408
     RCU_QS_ACTIVE_ATTEMPTS == 1, with patch         n_updates = 404

(the first two cases are obviously the same; the only change is
when the futex is used, but over many calls there is no difference).
---
 urcu-qsbr.c             |   17 ++++++++++++-----
 urcu/static/urcu-qsbr.h |    7 ++++++-
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/urcu-qsbr.c b/urcu-qsbr.c
index a239be0..8d8a9cf 100644
--- a/urcu-qsbr.c
+++ b/urcu-qsbr.c
@@ -150,26 +150,33 @@ static void update_counter_and_wait(void)
 	 */
 	for (;;) {
 		wait_loops++;
-		if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
-			uatomic_dec(&gp_futex);
+		if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
+			uatomic_set(&gp_futex, -1);
+			/*
+			 * Write futex before write waiting (the other side
+			 * reads them in the opposite order).
+			 */
+			cmm_smp_wmb();
+			cds_list_for_each_entry(index, &registry, node) {
+				_CMM_STORE_SHARED(index->waiting, 1);
+			}
 			/* Write futex before read reader_gp */
 			cmm_smp_mb();
 		}
-
 		cds_list_for_each_entry_safe(index, tmp, &registry, node) {
 			if (!rcu_gp_ongoing(&index->ctr))
 				cds_list_move(&index->node, &qsreaders);
 		}
 
 		if (cds_list_empty(&registry)) {
-			if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
+			if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
 				/* Read reader_gp before write futex */
 				cmm_smp_mb();
 				uatomic_set(&gp_futex, 0);
 			}
 			break;
 		} else {
-			if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
+			if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
 				wait_gp();
 			} else {
 #ifndef HAS_INCOHERENT_CACHES
diff --git a/urcu/static/urcu-qsbr.h b/urcu/static/urcu-qsbr.h
index 5b7adac..da2edf6 100644
--- a/urcu/static/urcu-qsbr.h
+++ b/urcu/static/urcu-qsbr.h
@@ -124,6 +124,7 @@ struct rcu_reader {
 	unsigned long ctr;
 	/* Data used for registry */
 	struct cds_list_head node __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
+	int waiting;
 	pthread_t tid;
 };
 
@@ -136,7 +137,11 @@ extern int32_t gp_futex;
  */
 static inline void wake_up_gp(void)
 {
-	if (unlikely(uatomic_read(&gp_futex) == -1)) {
+	if (unlikely(_CMM_LOAD_SHARED(rcu_reader.waiting))) {
+		_CMM_STORE_SHARED(rcu_reader.waiting, 0);
+		cmm_smp_mb();
+		if (uatomic_read(&gp_futex) != -1)
+			return;
 		uatomic_set(&gp_futex, 0);
 		futex_noasync(&gp_futex, FUTEX_WAKE, 1,
 		      NULL, NULL, 0);
-- 
1.7.6

_______________________________________________
ltt-dev mailing list
[email protected]
http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
