* Paolo Bonzini ([email protected]) wrote:
> I noticed that urcu makes exactly _one_ attempt at using futexes
> to avoid busy looping on synchronize_rcu.  The attached patch instead
> switches from busy waiting to futexes after RCU_QS_ACTIVE_ATTEMPTS.
> To limit the number of system calls, reading threads remember whether
> they already had a quiescent state in this grace period; if so, they
> have already been removed from the list and can avoid signaling the
> futex.
> 
> Performance measured with rcutorture (nreaders: 10, nupdaters: 1,
> duration: 10, median of nine runs):
> 
>      RCU_QS_ACTIVE_ATTEMPTS == 100, no patch         n_updates = 302
>      RCU_QS_ACTIVE_ATTEMPTS == 1, no patch           n_updates = 290
>      RCU_QS_ACTIVE_ATTEMPTS == 100, with patch       n_updates = 408
>      RCU_QS_ACTIVE_ATTEMPTS == 1, with patch         n_updates = 404
> 
> The first two cases are obviously the same; the only change is
> when the futex is used, but over many calls there is no difference.
> 
> Signed-off-by: Paolo Bonzini  <[email protected]>
> ---
>  urcu-qsbr.c             |   12 +++++++-----
>  urcu/static/urcu-qsbr.h |    6 +++++-
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/urcu-qsbr.c b/urcu-qsbr.c
> index a239be0..64a95e7 100644
> --- a/urcu-qsbr.c
> +++ b/urcu-qsbr.c
> @@ -150,26 +150,28 @@ static void update_counter_and_wait(void)
>        */
>       for (;;) {
>               wait_loops++;

Hrm, this seems buggy.

> -             if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> -                     uatomic_dec(&gp_futex);
> +             if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {

Given there are no memory barriers between
  uatomic_set(&gp_futex, -1);
  _CMM_STORE_SHARED(index->waiting, 1);

these two stores can very well become visible in the reverse order. So
a reader may observe index->waiting set to 1 before it observes
gp_futex set to -1; see below,

> +                     uatomic_set(&gp_futex, -1);
> +                     cds_list_for_each_entry(index, &registry, node) {
> +                             _CMM_STORE_SHARED(index->waiting, 1);
> +                     }
>                       /* Write futex before read reader_gp */
>                       cmm_smp_mb();
>               }
> -
>               cds_list_for_each_entry_safe(index, tmp, &registry, node) {
>                       if (!rcu_gp_ongoing(&index->ctr))
>                               cds_list_move(&index->node, &qsreaders);
>               }
>  
>               if (cds_list_empty(&registry)) {
> -                     if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> +                     if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
>                               /* Read reader_gp before write futex */
>                               cmm_smp_mb();
>                               uatomic_set(&gp_futex, 0);
>                       }
>                       break;
>               } else {
> -                     if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {
> +                     if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
>                               wait_gp();
>                       } else {
>  #ifndef HAS_INCOHERENT_CACHES
> diff --git a/urcu/static/urcu-qsbr.h b/urcu/static/urcu-qsbr.h
> index 5b7adac..b2c5fbf 100644
> --- a/urcu/static/urcu-qsbr.h
> +++ b/urcu/static/urcu-qsbr.h
> @@ -124,6 +124,7 @@ struct rcu_reader {
>       unsigned long ctr;
>       /* Data used for registry */
>       struct cds_list_head node __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
> +     int waiting;
>       pthread_t tid;
>  };
>  
> @@ -136,7 +137,10 @@ extern int32_t gp_futex;
>   */
>  static inline void wake_up_gp(void)
>  {
> -     if (unlikely(uatomic_read(&gp_futex) == -1)) {

The reader will see index->waiting set to 1, reset it to 0, and observe
a gp_futex that is not -1 (yet, because the stores were reordered), and
thus return immediately. The next time wake_up_gp() is called during
the same update_counter_and_wait() execution, it will see
rcu_reader.waiting set to 0, because the reader itself already reset
it. We therefore miss a wakeup, which can potentially delay grace
periods indefinitely.
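
To make the interleaving concrete, here is a minimal standalone C11
sketch of the hazard (hypothetical code, not part of urcu: the futex
and the flag are modeled as plain relaxed atomics, which is effectively
what the missing barrier between the two stores amounts to):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int gp_futex;	/* models gp_futex           */
static atomic_int waiting;	/* models rcu_reader.waiting */
static atomic_int woken;	/* reader issued the wakeup? */

static void *writer(void *arg)
{
	(void)arg;
	/*
	 * No barrier between the two stores: the C11 model (and weakly
	 * ordered hardware) allows the reader to observe them in
	 * either order.
	 */
	atomic_store_explicit(&gp_futex, -1, memory_order_relaxed);
	atomic_store_explicit(&waiting, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;
	/* Models the patched wake_up_gp(). */
	if (atomic_load_explicit(&waiting, memory_order_relaxed)) {
		atomic_store_explicit(&waiting, 0, memory_order_relaxed);
		if (atomic_load_explicit(&gp_futex,
					 memory_order_relaxed) != -1)
			return NULL;	/* wakeup skipped */
		atomic_store_explicit(&gp_futex, 0, memory_order_relaxed);
		atomic_store_explicit(&woken, 1, memory_order_relaxed);
		/* the real code would FUTEX_WAKE here */
	}
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		pthread_t w, r;

		atomic_store(&gp_futex, 0);
		atomic_store(&waiting, 0);
		atomic_store(&woken, 0);
		pthread_create(&w, NULL, writer, NULL);
		pthread_create(&r, NULL, reader, NULL);
		pthread_join(w, NULL);
		pthread_join(r, NULL);
		/* waiting was consumed, yet nobody woke the futex */
		if (!atomic_load(&waiting) && !atomic_load(&woken)
		    && atomic_load(&gp_futex) == -1) {
			printf("missed wakeup on trial %d\n", i);
			return 1;
		}
	}
	printf("no missed wakeup observed (which proves nothing)\n");
	return 0;
}

Note that x86 (TSO) does not reorder two stores, so this may never
trigger on x86 hardware (only a compiler reordering of the relaxed
stores could expose it there), while POWER and ARM can reorder the
stores at runtime. This is also why testing alone is not conclusive
for this kind of bug.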

Please note that I am extremely careful when making changes to the
wakeup code; I went as far as creating a CPU-level Promela model to
make sure the wakeups can never be missed under any allowed
memory-access and instruction execution order.

So adding complexity to this code path for the sake of accelerating
synchronize_rcu(), which is already considered a slow path, does not
look like an immediate win.

But I might be open to a simpler fix that just changes all the

-                     if (wait_loops == RCU_QS_ACTIVE_ATTEMPTS) {

to

+                     if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {

because it does not alter the gp_futex and memory barrier semantics.
You could apply that change to both urcu-qsbr.c and urcu.c. I would be
interested to know whether this alone helps performance.
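
For reference, here is a sketch of what the wait loop in urcu-qsbr.c
would look like with only that change applied (reconstructed from the
hunk quoted above, everything else left as in the current tree):

	for (;;) {
		wait_loops++;
		if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
			uatomic_dec(&gp_futex);
			/* Write futex before read reader_gp */
			cmm_smp_mb();
		}

		cds_list_for_each_entry_safe(index, tmp, &registry, node) {
			if (!rcu_gp_ongoing(&index->ctr))
				cds_list_move(&index->node, &qsreaders);
		}

		if (cds_list_empty(&registry)) {
			if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
				/* Read reader_gp before write futex */
				cmm_smp_mb();
				uatomic_set(&gp_futex, 0);
			}
			break;
		} else {
			if (wait_loops >= RCU_QS_ACTIVE_ATTEMPTS) {
				wait_gp();
			} else {
				/* busy-wait path (HAS_INCOHERENT_CACHES
				 * handling), unchanged */
			}
		}
	}

Only the comparisons change; the uatomic_dec()/uatomic_set() pair and
the memory barriers stay exactly where they are today.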

If we choose to also alter the gp_futex behavior, we'll definitely have
to update the Promela model and make sure it can never lead to missed
wakeups.

Thanks,

Mathieu

> +     if (unlikely(_CMM_LOAD_SHARED(rcu_reader.waiting))) {
> +             _CMM_STORE_SHARED(rcu_reader.waiting, 0);
> +             if (uatomic_read(&gp_futex) != -1)
> +                     return;
>               uatomic_set(&gp_futex, 0);
>               futex_noasync(&gp_futex, FUTEX_WAKE, 1,
>                     NULL, NULL, 0);
> -- 
> 1.7.6

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
