On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov <o...@redhat.com> wrote:
> 
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely, even if the "write" section has not actually
> > started or has already finished.
> > 
> > With this patch down_write/up_write does synchronize_sched() twice
> > and down_read/up_read are still possible during this time; they just
> > use the slow path.
> > 
> > percpu_down_write() first forces the readers to use the rw_semaphore
> > and increment the "slow" counter when they take the lock for reading;
> > then it takes that rw_semaphore for writing and blocks the readers.
> > 
> > Also, with this patch the code relies on the documented behaviour of
> > synchronize_sched(); it doesn't try to pair synchronize_sched() with
> > a barrier.
> > 
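For reference, the intended usage looks something like this (a sketch only;
the example_* names below are made up for illustration and are not part of
the patch):

#include <linux/percpu-rwsem.h>

/* Illustration only: one semaphore protecting some read-mostly state. */
static struct percpu_rw_semaphore example_rwsem;

static int example_setup(void)
{
        /* allocates the per-CPU fast counter; can fail with -ENOMEM */
        return percpu_init_rwsem(&example_rwsem);
}

static void example_reader(void)
{
        percpu_down_read(&example_rwsem);       /* usually just a per-CPU add */
        /* ... read-side section; percpu_down_write() is excluded here ... */
        percpu_up_read(&example_rwsem);
}

static void example_writer(void)
{
        percpu_down_write(&example_rwsem);      /* waits for all readers */
        /* ... write-side section; all readers are blocked ... */
        percpu_up_write(&example_rwsem);
}

static void example_teardown(void)
{
        percpu_free_rwsem(&example_rwsem);
}
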
> > ...
> >
> >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> >  lib/Makefile                 |    2 +-
> >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> 
> The patch also uninlines everything.
> 
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent.  We can export things later if needed, I guess.
> 
> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?
> 
> >
> > ...
> >
> > --- /dev/null
> > +++ b/lib/percpu-rwsem.c
> > @@ -0,0 +1,123 @@
> 
> That was nice and terse ;)
> 
> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/sched.h>
> 
> This list is nowhere near sufficient to support this file's
> requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> more.  IOW, if it compiles, it was sheer luck.
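i.e., presumably something closer to this (a sketch; the exact minimal set
depends on what linux/percpu-rwsem.h already pulls in):

#include <linux/percpu-rwsem.h>
#include <linux/atomic.h>
#include <linux/rwsem.h>
#include <linux/percpu.h>
#include <linux/mutex.h>
#include <linux/preempt.h>
#include <linux/wait.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/errno.h>
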
> 
> > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > +{
> > +   brw->fast_read_ctr = alloc_percpu(int);
> > +   if (unlikely(!brw->fast_read_ctr))
> > +           return -ENOMEM;
> > +
> > +   mutex_init(&brw->writer_mutex);
> > +   init_rwsem(&brw->rw_sem);
> > +   atomic_set(&brw->slow_read_ctr, 0);
> > +   init_waitqueue_head(&brw->write_waitq);
> > +   return 0;
> > +}
> > +
> > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > +{
> > +   free_percpu(brw->fast_read_ctr);
> > +   brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > +}
> > +
> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > +{
> > +   bool success = false;
> > +
> > +   preempt_disable();
> > +   if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > +           __this_cpu_add(*brw->fast_read_ctr, val);
> > +           success = true;
> > +   }
> > +   preempt_enable();
> > +
> > +   return success;
> > +}
> > +
> > +/*
> > + * Like the normal down_read() this is not recursive, the writer can
> > + * come after the first percpu_down_read() and create the deadlock.
> > + */
> > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > +{
> > +   if (likely(update_fast_ctr(brw, +1)))
> > +           return;
> > +
> > +   down_read(&brw->rw_sem);
> > +   atomic_inc(&brw->slow_read_ctr);
> > +   up_read(&brw->rw_sem);
> > +}
> > +
> > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > +{
> > +   if (likely(update_fast_ctr(brw, -1)))
> > +           return;
> > +
> > +   /* false-positive is possible but harmless */
> > +   if (atomic_dec_and_test(&brw->slow_read_ctr))
> > +           wake_up_all(&brw->write_waitq);
> > +}
> > +
> > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > +{
> > +   unsigned int sum = 0;
> > +   int cpu;
> > +
> > +   for_each_possible_cpu(cpu) {
> > +           sum += per_cpu(*brw->fast_read_ctr, cpu);
> > +           per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > +   }
> > +
> > +   return sum;
> > +}
> > +
> > +/*
> > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > + * update_fast_ctr().
> > + *
> > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > + * counter it represents the number of active readers.
> > + *
> > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > + * then waits until the slow counter becomes zero.
> > + */
> 
> Some overview of how fast/slow_read_ctr are supposed to work would be
> useful.  This comment seems to assume that the reader already knew
> that.
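FWIW, the intended relation between the two counters can be stated with a
hypothetical helper (nr_active_readers() is not in the patch, it is only
here to spell out the invariant):

/*
 * At any instant the number of readers inside their critical sections is
 * sum_cpu(fast_read_ctr) + slow_read_ctr.  Readers normally touch only
 * this CPU's fast_read_ctr; while ->writer_mutex is held they fall back
 * to slow_read_ctr under ->rw_sem.
 */
static int nr_active_readers(struct percpu_rw_semaphore *brw)
{
        int sum = atomic_read(&brw->slow_read_ctr);
        int cpu;

        for_each_possible_cpu(cpu)
                sum += per_cpu(*brw->fast_read_ctr, cpu);

        return sum;
}

Once the writer holds ->writer_mutex and synchronize_sched() has returned,
nobody modifies fast_read_ctr any more, so clear_fast_ctr() can fold the
per-CPU sum into slow_read_ctr and percpu_down_write() only has to wait for
slow_read_ctr to reach zero.
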
> 
> > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > +{
> > +   /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > +   mutex_lock(&brw->writer_mutex);
> > +
> > +   /*
> > +    * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > +    *    so that update_fast_ctr() can't succeed.
> > +    *
> > +    * 2. Ensures we see the result of every previous this_cpu_add() in
> > +    *    update_fast_ctr().
> > +    *
> > +    * 3. Ensures that if any reader has exited its critical section via
> > +    *    fast-path, it executes a full memory barrier before we return.
> > +    */
> > +   synchronize_sched();
> 
> Here's where I get horridly confused.  Your patch completely deRCUifies
> this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> be using it not as an RCU primitive but as a handy thing which happens
> to have desirable side-effects.  But the implementation of
> synchronize_sched() differs considerably according to which rcu
> flavor-of-the-minute you're using.

The trick is that the preempt_disable() call in update_fast_ctr()
acts as an RCU read-side critical section WRT synchronize_sched().

The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
synchronize_rcu() in place of preempt_disable()/preempt_enable() and
synchronize_sched().  The real-time guys would prefer the change
to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
you mention it.

Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
and synchronize_rcu()?
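
Something along these lines is what I have in mind (sketch only, untested):

static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
{
        bool success = false;

        rcu_read_lock();        /* was preempt_disable() */
        if (likely(!mutex_is_locked(&brw->writer_mutex))) {
                /*
                 * this_cpu_add() rather than __this_cpu_add():
                 * rcu_read_lock() need not disable preemption under
                 * CONFIG_PREEMPT_RCU.
                 */
                this_cpu_add(*brw->fast_read_ctr, val);
                success = true;
        }
        rcu_read_unlock();      /* was preempt_enable() */

        return success;
}

with the two synchronize_sched() calls in percpu_down_write() and
percpu_up_write() becoming synchronize_rcu().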

                                                        Thanx, Paul

> And part 3 talks about the reader's critical section.  The only
> critical sections I can see on the reader side are already covered by
> mutex_lock() and preempt_disable().
> 
> I get this feeling I don't have clue what's going on here and I think
> I'll just retire hurt now.  If this code isn't as brain damaged as it
> initially appears then please, go easy on us simpletons in the next
> version?
> 
> > +   /* nobody can use fast_read_ctr, move its sum into slow_read_ctr */
> > +   atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> > +
> > +   /* block the new readers completely */
> > +   down_write(&brw->rw_sem);
> > +
> > +   /* wait for all readers to complete their percpu_up_read() */
> > +   wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
> > +}
> > +
> > +void percpu_up_write(struct percpu_rw_semaphore *brw)
> > +{
> > +   /* allow the new readers, but only the slow-path */
> > +   up_write(&brw->rw_sem);
> > +
> > +   /* insert the barrier before the next fast-path in down_read */
> > +   synchronize_sched();
> > +
> > +   mutex_unlock(&brw->writer_mutex);
> > +}
> 
