Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-24 Thread Christoph Lameter
On Fri, 24 Oct 2014, Peter Zijlstra wrote:

> The hold time isn't relevant; in fact, breaking up the mmap_sem such that
> we require multiple acquisitions will just increase the cacheline
> bouncing.

Well, this won't be happening anymore once you RCUify the stuff. If you
go to sleep then it's best to release mmap_sem, and then the bouncing
won't matter.

Dropping mmap_sem there will also expose you to races that you will see
later anyway when you RCUify the code paths. That way those can be dealt
with beforehand.

> Also, I think it makes more sense to complete an entire fault operation,
> including blocking, if at all possible. Every retry will just waste more
> time.

OK, then don't retry. Just drop mmap_sem before going to sleep. When you
come back, reevaluate the situation; if we can proceed, do so, otherwise
retry.


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-24 Thread Peter Zijlstra
On Fri, Oct 24, 2014 at 10:16:24AM -0500, Christoph Lameter wrote:

> Hmmm... one optimization to do before we get into these changes is to
> allow dropping mmap_sem before we go to sleep for I/O, and then
> reevaluate when the I/O etc. is complete. This is probably the longest
> hold on mmap_sem that is also frequent. Then it may be easier to use
> standard RCU later.

The hold time isn't relevant; in fact, breaking up the mmap_sem such that
we require multiple acquisitions will just increase the cacheline
bouncing.

Also, I think it makes more sense to complete an entire fault operation,
including blocking, if at all possible. Every retry will just waste more
time.

Also, there is a lot of possible blocking: there's lock_page(),
page_mkwrite() -- which ends up calling into the dirty throttling code --
etc. We could not possibly retry on all of that; the error paths
involved would be horrible, for one.

That said, there's a fair bit of code that does allow the retry, and I
think most fault paths actually do the retry on IO.
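
For reference, a minimal sketch of that existing retry pattern (roughly
the v3.17-era arch fault handler flow; details vary per architecture):

	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
	int fault;

	down_read(&mm->mmap_sem);
retry:
	fault = handle_mm_fault(mm, vma, address, flags);
	if (fault & VM_FAULT_RETRY) {
		/* mmap_sem was dropped for us while the fault slept on IO */
		flags &= ~FAULT_FLAG_ALLOW_RETRY;	/* retry at most once */
		flags |= FAULT_FLAG_TRIED;
		down_read(&mm->mmap_sem);
		vma = find_vma(mm, address);		/* the tree may have changed */
		if (vma && vma->vm_start <= address)
			goto retry;
	}
	up_read(&mm->mmap_sem);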


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-24 Thread Christoph Lameter
On Tue, 21 Oct 2014, Peter Zijlstra wrote:

> On Mon, Oct 20, 2014 at 04:41:45PM -0700, Linus Torvalds wrote:
> > On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> > > Manage the VMAs with SRCU such that we can do a lockless VMA lookup.
> >
> > Can you explain why srcu, and not plain regular rcu?
> >
> > Especially as you then *note* some of the problems srcu can have.
> > Making it regular rcu would also seem to make it possible to make the
> > seqlock be just a seqcount, no?
>
> Because we need to hold onto the RCU read side lock across the entire
> fault, which can involve IO and all kinds of other blocking ops.

Hmmm... one optimization to do before we get into these changes is to
allow dropping mmap_sem before we go to sleep for I/O, and then
reevaluate when the I/O etc. is complete. This is probably the longest
hold on mmap_sem that is also frequent. Then it may be easier to use
standard RCU later.


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-24 Thread Peter Zijlstra
On Fri, Oct 24, 2014 at 11:33:58AM +0800, Lai Jiangshan wrote:
> On 10/23/2014 07:03 PM, Peter Zijlstra wrote:
> > On Thu, Oct 23, 2014 at 06:14:45PM +0800, Lai Jiangshan wrote:
> >>
> >>>
> >>> +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
> >>> +{
> >>> + struct vm_area_struct *vma;
> >>> + unsigned int seq;
> >>> +
> >>> + WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
> >>> +
> >>> + do {
> >>> + seq = read_seqbegin(&mm->mm_seq);
> >>> + vma = __find_vma(mm, addr);
> >>
> >> will __find_vma() loop forever due to the rotations in the rbtree?
> > 
> > No, a rotation takes a tree and generates a tree; furthermore, the
> > rotation has a fairly strict forward-progress guarantee, seeing how
> > it's now done with preemption disabled.
> 
> I can't get the magic.
> 
> __find_vma is visiting vma_a,
> vma_a is rotated to near the top due to multiple updates to the mm.
> __find_vma is visiting down to near the bottom, vma_b.
> now vma_b is rotated up to near the top again.
> __find_vma is visiting down to near the bottom, vma_c.
> now vma_c is rotated up to near the top again.
> 
> ...

Why would there be that many rotations? Is this a scenario where someone
is endlessly changing the tree?

If you stop updating the tree, the traversal will finish.

This is no different from the reader starvation already present with
seqlocks.
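
Concretely, the reader loop in find_vma_srcu() can only keep retrying
while mm_seq keeps being bumped -- the same livelock window any seqlock
reader has under a ceaseless writer:

	do {
		seq = read_seqbegin(&mm->mm_seq);	/* also waits out an in-flight write */
		vma = __find_vma(mm, addr);		/* the walk itself always terminates */
	} while (read_seqretry(&mm->mm_seq, seq));	/* retries iff a write happened meanwhile */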


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-23 Thread Lai Jiangshan
On 10/23/2014 07:03 PM, Peter Zijlstra wrote:
> On Thu, Oct 23, 2014 at 06:14:45PM +0800, Lai Jiangshan wrote:
>>
>>>
>>> +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
>>> +{
>>> +   struct vm_area_struct *vma;
>>> +   unsigned int seq;
>>> +
>>> +   WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
>>> +
>>> +   do {
>>> +   seq = read_seqbegin(&mm->mm_seq);
>>> +   vma = __find_vma(mm, addr);
>>
>> will __find_vma() loop forever due to the rotations in the rbtree?
> 
> No, a rotation takes a tree and generates a tree; furthermore, the
> rotation has a fairly strict forward-progress guarantee, seeing how
> it's now done with preemption disabled.

I can't get the magic.

__find_vma is visiting vma_a,
vma_a is rotated to near the top due to multiple updates to the mm.
__find_vma is visiting down to near the bottom, vma_b.
now vma_b is rotated up to near the top again.
__find_vma is visiting down to near the bottom, vma_c.
now vma_c is rotated up to near the top again.

...

> 
> Therefore, even if we're in a node that's being rotated up, we can only
> 'loop' for as long as it takes for the new pointer stores to become
> visible on our CPU.
> 
> Thus we have a tree descent termination guarantee.



Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-23 Thread Peter Zijlstra
On Thu, Oct 23, 2014 at 06:14:45PM +0800, Lai Jiangshan wrote:
> 
> >  
> > +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
> > +{
> > +   struct vm_area_struct *vma;
> > +   unsigned int seq;
> > +
> > +   WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
> > +
> > +   do {
> > +   seq = read_seqbegin(&mm->mm_seq);
> > +   vma = __find_vma(mm, addr);
> 
> will __find_vma() loop forever due to the rotations in the rbtree?

No, a rotation takes a tree and generates a tree; furthermore, the
rotation has a fairly strict forward-progress guarantee, seeing how it's
now done with preemption disabled.

Therefore, even if we're in a node that's being rotated up, we can only
'loop' for as long as it takes for the new pointer stores to become
visible on our CPU.

Thus we have a tree descent termination guarantee.
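
For illustration, the heart of a left rotation (generic rbtree shape,
colour and parent fixup elided -- not the kernel's exact rotation code):

	static void rotate_left_sketch(struct rb_node *x)
	{
		struct rb_node *y = x->rb_right;

		x->rb_right = y->rb_left;	/* x adopts y's left subtree */
		y->rb_left  = x;		/* x becomes y's left child */
	}

Only two child pointers change, and each intermediate state a reader can
observe still leads into a valid subtree; at worst the reader revisits a
node until the new stores become visible on its CPU.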


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-23 Thread Lai Jiangshan

>  
> +struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr)
> +{
> + struct vm_area_struct *vma;
> + unsigned int seq;
> +
> + WARN_ON_ONCE(!srcu_read_lock_held(&vma_srcu));
> +
> + do {
> + seq = read_seqbegin(&mm->mm_seq);
> + vma = __find_vma(mm, addr);

will __find_vma() loop forever due to the rotations in the rbtree?

> + } while (read_seqretry(&mm->mm_seq, seq));
> +
> + return vma;
> +}


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-21 Thread Peter Zijlstra
On Mon, Oct 20, 2014 at 04:41:45PM -0700, Linus Torvalds wrote:
> On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> > Manage the VMAs with SRCU such that we can do a lockless VMA lookup.
> 
> Can you explain why srcu, and not plain regular rcu?
> 
> Especially as you then *note* some of the problems srcu can have.
> Making it regular rcu would also seem to make it possible to make the
> seqlock be just a seqcount, no?

Ah, the reason I did the seqlock is that the read side will spin-wait
for the &1 (a write in progress) to go away. If the write side is
preemptible that's horrid. I used seqlock because it takes a lock (and
thus disables preemption) on the write side, but I could equally have
done:

preempt_disable();
write_seqcount_begin();

...

write_seqcount_end();
preempt_enable();

Since the lock is indeed superfluous, we're already fully serialized by
mmap_sem in this path.

Using regular RCU isn't sufficient, because of PREEMPT_RCU.
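
For completeness, the spin in question is roughly what
read_seqcount_begin() does; a writer preempted with the count odd keeps
every reader stuck here:

	unsigned ret;

repeat:
	ret = ACCESS_ONCE(s->sequence);
	if (ret & 1) {			/* odd: write in progress */
		cpu_relax();
		goto repeat;		/* horrid if that writer was preempted */
	}
	smp_rmb();			/* order against the subsequent reads */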


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-21 Thread Peter Zijlstra
On Mon, Oct 20, 2014 at 04:41:45PM -0700, Linus Torvalds wrote:
> On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> > Manage the VMAs with SRCU such that we can do a lockless VMA lookup.
> 
> Can you explain why srcu, and not plain regular rcu?
> 
> Especially as you then *note* some of the problems srcu can have.
> Making it regular rcu would also seem to make it possible to make the
> seqlock be just a seqcount, no?

Because we need to hold onto the RCU read side lock across the entire
fault, which can involve IO and all kinds of other blocking ops.
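
That is the property SRCU buys us -- the read-side critical section may
sleep, so the whole fault can sit inside it:

	int idx;

	idx = srcu_read_lock(&vma_srcu);
	/* ... the entire fault: lock_page(), IO wait, page_mkwrite() ... */
	srcu_read_unlock(&vma_srcu, idx);

rcu_read_lock() would forbid all of the blocking above.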


Re: [RFC][PATCH 4/6] SRCU free VMAs

2014-10-20 Thread Linus Torvalds
On Mon, Oct 20, 2014 at 2:56 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

Can you explain why srcu, and not plain regular rcu?

Especially as you then *note* some of the problems srcu can have.
Making it regular rcu would also seem to make it possible to make the
seqlock be just a seqcount, no?

  Linus


[RFC][PATCH 4/6] SRCU free VMAs

2014-10-20 Thread Peter Zijlstra
Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback; this keeps files
valid during speculative faults and is possible due to the delayed
fput work by Al Viro -- do we need srcu_barrier() in unmount
someplace?

We guard the mm_rb tree with a seqlock (XXX could be a seqcount but
we'd have to disable preemption around the write side in order to make
the retry loop in __read_seqcount_begin() work) such that we can know
if the rb tree walk was correct. We cannot trust the result of a
lockless tree walk in the face of concurrent tree rotations; although
we can rely on the termination of such walks -- tree rotations
guarantee the end result is a tree again, after all.

Furthermore, we rely on the WMB implied by the
write_seqlock/count_begin() to separate the VMA initialization and the
publishing stores, analogous to the RELEASE in rcu_assign_pointer().
We also rely on the RMB from read_seqretry() to separate the vma load
from further loads like the smp_read_barrier_depends() in regular
RCU.

We must not touch the vmacache while doing SRCU lookups as that is not
properly serialized against changes. We update gap information after
publishing the VMA, but A) we don't use that and B) the seqlock
read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we
can easily detect such 'dead' nodes, we rely on the WMB from
write_sequnlock() to separate the tree removal and clearing the node.

Provide find_vma_srcu() which wraps the required magic.
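
To illustrate, a hypothetical caller (not part of this patch; only
vma_srcu, find_vma_srcu() and vma_is_dead() exist here) would compose
the pieces roughly like so:

	int idx, dead = 1;
	unsigned int seq;
	struct vm_area_struct *vma;

	idx = srcu_read_lock(&vma_srcu);	/* VMA memory stays valid */
	vma = find_vma_srcu(mm, addr);		/* lockless, seq-validated walk */
	if (vma && vma->vm_start <= addr) {
		/* snapshot the same counter vma_is_dead() checks; a real
		 * caller would also bail if it is odd (write in flight) */
		seq = ACCESS_ONCE(vma->vm_sequence.sequence);
		smp_rmb();
		/* ... attempt the fault without mmap_sem ... */
		dead = vma_is_dead(vma, seq);	/* changed underneath us? */
	}
	srcu_read_unlock(&vma_srcu, idx);
	/* if (dead): fall back to the mmap_sem path */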

XXX: mmap()/munmap() heavy workloads might suffer from the global lock
in call_srcu() -- this is fixable with a 'better' SRCU implementation.

Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
---
 include/linux/mm_types.h |3 +
 kernel/fork.c|1 
 mm/init-mm.c |1 
 mm/internal.h|   18 +
 mm/mmap.c|   88 ---
 5 files changed, 91 insertions(+), 20 deletions(-)

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/seqlock.h>
+#include <linux/srcu.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -310,6 +311,7 @@ struct vm_area_struct {
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
 #endif
seqcount_t vm_sequence;
+   struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
@@ -347,6 +349,7 @@ struct kioctx_table;
 struct mm_struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
+   seqlock_t mm_seq;
u32 vmacache_seqnum;   /* per-thread vmacache */
 #ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -553,6 +553,7 @@ static struct mm_struct *mm_init(struct
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+   seqlock_init(&mm->mm_seq);
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,7 @@
 
 struct mm_struct init_mm = {
.mm_rb  = RB_ROOT,
+   .mm_seq = __SEQLOCK_UNLOCKED(init_mm.mm_seq),
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -14,6 +14,24 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 
+extern struct srcu_struct vma_srcu;
+
+extern struct vm_area_struct *find_vma_srcu(struct mm_struct *mm, unsigned long addr);
+
+static inline bool vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
+{
+   int ret = RB_EMPTY_NODE(&vma->vm_rb);
+   unsigned seq = ACCESS_ONCE(vma->vm_sequence.sequence);
+
+   /*
+* Matches both the wmb in write_seqlock_{begin,end}() and
+* the wmb in vma_rb_erase().
+*/
+   smp_rmb();
+
+   return ret || seq != sequence;
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
 
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -247,6 +247,23 @@ void unlink_file_vma(struct vm_area_stru
}
 }
 
+DEFINE_SRCU(vma_srcu);
+
+static void __free_vma(struct rcu_head *head)
+{
+   struct vm_area_struct *vma =
+   container_of(head, struct vm_area_struct, vm_rcu_head);
+
+   if (vma->vm_file)
+   fput(vma->vm_file);
+   kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+   call_srcu(&vma_srcu, &vma->vm_rcu_head, __free_vma);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -257,10 +274,8 @@ static struct vm_area_struct *remove_vma
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
-   if (vma->vm_file)
-   fput(vma->vm_file);
mpol_put(vma_policy(vma));
-   
