Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
> How's this? I went with get_val_no_fault(), since it isn't really a
> get_user.*() any more (ptr being passed in), and no_paging is a little
> misleading (not all faults are due to paging).

How ironic: I deliberately didn't choose "no_fault" because that
function *does* take page faults.  It only disables paging operations! :)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Linus Torvalds


On Wed, 23 Feb 2005, Olof Johansson wrote:
> 
> How's this? I went with get_val_no_fault(), since it isn't really a
> get_user.*() any more (ptr being passed in), and no_paging is a little
> misleading (not all faults are due to paging).

Applied with minor cosmetic changes. I'm like a dog who likes to pee on 
things to show his territory, so I changed "get_val_no_fault" to 
"get_futex_value_locked", and I made sure that the return value is 
sensible (return 0 or -EFAULT rather than the "__memcpy_from_user()" 
return value which is how many bytes we couldn't copy).

Not that we care (we just check the return value against zero anyway,
which is success in both cases), but the compiler should be able to
optimize it away, and it might avoid some confusion down the line..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 06:49:46PM +, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > > I suggest putting it into futex.c, and make it an inline function
> > > which takes "u32 __user *".
> > 
> > Agreed, except we've traditionally just made it "int __user *".
> 
> The type signatures in futex.c are a bit mixed up - most places say
> "int __user *" but sys_futex() says "u32 __user *".  get_futex_key
> uses sizeof(u32) to check the address.

How's this? I went with get_val_no_fault(), since it isn't really a
get_user.*() any more (ptr being passed in), and no_paging is a little
misleading (not all faults are due to paging).


-

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-23 13:10:16.0 -0600
@@ -258,6 +258,18 @@
}
 }
 
+static inline int get_val_no_fault(int *dest, int __user *from)
+{
+   int ret;
+
+   inc_preempt_count();
+   ret = __copy_from_user_inatomic(dest, from, sizeof(int));
+   dec_preempt_count();
+   preempt_check_resched();
+
+   return ret;
+}
+
 /*
  * The hash bucket lock must be held when this is called.
  * Afterwards, the futex_q must not be accessed.
@@ -329,6 +341,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr1, );
@@ -355,9 +368,20 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   ret = get_val_no_fault(, (int __user *)uaddr1);
+
+   if (unlikely(ret)) {
+   /* If we would have faulted, release mmap_sem, fault
+* it in and start all over again.
+*/
+   up_read(>mm->mmap_sem);
+
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +504,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr, );
@@ -508,9 +533,23 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+
+   ret = get_val_no_fault(, (int __user *)uaddr);
+
+   if (unlikely(ret)) {
+   /* If we would have faulted, release mmap_sem, fault it in and
+* start all over again.
+*/
+   up_read(>mm->mmap_sem);
+
+   if (!unqueue_me()) /* There's a chance we got woken already */
+   return 0;
+
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-23 12:53:22.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol->policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(>mm->mmap_sem);
+   vma = NULL;
+   }
+
if (policy && put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Linus Torvalds wrote:
> > I suggest putting it into futex.c, and make it an inline function
> > which takes "u32 __user *".
> 
> Agreed, except we've traditionally just made it "int __user *".

The type signatures in futex.c are a bit mixed up - most places say
"int __user *" but sys_futex() says "u32 __user *".  get_futex_key
uses sizeof(u32) to check the address.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Olof Johansson <[EMAIL PROTECTED]> wrote:

> > Alternately, you could just have do_page_fault() do:
> > 
> > while (!down_read_trylock(>mm->mmap_sem))
> > continue;
> > 
> > However, note that this can suffer from starvation due to a never ending
> > flow of mixed write-locks and read-locks on other CPUs. Unlikely, true,
> > but not impossible.
> 
> How can this help? 
> 
> The semaphore is held for reading by the thread that faulted in
> futex_wait() -> get_user(), so no writers will be let through. Until the
> writer has been let through, the down_read_trylock will never succeed
> either. No forward progress can be made even with the above loop.

You're right. The "writers" would have to spin instead.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 06:22:04PM +, Jamie Lokier wrote:
> Olof Johansson wrote:
> > On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
> > 
> > > > Otherwise, a preempt attempt in get_user would not be seen
> > > > until some future preempt_enable was executed.
> > > 
> > > True. I guess we should have a "preempt_check_resched()" there too. 
> > > That's 
> > > what "kunmap_atomic()" does too (which is what we rely on in the other 
> > > case we do this..)
> > 
> > Ok, this is getting complex enough to warrant get_user_inatomic(),
> > which means adding it to every arch's uaccess.h.
> > 
> > Below patch does so. Unfortunately I don't have a Viro setup with cross
> > compilers for nearly every arch, so I can't make sure it doesn't break
> > anything. But since I pasted the same code everywhere it shouldn't.
> 
> My turn to say uglee.

Yeah, I wasn't entirely happy about it, but it seems that suggestions
are coming in on how to do it better. :-)

> Firstly, get_user_inatomic is the wrong name.
> 
> "inatomic" in __copy_from_user_inatomic means it's called inside a
> non-premptable region (in atomic...).
> 
> Your macro get_user_inatomic is _not_ called inside a
> non-preemptable region, so it shouldn't be called "inatomic".
> 
> (A better name is get_user_no_paging).
>
> Secondly, does this _one_ use (it's not likely to be used elsewhere)
> justify copying & pasting the same code into every asm-*/uaccess,
> especially when the code is not in any way arch-specific?

Arjan suggested creating a linux/uaccess.h that includes asm/uaccess.h,
and start moving the users over since that's where the trend is moving
anyway (avoiding including asm/* from common code). futex.c would be
the first user, and could be followed by more later as a janitorial
patch. That'd mean only one addition of the common function instead of
having to add it to every arch.

> I suggest putting it into futex.c, and make it an inline function
> which takes "u32 __user *".

Sure, for now that's good enough. Above janitorial work could be done
later, if more users get introduced.


-Olof
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Linus Torvalds


On Wed, 23 Feb 2005, Jamie Lokier wrote:
> 
> I suggest putting it into futex.c, and make it an inline function
> which takes "u32 __user *".

Agreed, except we've traditionally just made it "int __user *".

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
> On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
> 
> > > Otherwise, a preempt attempt in get_user would not be seen
> > > until some future preempt_enable was executed.
> > 
> > True. I guess we should have a "preempt_check_resched()" there too. That's 
> > what "kunmap_atomic()" does too (which is what we rely on in the other 
> > case we do this..)
> 
> Ok, this is getting complex enough to warrant get_user_inatomic(),
> which means adding it to every arch's uaccess.h.
> 
> Below patch does so. Unfortunately I don't have a Viro setup with cross
> compilers for nearly every arch, so I can't make sure it doesn't break
> anything. But since I pasted the same code everywhere it shouldn't.

My turn to say uglee.

Firstly, get_user_inatomic is the wrong name.

"inatomic" in __copy_from_user_inatomic means it's called inside a
non-premptable region (in atomic...).

Your macro get_user_inatomic is _not_ called inside a
non-preemptable region, so it shouldn't be called "inatomic".

(A better name is get_user_no_paging).

Secondly, does this _one_ use (it's not likely to be used elsewhere)
justify copying & pasting the same code into every asm-*/uaccess,
especially when the code is not in any way arch-specific?

I suggest putting it into futex.c, and make it an inline function
which takes "u32 __user *".

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Arjan van de Ven
On Wed, 2005-02-23 at 11:10 -0600, Olof Johansson wrote:
> On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
> 
> > > Otherwise, a preempt attempt in get_user would not be seen
> > > until some future preempt_enable was executed.
> > 
> > True. I guess we should have a "preempt_check_resched()" there too. That's 
> > what "kunmap_atomic()" does too (which is what we rely on in the other 
> > case we do this..)
> 
> Ok, this is getting complex enough to warrant get_user_inatomic(),
> which means adding it to every arch's uaccess.h.
> 
> Below patch does so. Unfortunately I don't have a Viro setup with cross
> compilers for nearly every arch, so I can't make sure it doesn't break
> anything. But since I pasted the same code everywhere it shouldn't.

I hate to do this to you, but get_user is a bit horrible in that it is
an untyped interface. Fixing it is hard (ugh) but when adding new
variants should/could we consider to please make it a typed (eg inline
and not a define) interface please ?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:

> > Otherwise, a preempt attempt in get_user would not be seen
> > until some future preempt_enable was executed.
> 
> True. I guess we should have a "preempt_check_resched()" there too. That's 
> what "kunmap_atomic()" does too (which is what we rely on in the other 
> case we do this..)

Ok, this is getting complex enough to warrant get_user_inatomic(),
which means adding it to every arch's uaccess.h.

Below patch does so. Unfortunately I don't have a Viro setup with cross
compilers for nearly every arch, so I can't make sure it doesn't break
anything. But since I pasted the same code everywhere it shouldn't.


-

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-23 10:55:37.0 -0600
@@ -329,6 +329,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr1, );
@@ -355,9 +356,17 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   ret = get_user_inatomic(curval, (int __user *)uaddr1);
+
+   if (unlikely(ret)) {
+   up_read(>mm->mmap_sem);
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +489,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr, );
@@ -508,9 +518,21 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+
+   ret = get_user_inatomic(curval, (int __user *)uaddr);
+
+   if (unlikely(ret)) {
+   up_read(>mm->mmap_sem);
+
+   if (!unqueue_me()) /* There's a chance we got woken already */
+   return 0;
+
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-23 10:16:46.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol->policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(>mm->mmap_sem);
+   vma = NULL;
+   }
+
if (policy && put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
Index: linux-2.5/include/asm-ppc64/uaccess.h
===
--- linux-2.5.orig/include/asm-ppc64/uaccess.h  2005-02-04 00:27:13.0 
-0600
+++ linux-2.5/include/asm-ppc64/uaccess.h   2005-02-23 11:01:44.0 
-0600
@@ -217,6 +217,17 @@
: "=r"(err), "=r"(x)\
: "b"(addr), "i"(errret), "0"(err))
 
+
+#define get_user_inatomic(x,ptr)   \
+({ \
+   int __ret;  \
+   inc_preempt_count();\
+   __ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr)));   \
+   dec_preempt_count();\
+   preempt_check_resched();\
+   __ret;  \
+})
+
 /* more complex routines 

Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 11:39:08AM +, David Howells wrote:
> Alternately, you could just have do_page_fault() do:
> 
>   while (!down_read_trylock(>mm->mmap_sem))
>   continue;
> 
> However, note that this can suffer from starvation due to a never ending flow
> of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not
> impossible.

How can this help? 

The semaphore is held for reading by the thread that faulted in
futex_wait() -> get_user(), so no writers will be let through. Until the
writer has been let through, the down_read_trylock will never succeed
either. No forward progress can be made even with the above loop.


-Olof
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Linus Torvalds


On Wed, 23 Feb 2005, Joe Korty wrote:
> 
> Perhaps this should be preempt_disable  preempt_enable.

No, the problem with preempt_disable/enable is that they go away if 
preemption is not enabled. So you really do have to do it by hand with the 
"inc_preempt_count".

> Otherwise, a preempt attempt in get_user would not be seen
> until some future preempt_enable was executed.

True. I guess we should have a "preempt_check_resched()" there too. That's 
what "kunmap_atomic()" does too (which is what we rely on in the other 
case we do this..)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Joe Korty
On Tue, Feb 22, 2005 at 01:30:27PM -0800, Linus Torvalds wrote:
> 
> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.
> 
> Just do
> 
>   repeat:
>   down_read(>mm->mmap_sem);
>   get_futex_key(...) etc.
>   queue_me(...) etc.
>   inc_preempt_count();
>   ret = get_user(...);
>   dec_preempt_count();

Perhaps this should be preempt_disable  preempt_enable.

Otherwise, a preempt attempt in get_user would not be seen
until some future preempt_enable was executed.

Regards,
Joe
--
"Money can buy bandwidth, but latency is forever" -- John Mashey
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Andrew Morton <[EMAIL PROTECTED]> wrote:

> wrt this down_read/down_write/down_read deadlock: iirc, the reason why
> down_write() takes precedence over down_read() is to avoid the permanent
> writer starvation which would occur if there is heavy down_read() traffic.

down_write() doesn't actually take precedence over down_read() as such. The
rwsems try to implement a strict FIFO queue (the fairness thing).

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> That is uglee. 

True. You could just wrap it up in inline functions and hide it in a header
file as I suggested in the email I've just sent.

> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.
> 
> Just do
> 
>   repeat:

Alternately, you could just have do_page_fault() do:

while (!down_read_trylock(>mm->mmap_sem))
continue;

However, note that this can suffer from starvation due to a never ending flow
of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not
impossible.


There is yet another way, now that I think about it... Another rwsem could be
added to mm_struct, one that you have to get immediately before mmap_sem and
release immediately after; one that page fault doesn't ever touch... but that
isn't very pleasant either:-/

A yet further way would be to make a second kind of rwsem; one that's
unfair. But if you do that, a few threads that're swapping a lot, combined
with a few tops could starve out another thread that's trying to do an mmap...


Actually, there's an added benefit to your suggested method... It doesn't
involve changing what's in task_struct or mm_struct and doesn't involve
changing the semantics of the mmap_sem lock.


> That's assuming we can't just make rwsem's nest nicely.

rwsems themselves? No, not really. It'd either mean keeing track of which task
holds what sort of lock on every rwsem (which'd be nice for debugging,
granted; but not something you really want in a normal system), or it'd mean
making rwsems unfair - something I think will be a really bad idea.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> It shouldn't be. If one read writer is active, another should be able to 
> come in, regardless of any pending writer trying to access it. At least 
> that's always been the rule for the rw-spinlocks _exactly_ for this 
> reaseon.

But not with rw-semaphores. I've designed them to be as fair as I can possibly
make them. This means holding reads up if there's a write pending.

> The rwsem code tries to be fairer, maybe it has broken this case in the 
> name of fairness. I personally think fairness is overrated, and would 
> rather have the rwsem implementation let readers come in.

I've seen writer starvation happening due to a continuous flow of overlapping
reads... That's one of the reasons I reimplemented rwsems.

Also, fair rwsems are easier to provide an assembly optimised form for that
doesn't involve the "thundering-herd" approach. Obviously, with the spinlock
approach, you can implement any semantics you desire once you're holding the
spinlock. With the optimised form as implemented, you can't determine how many
active readers there are. This information isn't stored anywhere.

> We have a notion of "atomic get_user()" calls, which is a lot more
> efficient, and is already used for other cases where we have _real_
> deadlocks (inode semaphore for reads into shared memory mappigns).

That's okay, provided the data you're trying to access isn't lurking on a disk
somewhere.

> > Auditing other read-takers of mmap_sem, I found one more exposure that
> > could be solved by just moving code around.
> 
> DavidH - what's the word on nested read-semaphores like this? Are they 
> supposed to work (like nested read-spinlocks), or do we need to do the 
> things Olof does?

Nesting rwsems like this is definitely on the not-recommended list.

For the special case of the mmap semaphore, I'd advocate following something
like Jamie Lokier's solution, and note when a task holds its own mmap_sem
semaphore read-locked such that page-fault can avoid taking it again.

Something like:

struct task_struct {
...
unsigned mmsem_nest;
...
};

static inline void lock_mm_for_read(struct mm_struct *mm) {
if (current->mm == mm &&
current->mmsem_nest++ != 0)
;
else
down_read(>mmap_sem);
}

static inline void unlock_mm_for_read(struct mm_struct *mm) {
if (current->mm == mm &&
--current->mmsem_nest != 0)
;
else
up_read(>mmap_sem);
}

static inline void lock_mm_for_write(struct mm_struct *mm) {
down_write(>mmap_sem);
}

static inline void unlock_mm_for_write(struct mm_struct *mm) {
up_write(>mmap_sem);
}

Though I'd be tempted to say that faulting is the only case in which recursion
is permitted, and change the first two functions to reflect this and add an
extra pair specially for page-fault:

static inline void lock_mm_for_read(struct mm_struct *mm) {
down_read(>mmap_sem);
if (current->mm == mm &&
current->flags |= PF_MMAP_SEM)
}

static inline void unlock_mm_for_read(struct mm_struct *mm) {
if (current->mm == mm &&
current->flags &= ~PF_MMAP_SEM)
up_read(>mmap_sem);
}

static inline void lock_mm_for_fault(struct mm_struct *mm) {
if (!(current->flags & PF_MMAP_SEM))
down_read(>mmap_sem);
}

static inline void unlock_mm_for_fault(struct mm_struct *mm) {
if (!(current->flags & PF_MMAP_SEM))
up_read(>mmap_sem);
}

If current->mm is passed as the argument, the compiler's optimiser should
discard the first comparison in the if-statement.

Then futex.c would hold:

lock_mm_for_read(>mm);
get_futex_key(...) etc.
queue_me(...) etc.
ret = get_user(...);
/* the rest */
unlock_mm_for_read(>mm);

And do_page_fault() would hold:

lock_mm_for_fault(>mm);
...
unlock_mm_for_fault(>mm);

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Linus Torvalds [EMAIL PROTECTED] wrote:

 It shouldn't be. If one read writer is active, another should be able to 
 come in, regardless of any pending writer trying to access it. At least 
 that's always been the rule for the rw-spinlocks _exactly_ for this 
 reaseon.

But not with rw-semaphores. I've designed them to be as fair as I can possibly
make them. This means holding reads up if there's a write pending.

 The rwsem code tries to be fairer, maybe it has broken this case in the 
 name of fairness. I personally think fairness is overrated, and would 
 rather have the rwsem implementation let readers come in.

I've seen writer starvation happening due to a continuous flow of overlapping
reads... That's one of the reasons I reimplemented rwsems.

Also, fair rwsems are easier to provide an assembly optimised form for that
doesn't involve the thundering-herd approach. Obviously, with the spinlock
approach, you can implement any semantics you desire once you're holding the
spinlock. With the optimised form as implemented, you can't determine how many
active readers there are. This information isn't stored anywhere.

 We have a notion of atomic get_user() calls, which is a lot more
 efficient, and is already used for other cases where we have _real_
 deadlocks (inode semaphore for reads into shared memory mappigns).

That's okay, provided the data you're trying to access isn't lurking on a disk
somewhere.

  Auditing other read-takers of mmap_sem, I found one more exposure that
  could be solved by just moving code around.
 
 DavidH - what's the word on nested read-semaphores like this? Are they 
 supposed to work (like nested read-spinlocks), or do we need to do the 
 things Olof does?

Nesting rwsems like this is definitely on the not-recommended list.

For the special case of the mmap semaphore, I'd advocate following something
like Jamie Lokier's solution, and note when a task holds its own mmap_sem
semaphore read-locked such that page-fault can avoid taking it again.

Something like:

struct task_struct {
...
unsigned mmsem_nest;
...
};

static inline void lock_mm_for_read(struct mm_struct *mm) {
if (current-mm == mm 
current-mmsem_nest++ != 0)
;
else
down_read(mm-mmap_sem);
}

static inline void unlock_mm_for_read(struct mm_struct *mm) {
if (current-mm == mm 
--current-mmsem_nest != 0)
;
else
up_read(mm-mmap_sem);
}

static inline void lock_mm_for_write(struct mm_struct *mm) {
down_write(mm-mmap_sem);
}

static inline void unlock_mm_for_write(struct mm_struct *mm) {
up_write(mm-mmap_sem);
}

Though I'd be tempted to say that faulting is the only case in which recursion
is permitted, and change the first two functions to reflect this and add an
extra pair specially for page-fault:

static inline void lock_mm_for_read(struct mm_struct *mm) {
down_read(mm-mmap_sem);
if (current-mm == mm 
current-flags |= PF_MMAP_SEM)
}

static inline void unlock_mm_for_read(struct mm_struct *mm) {
if (current-mm == mm 
current-flags = ~PF_MMAP_SEM)
up_read(mm-mmap_sem);
}

static inline void lock_mm_for_fault(struct mm_struct *mm) {
if (!(current-flags  PF_MMAP_SEM))
down_read(mm-mmap_sem);
}

static inline void unlock_mm_for_fault(struct mm_struct *mm) {
if (!(current-flags  PF_MMAP_SEM))
up_read(mm-mmap_sem);
}

If current-mm is passed as the argument, the compiler's optimiser should
discard the first comparison in the if-statement.

Then futex.c would hold:

lock_mm_for_read(current-mm);
get_futex_key(...) etc.
queue_me(...) etc.
ret = get_user(...);
/* the rest */
unlock_mm_for_read(current-mm);

And do_page_fault() would hold:

lock_mm_for_fault(current-mm);
...
unlock_mm_for_fault(current-mm);

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Linus Torvalds [EMAIL PROTECTED] wrote:

 That is uglee. 

True. You could just wrap it up in inline functions and hide it in a header
file as I suggested in the email I've just sent.

 We really have this already, and it's called current-preempt. It 
 handles any lock at all, and doesn't add yet another special case to all 
 the architectures.
 
 Just do
 
   repeat:

Alternately, you could just have do_page_fault() do:

while (!down_read_trylock(current-mm-mmap_sem))
continue;

However, note that this can suffer from starvation due to a never ending flow
of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not
impossible.


There is yet another way, now that I think about it... Another rwsem could be
added to mm_struct, one that you have to get immediately before mmap_sem and
release immediately after; one that page fault doesn't ever touch... but that
isn't very pleasant either:-/

A yet further way would be to make a second kind of rwsem; one that's
unfair. But if you do that, a few threads that're swapping a lot, combined
with a few tops could starve out another thread that's trying to do an mmap...


Actually, there's an added benefit to your suggested method... It doesn't
involve changing what's in task_struct or mm_struct and doesn't involve
changing the semantics of the mmap_sem lock.


 That's assuming we can't just make rwsem's nest nicely.

rwsems themselves? No, not really. It'd either mean keeing track of which task
holds what sort of lock on every rwsem (which'd be nice for debugging,
granted; but not something you really want in a normal system), or it'd mean
making rwsems unfair - something I think will be a really bad idea.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Andrew Morton [EMAIL PROTECTED] wrote:

 wrt this down_read/down_write/down_read deadlock: iirc, the reason why
 down_write() takes precedence over down_read() is to avoid the permanent
 writer starvation which would occur if there is heavy down_read() traffic.

down_write() doesn't actually take precedence over down_read() as such. The
rwsems try to implement a strict FIFO queue (the fairness thing).

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Joe Korty
On Tue, Feb 22, 2005 at 01:30:27PM -0800, Linus Torvalds wrote:
 
 We really have this already, and it's called current-preempt. It 
 handles any lock at all, and doesn't add yet another special case to all 
 the architectures.
 
 Just do
 
   repeat:
   down_read(current-mm-mmap_sem);
   get_futex_key(...) etc.
   queue_me(...) etc.
   inc_preempt_count();
   ret = get_user(...);
   dec_preempt_count();

Perhaps this should be preempt_disable  preempt_enable.

Otherwise, a preempt attempt in get_user would not be seen
until some future preempt_enable was executed.

Regards,
Joe
--
Money can buy bandwidth, but latency is forever -- John Mashey
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Linus Torvalds


On Wed, 23 Feb 2005, Joe Korty wrote:
 
 Perhaps this should be preempt_disable  preempt_enable.

No, the problem with preempt_disable/enable is that they go away if 
preemption is not enabled. So you really do have to do it by hand with the 
inc_preempt_count.

 Otherwise, a preempt attempt in get_user would not be seen
 until some future preempt_enable was executed.

True. I guess we should have a preempt_check_resched() there too. That's 
what kunmap_atomic() does too (which is what we rely on in the other 
case we do this..)

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 11:39:08AM +, David Howells wrote:
 Alternately, you could just have do_page_fault() do:
 
   while (!down_read_trylock(current-mm-mmap_sem))
   continue;
 
 However, note that this can suffer from starvation due to a never ending flow
 of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not
 impossible.

How can this help? 

The semaphore is held for reading by the thread that faulted in
futex_wait() - get_user(), so no writers will be let through. Until the
writer has been let through, the down_read_trylock will never succeed
either. No forward progress can be made even with the above loop.


-Olof
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:

  Otherwise, a preempt attempt in get_user would not be seen
  until some future preempt_enable was executed.
 
 True. I guess we should have a preempt_check_resched() there too. That's 
 what kunmap_atomic() does too (which is what we rely on in the other 
 case we do this..)

Ok, this is getting complex enough to warrant get_user_inatomic(),
which means adding it to every arch's uaccess.h.

Below patch does so. Unfortunately I don't have a Viro setup with cross
compilers for nearly every arch, so I can't make sure it doesn't break
anything. But since I pasted the same code everywhere it shouldn't.


-

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson [EMAIL PROTECTED]


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-23 10:55:37.0 -0600
@@ -329,6 +329,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr1, key1);
@@ -355,9 +356,17 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   ret = get_user_inatomic(curval, (int __user *)uaddr1);
+
+   if (unlikely(ret)) {
+   up_read(current-mm-mmap_sem);
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +489,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr, q.key);
@@ -508,9 +518,21 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+
+   ret = get_user_inatomic(curval, (int __user *)uaddr);
+
+   if (unlikely(ret)) {
+   up_read(current-mm-mmap_sem);
+
+   if (!unqueue_me(q)) /* There's a chance we got woken already */
+   return 0;
+
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-23 10:16:46.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol-policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(current-mm-mmap_sem);
+   vma = NULL;
+   }
+
if (policy  put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
Index: linux-2.5/include/asm-ppc64/uaccess.h
===
--- linux-2.5.orig/include/asm-ppc64/uaccess.h  2005-02-04 00:27:13.0 
-0600
+++ linux-2.5/include/asm-ppc64/uaccess.h   2005-02-23 11:01:44.0 
-0600
@@ -217,6 +217,17 @@
: =r(err), =r(x)\
: b(addr), i(errret), 0(err))
 
+
+#define get_user_inatomic(x,ptr)   \
+({ \
+   int __ret;  \
+   inc_preempt_count();\
+   __ret = __copy_from_user_inatomic((x),(ptr),sizeof(*(ptr)));   \
+   dec_preempt_count();\
+   preempt_check_resched();\
+   __ret;  \
+})
+
 /* more 

Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Arjan van de Ven
On Wed, 2005-02-23 at 11:10 -0600, Olof Johansson wrote:
 On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
 
   Otherwise, a preempt attempt in get_user would not be seen
   until some future preempt_enable was executed.
  
  True. I guess we should have a preempt_check_resched() there too. That's 
  what kunmap_atomic() does too (which is what we rely on in the other 
  case we do this..)
 
 Ok, this is getting complex enough to warrant get_user_inatomic(),
 which means adding it to every arch's uaccess.h.
 
 Below patch does so. Unfortunately I don't have a Viro setup with cross
 compilers for nearly every arch, so I can't make sure it doesn't break
 anything. But since I pasted the same code everywhere it shouldn't.

I hate to do this to you, but get_user is a bit horrible in that it is
an untyped interface. Fixing it is hard (ugh) but when adding new
variants should/could we consider to please make it a typed (eg inline
and not a define) interface please ?


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
 On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
 
   Otherwise, a preempt attempt in get_user would not be seen
   until some future preempt_enable was executed.
  
  True. I guess we should have a preempt_check_resched() there too. That's 
  what kunmap_atomic() does too (which is what we rely on in the other 
  case we do this..)
 
 Ok, this is getting complex enough to warrant get_user_inatomic(),
 which means adding it to every arch's uaccess.h.
 
 Below patch does so. Unfortunately I don't have a Viro setup with cross
 compilers for nearly every arch, so I can't make sure it doesn't break
 anything. But since I pasted the same code everywhere it shouldn't.

My turn to say uglee.

Firstly, get_user_inatomic is the wrong name.

inatomic in __copy_from_user_inatomic means it's called inside a
non-premptable region (in atomic...).

Your macro get_user_inatomic is _not_ called inside a
non-preemptable region, so it shouldn't be called inatomic.

(A better name is get_user_no_paging).

Secondly, does this _one_ use (it's not likely to be used elsewhere)
justify copying  pasting the same code into every asm-*/uaccess,
especially when the code is not in any way arch-specific?

I suggest putting it into futex.c, and make it an inline function
which takes u32 __user *.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Linus Torvalds


On Wed, 23 Feb 2005, Jamie Lokier wrote:
 
 I suggest putting it into futex.c, and make it an inline function
 which takes u32 __user *.

Agreed, except we've traditionally just made it int __user *.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 06:22:04PM +, Jamie Lokier wrote:
 Olof Johansson wrote:
  On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
  
Otherwise, a preempt attempt in get_user would not be seen
until some future preempt_enable was executed.
   
   True. I guess we should have a preempt_check_resched() there too. 
   That's 
   what kunmap_atomic() does too (which is what we rely on in the other 
   case we do this..)
  
  Ok, this is getting complex enough to warrant get_user_inatomic(),
  which means adding it to every arch's uaccess.h.
  
  Below patch does so. Unfortunately I don't have a Viro setup with cross
  compilers for nearly every arch, so I can't make sure it doesn't break
  anything. But since I pasted the same code everywhere it shouldn't.
 
 My turn to say uglee.

Yeah, I wasn't entirely happy about it, but it seems that suggestions
are coming in on how to do it better. :-)

 Firstly, get_user_inatomic is the wrong name.
 
 inatomic in __copy_from_user_inatomic means it's called inside a
 non-premptable region (in atomic...).
 
 Your macro get_user_inatomic is _not_ called inside a
 non-preemptable region, so it shouldn't be called inatomic.
 
 (A better name is get_user_no_paging).

 Secondly, does this _one_ use (it's not likely to be used elsewhere)
 justify copying  pasting the same code into every asm-*/uaccess,
 especially when the code is not in any way arch-specific?

Arjan suggested creating a linux/uaccess.h that includes asm/uaccess.h,
and start moving the users over since that's where the trend is moving
anyway (avoiding including asm/* from common code). futex.c would be
the first user, and could be followed by more later as a janitorial
patch. That'd mean only one addition of the common function instead of
having to add it to every arch.

 I suggest putting it into futex.c, and make it an inline function
 which takes u32 __user *.

Sure, for now that's good enough. Above janitorial work could be done
later, if more users get introduced.


-Olof
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread David Howells
Olof Johansson [EMAIL PROTECTED] wrote:

  Alternately, you could just have do_page_fault() do:
  
  while (!down_read_trylock(current-mm-mmap_sem))
  continue;
  
  However, note that this can suffer from starvation due to a never ending
  flow of mixed write-locks and read-locks on other CPUs. Unlikely, true,
  but not impossible.
 
 How can this help? 
 
 The semaphore is held for reading by the thread that faulted in
 futex_wait() - get_user(), so no writers will be let through. Until the
 writer has been let through, the down_read_trylock will never succeed
 either. No forward progress can be made even with the above loop.

You're right. The writers would have to spin instead.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Linus Torvalds wrote:
  I suggest putting it into futex.c, and make it an inline function
  which takes u32 __user *.
 
 Agreed, except we've traditionally just made it int __user *.

The type signatures in futex.c are a bit mixed up - most places say
int __user * but sys_futex() says u32 __user *.  get_futex_key
uses sizeof(u32) to check the address.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Olof Johansson
On Wed, Feb 23, 2005 at 06:49:46PM +, Jamie Lokier wrote:
 Linus Torvalds wrote:
   I suggest putting it into futex.c, and make it an inline function
   which takes u32 __user *.
  
  Agreed, except we've traditionally just made it int __user *.
 
 The type signatures in futex.c are a bit mixed up - most places say
 int __user * but sys_futex() says u32 __user *.  get_futex_key
 uses sizeof(u32) to check the address.

How's this? I went with get_val_no_fault(), since it isn't really a
get_user.*() any more (ptr being passed in), and no_paging is a little
misleading (not all faults are due to paging).


-

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson [EMAIL PROTECTED]


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-23 13:10:16.0 -0600
@@ -258,6 +258,18 @@
}
 }
 
+static inline int get_val_no_fault(int *dest, int __user *from)
+{
+   int ret;
+
+   inc_preempt_count();
+   ret = __copy_from_user_inatomic(dest, from, sizeof(int));
+   dec_preempt_count();
+   preempt_check_resched();
+
+   return ret;
+}
+
 /*
  * The hash bucket lock must be held when this is called.
  * Afterwards, the futex_q must not be accessed.
@@ -329,6 +341,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr1, key1);
@@ -355,9 +368,20 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   ret = get_val_no_fault(curval, (int __user *)uaddr1);
+
+   if (unlikely(ret)) {
+   /* If we would have faulted, release mmap_sem, fault
+* it in and start all over again.
+*/
+   up_read(current-mm-mmap_sem);
+
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +504,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr, q.key);
@@ -508,9 +533,23 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+
+   ret = get_val_no_fault(curval, (int __user *)uaddr);
+
+   if (unlikely(ret)) {
+   /* If we would have faulted, release mmap_sem, fault it in and
+* start all over again.
+*/
+   up_read(current-mm-mmap_sem);
+
+   if (!unqueue_me(q)) /* There's a chance we got woken already */
+   return 0;
+
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-23 12:53:22.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol-policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(current-mm-mmap_sem);
+   vma = NULL;
+   }
+
if (policy  put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Linus Torvalds


On Wed, 23 Feb 2005, Olof Johansson wrote:
 
 How's this? I went with get_val_no_fault(), since it isn't really a
 get_user.*() any more (ptr being passed in), and no_paging is a little
 misleading (not all faults are due to paging).

Applied with minor cosmetic changes. I'm like a dog who likes to pee on 
things to show his territory, so I changed get_val_no_fault to 
get_futex_value_locked, and I made sure that the return value is 
sensible (return 0 or -EFAULT rather than the __memcpy_from_user() 
return value which is how many bytes we couldn't copy).

Not that we care (we just check the return value against zero anyway,
which is success in both cases), but the compiler should be able to
optimize it away, and it might avoid some confusion down the line..

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
 How's this? I went with get_val_no_fault(), since it isn't really a
 get_user.*() any more (ptr being passed in), and no_paging is a little
 misleading (not all faults are due to paging).

How ironic: I deliberately didn't choose no_fault because that
function *does* take page faults.  It only disables paging operations! :)

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Olof Johansson
On Tue, Feb 22, 2005 at 03:20:27PM -0800, Andrew Morton wrote:
> [EMAIL PROTECTED] (Olof Johansson) wrote:
> >
> > +   inc_preempt_count();
> > +   ret = get_user(curval, (int __user *)uaddr1);
> > +   dec_preempt_count();
> 
> That _should_ generate a might_sleep() warning, except it looks like we
> forgot to add a check to get_user().
> 
> It would be better to use __copy_from_user_inatomic() here, I think.

Thanks for catching that. New rev below.

-

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-22 17:20:22.0 -0600
@@ -329,6 +329,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr1, );
@@ -355,9 +356,19 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   inc_preempt_count();
+   ret = __copy_from_user_inatomic(, (int __user *)uaddr1, 
sizeof(int));
+   dec_preempt_count();
+
+   if (unlikely(ret)) {
+   up_read(>mm->mmap_sem);
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +491,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr, );
@@ -508,9 +520,21 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+   inc_preempt_count();
+   ret = __copy_from_user_inatomic(, (int __user *)uaddr, 
sizeof(int));
+   dec_preempt_count();
+   if (unlikely(ret)) {
+   up_read(>mm->mmap_sem);
+
+   if (!unqueue_me()) /* There's a chance we got woken already */
+   return 0;
+
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol->policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(>mm->mmap_sem);
+   vma = NULL;
+   }
+
if (policy && put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Andrew Morton
[EMAIL PROTECTED] (Olof Johansson) wrote:
>
> + inc_preempt_count();
> + ret = get_user(curval, (int __user *)uaddr1);
> + dec_preempt_count();

That _should_ generate a might_sleep() warning, except it looks like we
forgot to add a check to get_user().

It would be better to use __copy_from_user_inatomic() here, I think.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Greg KH
On Tue, Feb 22, 2005 at 02:10:58PM -0800, Linus Torvalds wrote:
> 
> Oh, well. The reason I hate the rwsem behaviour is exactly because it
> results in this very subtle class of deadlocks. This one case is certainly
> solvable several ways, but do we have other issues somewhere else? Things
> like kobject might be ripe with things like this. The mm semaphore tends
> to be pretty well-behaved - and I'm not sure the same is true of the
> kobject one.

I'm trying to get rid of the kobject (actually the subsystem) rwsem
right now, so it should be gone completly within a few kernel versions.

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Olof Johansson
On Tue, Feb 22, 2005 at 10:34:57PM +, Jamie Lokier wrote:
> There is one small but important error: the "return ret" mustn't just
> return.  It must call unqueue_me() just like the code at out_unqueue,
> _including_ the conditional "ret = 0", but _excluding_ the up_read().

Not only that, but someone might already have dequeued us, right? It's
probably a pathological case though, i.e. someone did a wake on the same
(bad) address.

How's this patch? It's closer to Linus' pseudo-code than Andrew's, to
avoid the extra get_user() at function entry and keep the common case
path short.

It also includes the feedback from Andrew on the sys_get_mempolicy(),
making the patch even simpler there.




Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-22 16:38:24.0 -0600
@@ -329,6 +329,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr1, );
@@ -355,9 +356,19 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   inc_preempt_count();
+   ret = get_user(curval, (int __user *)uaddr1);
+   dec_preempt_count();
+
+   if (unlikely(ret)) {
+   up_read(>mm->mmap_sem);
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +491,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(>mm->mmap_sem);
 
ret = get_futex_key(uaddr, );
@@ -508,9 +520,21 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+   inc_preempt_count();
+   ret = get_user(curval, (int __user *)uaddr);
+   dec_preempt_count();
+   if (unlikely(ret)) {
+   up_read(>mm->mmap_sem);
+
+   if (!unqueue_me()) /* There's a chance we got woken already */
+   return 0;
+
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol->policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(>mm->mmap_sem);
+   vma = NULL;
+   }
+
if (policy && put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Linus Torvalds wrote:
> > queue_me(...) etc.
> > current->flags |= PF_MMAP_SEM; <- new
> > ret = get_user(...);
> > current->flags &= PF_MMAP_SEM; <- new
> > /* the rest */
> 
> That is uglee. 
> 
> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.

Ooh, I didn't know current->preempt did that (been away).

>   repeat:
>   down_read(>mm->mmap_sem);
>   get_futex_key(...) etc.
>   queue_me(...) etc.
>   inc_preempt_count();
>   ret = get_user(...);
>   dec_preempt_count();
>   if (unlikely(ret)) {
>   up_read(>mm->mmap_sem);
>   /* Re-do the access outside the lock */
>   ret = get_user(...);
>   if (!ret)
>   goto repeat;
>   return ret;
>   }

That would work.  I like it. :)

Page faults will enter the fault handler twice (i.e. slower), but
that's not really a disadvantage, because a program always references
the memory just before calling futex_wait anyway.  A fault is rare.

There is one small but important error: the "return ret" mustn't just
return.  It must call unqueue_me() just like the code at out_unqueue,
_including_ the conditional "ret = 0", but _excluding_ the up_read().

Alternatively, since it's a rare case, just shuffle the loop around:

down_read(>mm->mmap_sem);
repeat:
get_futex_key(...) etc.
queue_me(...) etc.
inc_preempt_count();
ret = get_user(...);
dec_preempt_count();
if (unlikely(ret)) {
up_read(>mm->mmap_sem);
/* Re-do the access outside the lock */
ret = get_user(...);
down_read(>mm->mmap_sem);
if (!ret)
goto repeat;
goto out_unqueue;
}

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Tue, 2005-02-22 at 14:10 -0800, Linus Torvalds wrote:

> Oh, well. The reason I hate the rwsem behaviour is exactly because it
> results in this very subtle class of deadlocks. This one case is certainly
> solvable several ways, but do we have other issues somewhere else? Things
> like kobject might be ripe with things like this. The mm semaphore tends
> to be pretty well-behaved - and I'm not sure the same is true of the
> kobject one.

We could detect those tho. When the appropriate DEBUG option is set, by
storing a cpumask in the semaphore we could detect if it's already taken
on this cpu...

> Normal recursive deadlocks are wonderful - most of them show up
> immediately, so assuming you just have enough coverage, you're fine. This
> fairness-related deadlock requires a race to happen.

Unless you consider that taking the read semaphore twice on the same CPU
is always a bug, thus the above stuff would work for catching them at
least more often...
 
> Maybe it would be sufficient to have a debugging version of rwsems that
> just notice recursion?
> 
>   Linus
-- 
Benjamin Herrenschmidt <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Olof Johansson wrote:
> > That won't work because the vma lock must be help between key
> > calculation and get_user() - otherwise futex is not reliable.  It
> > would work if the futex key calculation was inside the loop.
> 
> Sure, but that's still true: It's just that the get_user() is done twice
> instead. The semaphore is never released between the key calculation and
> the "real" get_user().

Ah, I didn't look at where the loop is used and didn't think there'd
be _two_ get_user() calls in the fast case.  Not my instinct.

> > A much simpler solution (and sorry for not offering it earlier,
> > because Andrew Morton pointed out this bug long ago, but I was busy), is:
> 
> Either way works for me. Andrew/Linus, got a preference? I'll either
> post my refresh based on Andrews comments, or code up Jamie's
> suggestion.

Yours has a couple of problems.

   1. It'll make futex waits somewhat slower.  One of the nicer features
  of 2.6 futexes is that we got rid of the explicit page table lookup.

   2. It's broken because a page can be paged out by another thread
  after you've forced it in and before the get_user().  We only
  take mmap_sem, not the page table lock.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
> 
> Yours is probably the most efficient too. Note sure what is best for
> rwsems tho, there seem to be some interest preventing readers from
> starving writers for ever, this has been debated endlessly iirc,
> though I have no personal opinion there.

Yes, the starvation issue is potentially real. And thinking about it,
we've even had that in real life, with /proc and lots of page faults. So I
guess that's a strong argument for the fairness thing.

Oh, well. The reason I hate the rwsem behaviour is exactly because it
results in this very subtle class of deadlocks. This one case is certainly
solvable several ways, but do we have other issues somewhere else? Things
like kobject might be ripe with things like this. The mm semaphore tends
to be pretty well-behaved - and I'm not sure the same is true of the
kobject one.

Normal recursive deadlocks are wonderful - most of them show up
immediately, so assuming you just have enough coverage, you're fine. This
fairness-related deadlock requires a race to happen.

Maybe it would be sufficient to have a debugging version of rwsems that
just notice recursion?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Tue, 22 Feb 2005, Andrew Morton wrote:
> 
> However the pte can get unmapped by memory reclaim so we could still take a
> minor fault, and hit the same deadlock, yes?

You _could_ fix that by getting the pagetable spinlock, I guess. Which
check_user_page_readable() assumes you'd be holding anyway (not holding it
would appear to be a bug).

At which point you might as well just walk the tables by hand and just do 
the read that way. Of course, then you have virtual aliasing issues etc.

Insane, but possible.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Tue, 2005-02-22 at 13:31 -0800, Linus Torvalds wrote:
> 
> On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
> > 
> > Isn't Olof scheme racy ? Can't the stuff get swapped out between the
> > first get_user() and the "real" one ?
> 
> Yes. But see my suggested modification (which I still think is "the thing 
> that Olof does", except it's more efficient and avoids the race).
> 
> If rwsems acted like rwlocks, we wouldn't have this issue at all.

Yours is probably the most efficient too. Note sure what is best for
rwsems tho, there seem to be some interest preventing readers from
starving writers for ever, this has been debated endlessly iirc,
though I have no personal opinion there.

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Andrew Morton
Jamie Lokier <[EMAIL PROTECTED]> wrote:
>
> ...
> 
> > > One attempt to fix this is included below. It works, but I'm not entirely
> > > happy with the fact that it's a bit messy solution. If anyone has a
> > > better idea for how to solve it I'd be all ears.
> > 
> > It's fairly sane.  Style-wise I'd be inclined to turn this:
> > 
> > down_read(>mm->mmap_sem);
> > while (!check_user_page_readable(current->mm, uaddr1)) {
> > up_read(>mm->mmap_sem);
> > /* Fault in the page through get_user() but discard result */
> > if (get_user(curval, (int __user *)uaddr1) != 0)
> > return -EFAULT;
> > down_read(>mm->mmap_sem);
> > }
> 
> That won't work because the vma lock must be help between key
> calculation and get_user() - otherwise futex is not reliable.  It
> would work if the futex key calculation was inside the loop.

All the above is trying to do is to convert the initial down_read(mmap_sem)
into a function which, on exit, guarantees that

a) down_read(mmap_sem) is held and

b) the subsequent get_user() of that address will not generate a pagefault.

So it shouldn't affect the futex code's atomicity at all.

However the pte can get unmapped by memory reclaim so we could still take a
minor fault, and hit the same deadlock, yes?

> A much simpler solution (and sorry for not offering it earlier,
> because Andrew Morton pointed out this bug long ago, but I was busy), is:
> 
> In futex.c:
> 
>   down_read(>mm->mmap_sem);
>   get_futex_key(...) etc.
>   queue_me(...) etc.
>   current->flags |= PF_MMAP_SEM; <- new
>   ret = get_user(...);
>   current->flags &= PF_MMAP_SEM; <- new
>   /* the rest */
> 
> And in arch/*/mm/fault.c, replace every one of these:
> 
>   down_read(>mmap_sem);
> 
>   up_read(>mmap_sem);
> 
> with these:
> 
>   if (!(current & PF_MMAP_SEM))
>   down_read(>mmap_sem);
> 
>   if (!(current & PF_MMAP_SEM))
>   up_read(>mmap_sem);
> 

Yes, that will work.  However I do feel that it's cleaner to localise this
nastiness into a single function which the futex code calls, rather than
spreading it all around and adding overhead to every pagefault.  If we can
work out how.

wrt this down_read/down_write/down_read deadlock: iirc, the reason why
down_write() takes precedence over down_read() is to avoid the permanent
writer starvation which would occur if there is heavy down_read() traffic.

As Linus points out, an alternative would be to do an inc_preempt_count()
around the offending get_user(), then use __copy_from_user_inatomic(), then
take some sort of remedial action if __copy_from_user_inatomic() returns a
fault.  Something like:

retry:
if (get_user(uaddr) == -EFAULT)
return -EFAULT;
down_read(mmap_sem);
inc_preempt_count();
if (__copy_from_user_inatomic(..., uaddr)) {
up_read(mmap_sem);
dec_preempt_count();
goto retry;
}

dec_preempt_count();
up_read(mmap_sem);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Tue, 22 Feb 2005, Jamie Lokier wrote:
> 
> A much simpler solution (and sorry for not offering it earlier,
> because Andrew Morton pointed out this bug long ago, but I was busy), is:
> 
> In futex.c:
> 
>   down_read(>mm->mmap_sem);
>   get_futex_key(...) etc.
>   queue_me(...) etc.
>   current->flags |= PF_MMAP_SEM; <- new
>   ret = get_user(...);
>   current->flags &= PF_MMAP_SEM; <- new
>   /* the rest */

That is uglee. 

We really have this already, and it's called "current->preempt". It 
handles any lock at all, and doesn't add yet another special case to all 
the architectures.

Just do

repeat:
down_read(>mm->mmap_sem);
get_futex_key(...) etc.
queue_me(...) etc.
inc_preempt_count();
ret = get_user(...);
dec_preempt_count();
if (unlikely(ret)) {
up_read(>mm->mmap_sem);
/* Re-do the access outside the lock */
ret = get_user(...);
if (!ret)
goto repeat;
return ret;
}
...

and you should be ok.

No new special cases, no new abstractions. At most, we should probably 
create a "get_user_inatomic()", to 

 - make it damn obvious what we're doing, and match the explicit
   "inatomic" in the other place where we depend on this (fs/filemap.c)

 - allow the regular "get_user()" to continue to do the normal
   "might_sleep()" checks.

That's assuming we can't just make rwsem's nest nicely.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
> 
> Isn't Olof scheme racy ? Can't the stuff get swapped out between the
> first get_user() and the "real" one ?

Yes. But see my suggested modification (which I still think is "the thing 
that Olof does", except it's more efficient and avoids the race).

If rwsems acted like rwlocks, we wouldn't have this issue at all.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Chris Friesen wrote:
> > down_read(>mm->mmap_sem);
> > get_futex_key(...) etc.
> > queue_me(...) etc.
> > current->flags |= PF_MMAP_SEM; <- new
> > ret = get_user(...);
> > current->flags &= PF_MMAP_SEM; <- new
> > /* the rest */
> 
> Should the second new line be this (with the inverse)?
> 
>   current->flags &= ~PF_MMAP_SEM;

Quiet!  I was trying to sneak in a security hole! :)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Olof Johansson
On Tue, Feb 22, 2005 at 09:07:52PM +, Jamie Lokier wrote:
> 
> That won't work because the vma lock must be help between key
> calculation and get_user() - otherwise futex is not reliable.  It
> would work if the futex key calculation was inside the loop.

Sure, but that's still true: It's just that the get_user() is done twice
instead. The semaphore is never released between the key calculation and
the "real" get_user().

> A much simpler solution (and sorry for not offering it earlier,
> because Andrew Morton pointed out this bug long ago, but I was busy), is:

Either way works for me. Andrew/Linus, got a preference? I'll either
post my refresh based on Andrews comments, or code up Jamie's
suggestion.


-Olof
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Wed, 2005-02-23 at 08:16 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote:
> 
> > DavidH - what's the word on nested read-semaphores like this? Are they 
> > supposed to work (like nested read-spinlocks), or do we need to do the 
> > things Olof does?
> 
> Isn't Olof scheme racy ? Can't the stuff get swapped out between the
> first get_user() and the "real" one ?

Forget it, I missed the check_user_page_readable() guy within the
semaphore protection. I didn't know that function ;)

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Chris Friesen
Jamie Lokier wrote:
In futex.c:
down_read(>mm->mmap_sem);
get_futex_key(...) etc.
queue_me(...) etc.
current->flags |= PF_MMAP_SEM; <- new
ret = get_user(...);
current->flags &= PF_MMAP_SEM; <- new
/* the rest */
Should the second new line be this (with the inverse)?
current->flags &= ~PF_MMAP_SEM;
Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote:

> DavidH - what's the word on nested read-semaphores like this? Are they 
> supposed to work (like nested read-spinlocks), or do we need to do the 
> things Olof does?

Isn't Olof scheme racy ? Can't the stuff get swapped out between the
first get_user() and the "real" one ?

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Andrew Morton wrote:
> > This will quickly lock up, since the futex_wait code dows a
> > down_read(mmap_sem), then a get_user().
> > 
> > The do_page_fault code on ppc64 (as well as other architectures) needs
> > to take the same semaphore for reading. This is all good until the
> > second thread comes into play: Its mmap call tries to take the same
> > semaphore for writing which causes in the do_page_fault down_read()
> > to get stuck. Classic deadlock.
> 
> Yup.  Jamie says that the futex code _has_ to hold mmap_sem across the
> get_user().  I forget (but could probably locate) the details.

It does - the "key" which identifies a futex depends on a vma
calculation, and the vma must not change between the calculation and
the get_user().

> > One attempt to fix this is included below. It works, but I'm not entirely
> > happy with the fact that it's a bit messy solution. If anyone has a
> > better idea for how to solve it I'd be all ears.
> 
> It's fairly sane.  Style-wise I'd be inclined to turn this:
> 
>   down_read(>mm->mmap_sem);
>   while (!check_user_page_readable(current->mm, uaddr1)) {
>   up_read(>mm->mmap_sem);
>   /* Fault in the page through get_user() but discard result */
>   if (get_user(curval, (int __user *)uaddr1) != 0)
>   return -EFAULT;
>   down_read(>mm->mmap_sem);
>   }

That won't work because the vma lock must be help between key
calculation and get_user() - otherwise futex is not reliable.  It
would work if the futex key calculation was inside the loop.

A much simpler solution (and sorry for not offering it earlier,
because Andrew Morton pointed out this bug long ago, but I was busy), is:

In futex.c:

down_read(>mm->mmap_sem);
get_futex_key(...) etc.
queue_me(...) etc.
current->flags |= PF_MMAP_SEM; <- new
ret = get_user(...);
current->flags &= PF_MMAP_SEM; <- new
/* the rest */

And in arch/*/mm/fault.c, replace every one of these:

down_read(>mmap_sem);

up_read(>mmap_sem);

with these:

if (!(current & PF_MMAP_SEM))
down_read(>mmap_sem);

if (!(current & PF_MMAP_SEM))
up_read(>mmap_sem);

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Andrew Morton
[EMAIL PROTECTED] (Olof Johansson) wrote:
>
> Hi,
> 
> Consider a small testcase that spawns off two threads, either thread
> doing a loop of:
> 
>   buf = mmap /dev/zero MAP_SHARED for 0x10 bytes
>   call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in 
> said mmap
>   munmap(buf)
>   repeat
> 
> This will quickly lock up, since the futex_wait code dows a
> down_read(mmap_sem), then a get_user().
> 
> The do_page_fault code on ppc64 (as well as other architectures) needs
> to take the same semaphore for reading. This is all good until the
> second thread comes into play: Its mmap call tries to take the same
> semaphore for writing which causes in the do_page_fault down_read()
> to get stuck. Classic deadlock.

Yup.  Jamie says that the futex code _has_ to hold mmap_sem across the
get_user().  I forget (but could probably locate) the details.

>
> One attempt to fix this is included below. It works, but I'm not entirely
> happy with the fact that it's a bit messy solution. If anyone has a
> better idea for how to solve it I'd be all ears.

It's fairly sane.  Style-wise I'd be inclined to turn this:

down_read(>mm->mmap_sem);
while (!check_user_page_readable(current->mm, uaddr1)) {
up_read(>mm->mmap_sem);
/* Fault in the page through get_user() but discard result */
if (get_user(curval, (int __user *)uaddr1) != 0)
return -EFAULT;
down_read(>mm->mmap_sem);
}

into a standalone helper function.

> --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600
> +++ linux-2.5/mm/mempolicy.c  2005-02-21 16:43:08.0 -0600
> @@ -486,6 +486,7 @@
>   struct mm_struct *mm = current->mm;
>   struct vm_area_struct *vma = NULL;
>   struct mempolicy *pol = current->mempolicy;
> + DECLARE_BITMAP(nodes, MAX_NUMNODES);
>  
>   if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>   return -EINVAL;
> @@ -524,16 +525,21 @@
>   } else
>   pval = pol->policy;
>  
> + if (nmask)
> + get_zonemask(pol, nodes);
> +
> + if (vma) {
> + up_read(>mm->mmap_sem);
> + vma = NULL;
> + }
> +

OK.

>   err = -EFAULT;
>   if (policy && put_user(pval, policy))
>   goto out;
>  
>   err = 0;
> - if (nmask) {
> - DECLARE_BITMAP(nodes, MAX_NUMNODES);
> - get_zonemask(pol, nodes);
> + if (nmask)
>   err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes));
> - }
>  
>   out:
>   if (vma)

I don't think we need to hold mmap_sem while running get_zonemask().  `pol'
is a copy of current->mempolicy, and it won't be going away.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Tue, 22 Feb 2005, Olof Johansson wrote:
> 
> Consider a small testcase that spawns off two threads, either thread
> doing a loop of:
> 
>   buf = mmap /dev/zero MAP_SHARED for 0x10 bytes
>   call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in 
> said mmap
>   munmap(buf)
>   repeat
> 
> This will quickly lock up, since the futex_wait code dows a
> down_read(mmap_sem), then a get_user().
> 
> The do_page_fault code on ppc64 (as well as other architectures) needs
> to take the same semaphore for reading. This is all good until the
> second thread comes into play: Its mmap call tries to take the same
> semaphore for writing which causes in the do_page_fault down_read()
> to get stuck. Classic deadlock.

It shouldn't be. If one read writer is active, another should be able to 
come in, regardless of any pending writer trying to access it. At least 
that's always been the rule for the rw-spinlocks _exactly_ for this 
reaseon.

The rwsem code tries to be fairer, maybe it has broken this case in the 
name of fairness. I personally think fairness is overrated, and would 
rather have the rwsem implementation let readers come in. But if we really 
don't want that, then:

> One attempt to fix this is included below. It works, but I'm not entirely
> happy with the fact that it's a bit messy solution. If anyone has a
> better idea for how to solve it I'd be all ears.

We have a notion of "atomic get_user()" calls, which is a lot more 
efficient, and is already used for other cases where we have _real_
deadlocks (inode semaphore for reads into shared memory mappigns).

See "__copy_to_user_inatomic()", and in particular note that it
effectively _is_ legal to do user accesses with the preempt-count elevated 
or interrupts disabled to tell the page fault handler not to delay (then 
you have to check the error return, and if it returned an error you do it 
the slow way - drop all locks and do a non-atomic access).

> Auditing other read-takers of mmap_sem, I found one more exposure that
> could be solved by just moving code around.

DavidH - what's the word on nested read-semaphores like this? Are they 
supposed to work (like nested read-spinlocks), or do we need to do the 
things Olof does?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Tue, 22 Feb 2005, Olof Johansson wrote:
 
 Consider a small testcase that spawns off two threads, either thread
 doing a loop of:
 
   buf = mmap /dev/zero MAP_SHARED for 0x10 bytes
   call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in 
 said mmap
   munmap(buf)
   repeat
 
 This will quickly lock up, since the futex_wait code dows a
 down_read(mmap_sem), then a get_user().
 
 The do_page_fault code on ppc64 (as well as other architectures) needs
 to take the same semaphore for reading. This is all good until the
 second thread comes into play: Its mmap call tries to take the same
 semaphore for writing which causes in the do_page_fault down_read()
 to get stuck. Classic deadlock.

It shouldn't be. If one read writer is active, another should be able to 
come in, regardless of any pending writer trying to access it. At least 
that's always been the rule for the rw-spinlocks _exactly_ for this 
reaseon.

The rwsem code tries to be fairer, maybe it has broken this case in the 
name of fairness. I personally think fairness is overrated, and would 
rather have the rwsem implementation let readers come in. But if we really 
don't want that, then:

 One attempt to fix this is included below. It works, but I'm not entirely
 happy with the fact that it's a bit messy solution. If anyone has a
 better idea for how to solve it I'd be all ears.

We have a notion of atomic get_user() calls, which is a lot more 
efficient, and is already used for other cases where we have _real_
deadlocks (inode semaphore for reads into shared memory mappigns).

See __copy_to_user_inatomic(), and in particular note that it
effectively _is_ legal to do user accesses with the preempt-count elevated 
or interrupts disabled to tell the page fault handler not to delay (then 
you have to check the error return, and if it returned an error you do it 
the slow way - drop all locks and do a non-atomic access).

 Auditing other read-takers of mmap_sem, I found one more exposure that
 could be solved by just moving code around.

DavidH - what's the word on nested read-semaphores like this? Are they 
supposed to work (like nested read-spinlocks), or do we need to do the 
things Olof does?

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Andrew Morton
[EMAIL PROTECTED] (Olof Johansson) wrote:

 Hi,
 
 Consider a small testcase that spawns off two threads, either thread
 doing a loop of:
 
   buf = mmap /dev/zero MAP_SHARED for 0x10 bytes
   call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in 
 said mmap
   munmap(buf)
   repeat
 
 This will quickly lock up, since the futex_wait code dows a
 down_read(mmap_sem), then a get_user().
 
 The do_page_fault code on ppc64 (as well as other architectures) needs
 to take the same semaphore for reading. This is all good until the
 second thread comes into play: Its mmap call tries to take the same
 semaphore for writing which causes in the do_page_fault down_read()
 to get stuck. Classic deadlock.

Yup.  Jamie says that the futex code _has_ to hold mmap_sem across the
get_user().  I forget (but could probably locate) the details.


 One attempt to fix this is included below. It works, but I'm not entirely
 happy with the fact that it's a bit messy solution. If anyone has a
 better idea for how to solve it I'd be all ears.

It's fairly sane.  Style-wise I'd be inclined to turn this:

down_read(current-mm-mmap_sem);
while (!check_user_page_readable(current-mm, uaddr1)) {
up_read(current-mm-mmap_sem);
/* Fault in the page through get_user() but discard result */
if (get_user(curval, (int __user *)uaddr1) != 0)
return -EFAULT;
down_read(current-mm-mmap_sem);
}

into a standalone helper function.

 --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600
 +++ linux-2.5/mm/mempolicy.c  2005-02-21 16:43:08.0 -0600
 @@ -486,6 +486,7 @@
   struct mm_struct *mm = current-mm;
   struct vm_area_struct *vma = NULL;
   struct mempolicy *pol = current-mempolicy;
 + DECLARE_BITMAP(nodes, MAX_NUMNODES);
  
   if (flags  ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
   return -EINVAL;
 @@ -524,16 +525,21 @@
   } else
   pval = pol-policy;
  
 + if (nmask)
 + get_zonemask(pol, nodes);
 +
 + if (vma) {
 + up_read(current-mm-mmap_sem);
 + vma = NULL;
 + }
 +

OK.

   err = -EFAULT;
   if (policy  put_user(pval, policy))
   goto out;
  
   err = 0;
 - if (nmask) {
 - DECLARE_BITMAP(nodes, MAX_NUMNODES);
 - get_zonemask(pol, nodes);
 + if (nmask)
   err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes));
 - }
  
   out:
   if (vma)

I don't think we need to hold mmap_sem while running get_zonemask().  `pol'
is a copy of current-mempolicy, and it won't be going away.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Andrew Morton wrote:
  This will quickly lock up, since the futex_wait code dows a
  down_read(mmap_sem), then a get_user().
  
  The do_page_fault code on ppc64 (as well as other architectures) needs
  to take the same semaphore for reading. This is all good until the
  second thread comes into play: Its mmap call tries to take the same
  semaphore for writing which causes in the do_page_fault down_read()
  to get stuck. Classic deadlock.
 
 Yup.  Jamie says that the futex code _has_ to hold mmap_sem across the
 get_user().  I forget (but could probably locate) the details.

It does - the key which identifies a futex depends on a vma
calculation, and the vma must not change between the calculation and
the get_user().

  One attempt to fix this is included below. It works, but I'm not entirely
  happy with the fact that it's a bit messy solution. If anyone has a
  better idea for how to solve it I'd be all ears.
 
 It's fairly sane.  Style-wise I'd be inclined to turn this:
 
   down_read(current-mm-mmap_sem);
   while (!check_user_page_readable(current-mm, uaddr1)) {
   up_read(current-mm-mmap_sem);
   /* Fault in the page through get_user() but discard result */
   if (get_user(curval, (int __user *)uaddr1) != 0)
   return -EFAULT;
   down_read(current-mm-mmap_sem);
   }

That won't work because the vma lock must be help between key
calculation and get_user() - otherwise futex is not reliable.  It
would work if the futex key calculation was inside the loop.

A much simpler solution (and sorry for not offering it earlier,
because Andrew Morton pointed out this bug long ago, but I was busy), is:

In futex.c:

down_read(current-mm-mmap_sem);
get_futex_key(...) etc.
queue_me(...) etc.
current-flags |= PF_MMAP_SEM; - new
ret = get_user(...);
current-flags = PF_MMAP_SEM; - new
/* the rest */

And in arch/*/mm/fault.c, replace every one of these:

down_read(mm-mmap_sem);

up_read(mm-mmap_sem);

with these:

if (!(current  PF_MMAP_SEM))
down_read(mm-mmap_sem);

if (!(current  PF_MMAP_SEM))
up_read(mm-mmap_sem);

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote:

 DavidH - what's the word on nested read-semaphores like this? Are they 
 supposed to work (like nested read-spinlocks), or do we need to do the 
 things Olof does?

Isn't Olof scheme racy ? Can't the stuff get swapped out between the
first get_user() and the real one ?

Ben.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Chris Friesen
Jamie Lokier wrote:
In futex.c:
down_read(current-mm-mmap_sem);
get_futex_key(...) etc.
queue_me(...) etc.
current-flags |= PF_MMAP_SEM; - new
ret = get_user(...);
current-flags = PF_MMAP_SEM; - new
/* the rest */
Should the second new line be this (with the inverse)?
current-flags = ~PF_MMAP_SEM;
Chris
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Wed, 2005-02-23 at 08:16 +1100, Benjamin Herrenschmidt wrote:
 On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote:
 
  DavidH - what's the word on nested read-semaphores like this? Are they 
  supposed to work (like nested read-spinlocks), or do we need to do the 
  things Olof does?
 
 Isn't Olof scheme racy ? Can't the stuff get swapped out between the
 first get_user() and the real one ?

Forget it, I missed the check_user_page_readable() guy within the
semaphore protection. I didn't know that function ;)

Ben.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Olof Johansson
On Tue, Feb 22, 2005 at 09:07:52PM +, Jamie Lokier wrote:
 
 That won't work because the vma lock must be help between key
 calculation and get_user() - otherwise futex is not reliable.  It
 would work if the futex key calculation was inside the loop.

Sure, but that's still true: It's just that the get_user() is done twice
instead. The semaphore is never released between the key calculation and
the real get_user().

 A much simpler solution (and sorry for not offering it earlier,
 because Andrew Morton pointed out this bug long ago, but I was busy), is:

Either way works for me. Andrew/Linus, got a preference? I'll either
post my refresh based on Andrews comments, or code up Jamie's
suggestion.


-Olof
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Chris Friesen wrote:
  down_read(current-mm-mmap_sem);
  get_futex_key(...) etc.
  queue_me(...) etc.
  current-flags |= PF_MMAP_SEM; - new
  ret = get_user(...);
  current-flags = PF_MMAP_SEM; - new
  /* the rest */
 
 Should the second new line be this (with the inverse)?
 
   current-flags = ~PF_MMAP_SEM;

Quiet!  I was trying to sneak in a security hole! :)

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
 
 Isn't Olof scheme racy ? Can't the stuff get swapped out between the
 first get_user() and the real one ?

Yes. But see my suggested modification (which I still think is the thing 
that Olof does, except it's more efficient and avoids the race).

If rwsems acted like rwlocks, we wouldn't have this issue at all.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Tue, 22 Feb 2005, Jamie Lokier wrote:
 
 A much simpler solution (and sorry for not offering it earlier,
 because Andrew Morton pointed out this bug long ago, but I was busy), is:
 
 In futex.c:
 
   down_read(current-mm-mmap_sem);
   get_futex_key(...) etc.
   queue_me(...) etc.
   current-flags |= PF_MMAP_SEM; - new
   ret = get_user(...);
   current-flags = PF_MMAP_SEM; - new
   /* the rest */

That is uglee. 

We really have this already, and it's called current-preempt. It 
handles any lock at all, and doesn't add yet another special case to all 
the architectures.

Just do

repeat:
down_read(current-mm-mmap_sem);
get_futex_key(...) etc.
queue_me(...) etc.
inc_preempt_count();
ret = get_user(...);
dec_preempt_count();
if (unlikely(ret)) {
up_read(current-mm-mmap_sem);
/* Re-do the access outside the lock */
ret = get_user(...);
if (!ret)
goto repeat;
return ret;
}
...

and you should be ok.

No new special cases, no new abstractions. At most, we should probably 
create a get_user_inatomic(), to 

 - make it damn obvious what we're doing, and match the explicit
   inatomic in the other place where we depend on this (fs/filemap.c)

 - allow the regular get_user() to continue to do the normal
   might_sleep() checks.

That's assuming we can't just make rwsem's nest nicely.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Andrew Morton
Jamie Lokier [EMAIL PROTECTED] wrote:

 ...
 
   One attempt to fix this is included below. It works, but I'm not entirely
   happy with the fact that it's a bit messy solution. If anyone has a
   better idea for how to solve it I'd be all ears.
  
  It's fairly sane.  Style-wise I'd be inclined to turn this:
  
  down_read(current-mm-mmap_sem);
  while (!check_user_page_readable(current-mm, uaddr1)) {
  up_read(current-mm-mmap_sem);
  /* Fault in the page through get_user() but discard result */
  if (get_user(curval, (int __user *)uaddr1) != 0)
  return -EFAULT;
  down_read(current-mm-mmap_sem);
  }
 
 That won't work because the vma lock must be help between key
 calculation and get_user() - otherwise futex is not reliable.  It
 would work if the futex key calculation was inside the loop.

All the above is trying to do is to convert the initial down_read(mmap_sem)
into a function which, on exit, guarantees that

a) down_read(mmap_sem) is held and

b) the subsequent get_user() of that address will not generate a pagefault.

So it shouldn't affect the futex code's atomicity at all.

However the pte can get unmapped by memory reclaim so we could still take a
minor fault, and hit the same deadlock, yes?

 A much simpler solution (and sorry for not offering it earlier,
 because Andrew Morton pointed out this bug long ago, but I was busy), is:
 
 In futex.c:
 
   down_read(current-mm-mmap_sem);
   get_futex_key(...) etc.
   queue_me(...) etc.
   current-flags |= PF_MMAP_SEM; - new
   ret = get_user(...);
   current-flags = PF_MMAP_SEM; - new
   /* the rest */
 
 And in arch/*/mm/fault.c, replace every one of these:
 
   down_read(mm-mmap_sem);
 
   up_read(mm-mmap_sem);
 
 with these:
 
   if (!(current  PF_MMAP_SEM))
   down_read(mm-mmap_sem);
 
   if (!(current  PF_MMAP_SEM))
   up_read(mm-mmap_sem);
 

Yes, that will work.  However I do feel that it's cleaner to localise this
nastiness into a single function which the futex code calls, rather than
spreading it all around and adding overhead to every pagefault.  If we can
work out how.

wrt this down_read/down_write/down_read deadlock: iirc, the reason why
down_write() takes precedence over down_read() is to avoid the permanent
writer starvation which would occur if there is heavy down_read() traffic.

As Linus points out, an alternative would be to do an inc_preempt_count()
around the offending get_user(), then use __copy_from_user_inatomic(), then
take some sort of remedial action if __copy_from_user_inatomic() returns a
fault.  Something like:

retry:
if (get_user(uaddr) == -EFAULT)
return -EFAULT;
down_read(mmap_sem);
inc_preempt_count();
if (__copy_from_user_inatomic(..., uaddr)) {
up_read(mmap_sem);
dec_preempt_count();
goto retry;
}

dec_preempt_count();
up_read(mmap_sem);
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Tue, 2005-02-22 at 13:31 -0800, Linus Torvalds wrote:
 
 On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
  
  Isn't Olof scheme racy ? Can't the stuff get swapped out between the
  first get_user() and the real one ?
 
 Yes. But see my suggested modification (which I still think is the thing 
 that Olof does, except it's more efficient and avoids the race).
 
 If rwsems acted like rwlocks, we wouldn't have this issue at all.

Yours is probably the most efficient too. Note sure what is best for
rwsems tho, there seem to be some interest preventing readers from
starving writers for ever, this has been debated endlessly iirc,
though I have no personal opinion there.

Ben.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Tue, 22 Feb 2005, Andrew Morton wrote:
 
 However the pte can get unmapped by memory reclaim so we could still take a
 minor fault, and hit the same deadlock, yes?

You _could_ fix that by getting the pagetable spinlock, I guess. Which
check_user_page_readable() assumes you'd be holding anyway (not holding it
would appear to be a bug).

At which point you might as well just walk the tables by hand and just do 
the read that way. Of course, then you have virtual aliasing issues etc.

Insane, but possible.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Olof Johansson wrote:
  That won't work because the vma lock must be help between key
  calculation and get_user() - otherwise futex is not reliable.  It
  would work if the futex key calculation was inside the loop.
 
 Sure, but that's still true: It's just that the get_user() is done twice
 instead. The semaphore is never released between the key calculation and
 the real get_user().

Ah, I didn't look at where the loop is used and didn't think there'd
be _two_ get_user() calls in the fast case.  Not my instinct.

  A much simpler solution (and sorry for not offering it earlier,
  because Andrew Morton pointed out this bug long ago, but I was busy), is:
 
 Either way works for me. Andrew/Linus, got a preference? I'll either
 post my refresh based on Andrews comments, or code up Jamie's
 suggestion.

Yours has a couple of problems.

   1. It'll make futex waits somewhat slower.  One of the nicer features
  of 2.6 futexes is that we got rid of the explicit page table lookup.

   2. It's broken because a page can be paged out by another thread
  after you've forced it in and before the get_user().  We only
  take mmap_sem, not the page table lock.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Linus Torvalds


On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote:
 
 Yours is probably the most efficient too. Note sure what is best for
 rwsems tho, there seem to be some interest preventing readers from
 starving writers for ever, this has been debated endlessly iirc,
 though I have no personal opinion there.

Yes, the starvation issue is potentially real. And thinking about it,
we've even had that in real life, with /proc and lots of page faults. So I
guess that's a strong argument for the fairness thing.

Oh, well. The reason I hate the rwsem behaviour is exactly because it
results in this very subtle class of deadlocks. This one case is certainly
solvable several ways, but do we have other issues somewhere else? Things
like kobject might be ripe with things like this. The mm semaphore tends
to be pretty well-behaved - and I'm not sure the same is true of the
kobject one.

Normal recursive deadlocks are wonderful - most of them show up
immediately, so assuming you just have enough coverage, you're fine. This
fairness-related deadlock requires a race to happen.

Maybe it would be sufficient to have a debugging version of rwsems that
just notice recursion?

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Benjamin Herrenschmidt
On Tue, 2005-02-22 at 14:10 -0800, Linus Torvalds wrote:

 Oh, well. The reason I hate the rwsem behaviour is exactly because it
 results in this very subtle class of deadlocks. This one case is certainly
 solvable several ways, but do we have other issues somewhere else? Things
 like kobject might be ripe with things like this. The mm semaphore tends
 to be pretty well-behaved - and I'm not sure the same is true of the
 kobject one.

We could detect those tho. When the appropriate DEBUG option is set, by
storing a cpumask in the semaphore we could detect if it's already taken
on this cpu...

 Normal recursive deadlocks are wonderful - most of them show up
 immediately, so assuming you just have enough coverage, you're fine. This
 fairness-related deadlock requires a race to happen.

Unless you consider that taking the read semaphore twice on the same CPU
is always a bug, thus the above stuff would work for catching them at
least more often...
 
 Maybe it would be sufficient to have a debugging version of rwsems that
 just notice recursion?
 
   Linus
-- 
Benjamin Herrenschmidt [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Linus Torvalds wrote:
  queue_me(...) etc.
  current-flags |= PF_MMAP_SEM; - new
  ret = get_user(...);
  current-flags = PF_MMAP_SEM; - new
  /* the rest */
 
 That is uglee. 
 
 We really have this already, and it's called current-preempt. It 
 handles any lock at all, and doesn't add yet another special case to all 
 the architectures.

Ooh, I didn't know current-preempt did that (been away).

   repeat:
   down_read(current-mm-mmap_sem);
   get_futex_key(...) etc.
   queue_me(...) etc.
   inc_preempt_count();
   ret = get_user(...);
   dec_preempt_count();
   if (unlikely(ret)) {
   up_read(current-mm-mmap_sem);
   /* Re-do the access outside the lock */
   ret = get_user(...);
   if (!ret)
   goto repeat;
   return ret;
   }

That would work.  I like it. :)

Page faults will enter the fault handler twice (i.e. slower), but
that's not really a disadvantage, because a program always references
the memory just before calling futex_wait anyway.  A fault is rare.

There is one small but important error: the return ret mustn't just
return.  It must call unqueue_me(q) just like the code at out_unqueue,
_including_ the conditional ret = 0, but _excluding_ the up_read().

Alternatively, since it's a rare case, just shuffle the loop around:

down_read(current-mm-mmap_sem);
repeat:
get_futex_key(...) etc.
queue_me(...) etc.
inc_preempt_count();
ret = get_user(...);
dec_preempt_count();
if (unlikely(ret)) {
up_read(current-mm-mmap_sem);
/* Re-do the access outside the lock */
ret = get_user(...);
down_read(current-mm-mmap_sem);
if (!ret)
goto repeat;
goto out_unqueue;
}

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Olof Johansson
On Tue, Feb 22, 2005 at 10:34:57PM +, Jamie Lokier wrote:
 There is one small but important error: the return ret mustn't just
 return.  It must call unqueue_me(q) just like the code at out_unqueue,
 _including_ the conditional ret = 0, but _excluding_ the up_read().

Not only that, but someone might already have dequeued us, right? It's
probably a pathological case though, i.e. someone did a wake on the same
(bad) address.

How's this patch? It's closer to Linus' pseudo-code than Andrew's, to
avoid the extra get_user() at function entry and keep the common case
path short.

It also includes the feedback from Andrew on the sys_get_mempolicy(),
making the patch even simpler there.




Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson [EMAIL PROTECTED]


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-22 16:38:24.0 -0600
@@ -329,6 +329,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr1, key1);
@@ -355,9 +356,19 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   inc_preempt_count();
+   ret = get_user(curval, (int __user *)uaddr1);
+   dec_preempt_count();
+
+   if (unlikely(ret)) {
+   up_read(current-mm-mmap_sem);
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +491,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr, q.key);
@@ -508,9 +520,21 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+   inc_preempt_count();
+   ret = get_user(curval, (int __user *)uaddr);
+   dec_preempt_count();
+   if (unlikely(ret)) {
+   up_read(current-mm-mmap_sem);
+
+   if (!unqueue_me(q)) /* There's a chance we got woken already */
+   return 0;
+
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol-policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(current-mm-mmap_sem);
+   vma = NULL;
+   }
+
if (policy  put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Greg KH
On Tue, Feb 22, 2005 at 02:10:58PM -0800, Linus Torvalds wrote:
 
 Oh, well. The reason I hate the rwsem behaviour is exactly because it
 results in this very subtle class of deadlocks. This one case is certainly
 solvable several ways, but do we have other issues somewhere else? Things
 like kobject might be ripe with things like this. The mm semaphore tends
 to be pretty well-behaved - and I'm not sure the same is true of the
 kobject one.

I'm trying to get rid of the kobject (actually the subsystem) rwsem
right now, so it should be gone completly within a few kernel versions.

thanks,

greg k-h
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Andrew Morton
[EMAIL PROTECTED] (Olof Johansson) wrote:

 + inc_preempt_count();
 + ret = get_user(curval, (int __user *)uaddr1);
 + dec_preempt_count();

That _should_ generate a might_sleep() warning, except it looks like we
forgot to add a check to get_user().

It would be better to use __copy_from_user_inatomic() here, I think.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Olof Johansson
On Tue, Feb 22, 2005 at 03:20:27PM -0800, Andrew Morton wrote:
 [EMAIL PROTECTED] (Olof Johansson) wrote:
 
  +   inc_preempt_count();
  +   ret = get_user(curval, (int __user *)uaddr1);
  +   dec_preempt_count();
 
 That _should_ generate a might_sleep() warning, except it looks like we
 forgot to add a check to get_user().
 
 It would be better to use __copy_from_user_inatomic() here, I think.

Thanks for catching that. New rev below.

-

Some futex functions do get_user calls while holding mmap_sem for
reading. If get_user() faults, and another thread happens to be in mmap
(or somewhere else holding waiting on down_write for the same semaphore),
then do_page_fault will deadlock. Most architectures seem to be exposed
to this.

To avoid it, make sure the page is available. If not, release the
semaphore, fault it in and retry.

I also found another exposure by inspection, moving some of the code
around avoids the possible deadlock there.

Signed-off-by: Olof Johansson [EMAIL PROTECTED]


Index: linux-2.5/kernel/futex.c
===
--- linux-2.5.orig/kernel/futex.c   2005-02-21 16:09:38.0 -0600
+++ linux-2.5/kernel/futex.c2005-02-22 17:20:22.0 -0600
@@ -329,6 +329,7 @@
int ret, drop_count = 0;
unsigned int nqueued;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr1, key1);
@@ -355,9 +356,19 @@
   before *uaddr1.  */
smp_mb();
 
-   if (get_user(curval, (int __user *)uaddr1) != 0) {
-   ret = -EFAULT;
-   goto out;
+   inc_preempt_count();
+   ret = __copy_from_user_inatomic(curval, (int __user *)uaddr1, 
sizeof(int));
+   dec_preempt_count();
+
+   if (unlikely(ret)) {
+   up_read(current-mm-mmap_sem);
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr1);
+
+   if (!ret)
+   goto retry;
+
+   return ret;
}
if (curval != *valp) {
ret = -EAGAIN;
@@ -480,6 +491,7 @@
int ret, curval;
struct futex_q q;
 
+ retry:
down_read(current-mm-mmap_sem);
 
ret = get_futex_key(uaddr, q.key);
@@ -508,9 +520,21 @@
 * We hold the mmap semaphore, so the mapping cannot have changed
 * since we looked it up in get_futex_key.
 */
-   if (get_user(curval, (int __user *)uaddr) != 0) {
-   ret = -EFAULT;
-   goto out_unqueue;
+   inc_preempt_count();
+   ret = __copy_from_user_inatomic(curval, (int __user *)uaddr, 
sizeof(int));
+   dec_preempt_count();
+   if (unlikely(ret)) {
+   up_read(current-mm-mmap_sem);
+
+   if (!unqueue_me(q)) /* There's a chance we got woken already */
+   return 0;
+
+   /* Re-do the access outside the lock */
+   ret = get_user(curval, (int __user *)uaddr);
+
+   if (!ret)
+   goto retry;
+   return ret;
}
if (curval != val) {
ret = -EWOULDBLOCK;
Index: linux-2.5/mm/mempolicy.c
===
--- linux-2.5.orig/mm/mempolicy.c   2005-02-04 00:27:40.0 -0600
+++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600
@@ -524,9 +524,13 @@
} else
pval = pol-policy;
 
-   err = -EFAULT;
+   if (vma) {
+   up_read(current-mm-mmap_sem);
+   vma = NULL;
+   }
+
if (policy  put_user(pval, policy))
-   goto out;
+   return -EFAULT;
 
err = 0;
if (nmask) {
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/