Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: > How's this? I went with get_val_no_fault(), since it isn't really a > get_user.*() any more (ptr being passed in), and no_paging is a little > misleading (not all faults are due to paging). How ironic: I deliberately didn't choose "no_fault" because that function *does* take page faults. It only disables paging operations! :) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Olof Johansson wrote: > > How's this? I went with get_val_no_fault(), since it isn't really a > get_user.*() any more (ptr being passed in), and no_paging is a little > misleading (not all faults are due to paging). Applied with minor cosmetic changes. I'm like a dog who likes to pee on things to show his territory, so I changed "get_val_no_fault" to "get_futex_value_locked", and I made sure that the return value is sensible (return 0 or -EFAULT rather than the "__memcpy_from_user()" return value which is how many bytes we couldn't copy). Not that we care (we just check the return value against zero anyway, which is success in both cases), but the compiler should be able to optimize it away, and it might avoid some confusion down the line.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 06:49:46PM +, Jamie Lokier wrote: > Linus Torvalds wrote: > > > I suggest putting it into futex.c, and make it an inline function > > > which takes "u32 __user *". > > > > Agreed, except we've traditionally just made it "int __user *". > > The type signatures in futex.c are a bit mixed up - most places say > "int __user *" but sys_futex() says "u32 __user *". get_futex_key > uses sizeof(u32) to check the address. How's this? I went with get_val_no_fault(), since it isn't really a get_user.*() any more (ptr being passed in), and no_paging is a little misleading (not all faults are due to paging). - Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson <[EMAIL PROTECTED]> Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-23 13:10:16.0 -0600 @@ -258,6 +258,18 @@ } } +static inline int get_val_no_fault(int *dest, int __user *from) +{ + int ret; + + inc_preempt_count(); + ret = __copy_from_user_inatomic(dest, from, sizeof(int)); + dec_preempt_count(); + preempt_check_resched(); + + return ret; +} + /* * The hash bucket lock must be held when this is called. * Afterwards, the futex_q must not be accessed. @@ -329,6 +341,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr1, ); @@ -355,9 +368,20 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + ret = get_val_no_fault(, (int __user *)uaddr1); + + if (unlikely(ret)) { + /* If we would have faulted, release mmap_sem, fault +* it in and start all over again. +*/ + up_read(>mm->mmap_sem); + + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +504,7 @@ int ret, curval; struct futex_q q; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr, ); @@ -508,9 +533,23 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + + ret = get_val_no_fault(, (int __user *)uaddr); + + if (unlikely(ret)) { + /* If we would have faulted, release mmap_sem, fault it in and +* start all over again. +*/ + up_read(>mm->mmap_sem); + + if (!unqueue_me()) /* There's a chance we got woken already */ + return 0; + + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-23 12:53:22.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol->policy; - err = -EFAULT; + if (vma) { + up_read(>mm->mmap_sem); + vma = NULL; + } + if (policy && put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds wrote: > > I suggest putting it into futex.c, and make it an inline function > > which takes "u32 __user *". > > Agreed, except we've traditionally just made it "int __user *". The type signatures in futex.c are a bit mixed up - most places say "int __user *" but sys_futex() says "u32 __user *". get_futex_key uses sizeof(u32) to check the address. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson <[EMAIL PROTECTED]> wrote: > > Alternately, you could just have do_page_fault() do: > > > > while (!down_read_trylock(>mm->mmap_sem)) > > continue; > > > > However, note that this can suffer from starvation due to a never ending > > flow of mixed write-locks and read-locks on other CPUs. Unlikely, true, > > but not impossible. > > How can this help? > > The semaphore is held for reading by the thread that faulted in > futex_wait() -> get_user(), so no writers will be let through. Until the > writer has been let through, the down_read_trylock will never succeed > either. No forward progress can be made even with the above loop. You're right. The "writers" would have to spin instead. David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 06:22:04PM +, Jamie Lokier wrote: > Olof Johansson wrote: > > On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: > > > > > > Otherwise, a preempt attempt in get_user would not be seen > > > > until some future preempt_enable was executed. > > > > > > True. I guess we should have a "preempt_check_resched()" there too. > > > That's > > > what "kunmap_atomic()" does too (which is what we rely on in the other > > > case we do this..) > > > > Ok, this is getting complex enough to warrant get_user_inatomic(), > > which means adding it to every arch's uaccess.h. > > > > Below patch does so. Unfortunately I don't have a Viro setup with cross > > compilers for nearly every arch, so I can't make sure it doesn't break > > anything. But since I pasted the same code everywhere it shouldn't. > > My turn to say uglee. Yeah, I wasn't entirely happy about it, but it seems that suggestions are coming in on how to do it better. :-) > Firstly, get_user_inatomic is the wrong name. > > "inatomic" in __copy_from_user_inatomic means it's called inside a > non-premptable region (in atomic...). > > Your macro get_user_inatomic is _not_ called inside a > non-preemptable region, so it shouldn't be called "inatomic". > > (A better name is get_user_no_paging). > > Secondly, does this _one_ use (it's not likely to be used elsewhere) > justify copying & pasting the same code into every asm-*/uaccess, > especially when the code is not in any way arch-specific? Arjan suggested creating a linux/uaccess.h that includes asm/uaccess.h, and start moving the users over since that's where the trend is moving anyway (avoiding including asm/* from common code). futex.c would be the first user, and could be followed by more later as a janitorial patch. That'd mean only one addition of the common function instead of having to add it to every arch. > I suggest putting it into futex.c, and make it an inline function > which takes "u32 __user *". Sure, for now that's good enough. Above janitorial work could be done later, if more users get introduced. -Olof - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Jamie Lokier wrote: > > I suggest putting it into futex.c, and make it an inline function > which takes "u32 __user *". Agreed, except we've traditionally just made it "int __user *". Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: > On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: > > > > Otherwise, a preempt attempt in get_user would not be seen > > > until some future preempt_enable was executed. > > > > True. I guess we should have a "preempt_check_resched()" there too. That's > > what "kunmap_atomic()" does too (which is what we rely on in the other > > case we do this..) > > Ok, this is getting complex enough to warrant get_user_inatomic(), > which means adding it to every arch's uaccess.h. > > Below patch does so. Unfortunately I don't have a Viro setup with cross > compilers for nearly every arch, so I can't make sure it doesn't break > anything. But since I pasted the same code everywhere it shouldn't. My turn to say uglee. Firstly, get_user_inatomic is the wrong name. "inatomic" in __copy_from_user_inatomic means it's called inside a non-premptable region (in atomic...). Your macro get_user_inatomic is _not_ called inside a non-preemptable region, so it shouldn't be called "inatomic". (A better name is get_user_no_paging). Secondly, does this _one_ use (it's not likely to be used elsewhere) justify copying & pasting the same code into every asm-*/uaccess, especially when the code is not in any way arch-specific? I suggest putting it into futex.c, and make it an inline function which takes "u32 __user *". -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 2005-02-23 at 11:10 -0600, Olof Johansson wrote: > On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: > > > > Otherwise, a preempt attempt in get_user would not be seen > > > until some future preempt_enable was executed. > > > > True. I guess we should have a "preempt_check_resched()" there too. That's > > what "kunmap_atomic()" does too (which is what we rely on in the other > > case we do this..) > > Ok, this is getting complex enough to warrant get_user_inatomic(), > which means adding it to every arch's uaccess.h. > > Below patch does so. Unfortunately I don't have a Viro setup with cross > compilers for nearly every arch, so I can't make sure it doesn't break > anything. But since I pasted the same code everywhere it shouldn't. I hate to do this to you, but get_user is a bit horrible in that it is an untyped interface. Fixing it is hard (ugh) but when adding new variants should/could we consider to please make it a typed (eg inline and not a define) interface please ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: > > Otherwise, a preempt attempt in get_user would not be seen > > until some future preempt_enable was executed. > > True. I guess we should have a "preempt_check_resched()" there too. That's > what "kunmap_atomic()" does too (which is what we rely on in the other > case we do this..) Ok, this is getting complex enough to warrant get_user_inatomic(), which means adding it to every arch's uaccess.h. Below patch does so. Unfortunately I don't have a Viro setup with cross compilers for nearly every arch, so I can't make sure it doesn't break anything. But since I pasted the same code everywhere it shouldn't. - Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson <[EMAIL PROTECTED]> Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-23 10:55:37.0 -0600 @@ -329,6 +329,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr1, ); @@ -355,9 +356,17 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + ret = get_user_inatomic(curval, (int __user *)uaddr1); + + if (unlikely(ret)) { + up_read(>mm->mmap_sem); + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +489,7 @@ int ret, curval; struct futex_q q; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr, ); @@ -508,9 +518,21 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + + ret = get_user_inatomic(curval, (int __user *)uaddr); + + if (unlikely(ret)) { + up_read(>mm->mmap_sem); + + if (!unqueue_me()) /* There's a chance we got woken already */ + return 0; + + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-23 10:16:46.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol->policy; - err = -EFAULT; + if (vma) { + up_read(>mm->mmap_sem); + vma = NULL; + } + if (policy && put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { Index: linux-2.5/include/asm-ppc64/uaccess.h === --- linux-2.5.orig/include/asm-ppc64/uaccess.h 2005-02-04 00:27:13.0 -0600 +++ linux-2.5/include/asm-ppc64/uaccess.h 2005-02-23 11:01:44.0 -0600 @@ -217,6 +217,17 @@ : "=r"(err), "=r"(x)\ : "b"(addr), "i"(errret), "0"(err)) + +#define get_user_inatomic(x,ptr) \ +({ \ + int __ret; \ + inc_preempt_count();\ + __ret = __copy_from_user_inatomic(&(x),(ptr),sizeof(*(ptr))); \ + dec_preempt_count();\ + preempt_check_resched();\ + __ret; \ +}) + /* more complex routines
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 11:39:08AM +, David Howells wrote: > Alternately, you could just have do_page_fault() do: > > while (!down_read_trylock(>mm->mmap_sem)) > continue; > > However, note that this can suffer from starvation due to a never ending flow > of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not > impossible. How can this help? The semaphore is held for reading by the thread that faulted in futex_wait() -> get_user(), so no writers will be let through. Until the writer has been let through, the down_read_trylock will never succeed either. No forward progress can be made even with the above loop. -Olof - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Joe Korty wrote: > > Perhaps this should be preempt_disable preempt_enable. No, the problem with preempt_disable/enable is that they go away if preemption is not enabled. So you really do have to do it by hand with the "inc_preempt_count". > Otherwise, a preempt attempt in get_user would not be seen > until some future preempt_enable was executed. True. I guess we should have a "preempt_check_resched()" there too. That's what "kunmap_atomic()" does too (which is what we rely on in the other case we do this..) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 01:30:27PM -0800, Linus Torvalds wrote: > > We really have this already, and it's called "current->preempt". It > handles any lock at all, and doesn't add yet another special case to all > the architectures. > > Just do > > repeat: > down_read(>mm->mmap_sem); > get_futex_key(...) etc. > queue_me(...) etc. > inc_preempt_count(); > ret = get_user(...); > dec_preempt_count(); Perhaps this should be preempt_disable preempt_enable. Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. Regards, Joe -- "Money can buy bandwidth, but latency is forever" -- John Mashey - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Andrew Morton <[EMAIL PROTECTED]> wrote: > wrt this down_read/down_write/down_read deadlock: iirc, the reason why > down_write() takes precedence over down_read() is to avoid the permanent > writer starvation which would occur if there is heavy down_read() traffic. down_write() doesn't actually take precedence over down_read() as such. The rwsems try to implement a strict FIFO queue (the fairness thing). David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds <[EMAIL PROTECTED]> wrote: > That is uglee. True. You could just wrap it up in inline functions and hide it in a header file as I suggested in the email I've just sent. > We really have this already, and it's called "current->preempt". It > handles any lock at all, and doesn't add yet another special case to all > the architectures. > > Just do > > repeat: Alternately, you could just have do_page_fault() do: while (!down_read_trylock(>mm->mmap_sem)) continue; However, note that this can suffer from starvation due to a never ending flow of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not impossible. There is yet another way, now that I think about it... Another rwsem could be added to mm_struct, one that you have to get immediately before mmap_sem and release immediately after; one that page fault doesn't ever touch... but that isn't very pleasant either:-/ A yet further way would be to make a second kind of rwsem; one that's unfair. But if you do that, a few threads that're swapping a lot, combined with a few tops could starve out another thread that's trying to do an mmap... Actually, there's an added benefit to your suggested method... It doesn't involve changing what's in task_struct or mm_struct and doesn't involve changing the semantics of the mmap_sem lock. > That's assuming we can't just make rwsem's nest nicely. rwsems themselves? No, not really. It'd either mean keeing track of which task holds what sort of lock on every rwsem (which'd be nice for debugging, granted; but not something you really want in a normal system), or it'd mean making rwsems unfair - something I think will be a really bad idea. David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds <[EMAIL PROTECTED]> wrote: > It shouldn't be. If one read writer is active, another should be able to > come in, regardless of any pending writer trying to access it. At least > that's always been the rule for the rw-spinlocks _exactly_ for this > reaseon. But not with rw-semaphores. I've designed them to be as fair as I can possibly make them. This means holding reads up if there's a write pending. > The rwsem code tries to be fairer, maybe it has broken this case in the > name of fairness. I personally think fairness is overrated, and would > rather have the rwsem implementation let readers come in. I've seen writer starvation happening due to a continuous flow of overlapping reads... That's one of the reasons I reimplemented rwsems. Also, fair rwsems are easier to provide an assembly optimised form for that doesn't involve the "thundering-herd" approach. Obviously, with the spinlock approach, you can implement any semantics you desire once you're holding the spinlock. With the optimised form as implemented, you can't determine how many active readers there are. This information isn't stored anywhere. > We have a notion of "atomic get_user()" calls, which is a lot more > efficient, and is already used for other cases where we have _real_ > deadlocks (inode semaphore for reads into shared memory mappigns). That's okay, provided the data you're trying to access isn't lurking on a disk somewhere. > > Auditing other read-takers of mmap_sem, I found one more exposure that > > could be solved by just moving code around. > > DavidH - what's the word on nested read-semaphores like this? Are they > supposed to work (like nested read-spinlocks), or do we need to do the > things Olof does? Nesting rwsems like this is definitely on the not-recommended list. For the special case of the mmap semaphore, I'd advocate following something like Jamie Lokier's solution, and note when a task holds its own mmap_sem semaphore read-locked such that page-fault can avoid taking it again. Something like: struct task_struct { ... unsigned mmsem_nest; ... }; static inline void lock_mm_for_read(struct mm_struct *mm) { if (current->mm == mm && current->mmsem_nest++ != 0) ; else down_read(>mmap_sem); } static inline void unlock_mm_for_read(struct mm_struct *mm) { if (current->mm == mm && --current->mmsem_nest != 0) ; else up_read(>mmap_sem); } static inline void lock_mm_for_write(struct mm_struct *mm) { down_write(>mmap_sem); } static inline void unlock_mm_for_write(struct mm_struct *mm) { up_write(>mmap_sem); } Though I'd be tempted to say that faulting is the only case in which recursion is permitted, and change the first two functions to reflect this and add an extra pair specially for page-fault: static inline void lock_mm_for_read(struct mm_struct *mm) { down_read(>mmap_sem); if (current->mm == mm && current->flags |= PF_MMAP_SEM) } static inline void unlock_mm_for_read(struct mm_struct *mm) { if (current->mm == mm && current->flags &= ~PF_MMAP_SEM) up_read(>mmap_sem); } static inline void lock_mm_for_fault(struct mm_struct *mm) { if (!(current->flags & PF_MMAP_SEM)) down_read(>mmap_sem); } static inline void unlock_mm_for_fault(struct mm_struct *mm) { if (!(current->flags & PF_MMAP_SEM)) up_read(>mmap_sem); } If current->mm is passed as the argument, the compiler's optimiser should discard the first comparison in the if-statement. Then futex.c would hold: lock_mm_for_read(>mm); get_futex_key(...) etc. queue_me(...) etc. ret = get_user(...); /* the rest */ unlock_mm_for_read(>mm); And do_page_fault() would hold: lock_mm_for_fault(>mm); ... unlock_mm_for_fault(>mm); David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds [EMAIL PROTECTED] wrote: It shouldn't be. If one read writer is active, another should be able to come in, regardless of any pending writer trying to access it. At least that's always been the rule for the rw-spinlocks _exactly_ for this reaseon. But not with rw-semaphores. I've designed them to be as fair as I can possibly make them. This means holding reads up if there's a write pending. The rwsem code tries to be fairer, maybe it has broken this case in the name of fairness. I personally think fairness is overrated, and would rather have the rwsem implementation let readers come in. I've seen writer starvation happening due to a continuous flow of overlapping reads... That's one of the reasons I reimplemented rwsems. Also, fair rwsems are easier to provide an assembly optimised form for that doesn't involve the thundering-herd approach. Obviously, with the spinlock approach, you can implement any semantics you desire once you're holding the spinlock. With the optimised form as implemented, you can't determine how many active readers there are. This information isn't stored anywhere. We have a notion of atomic get_user() calls, which is a lot more efficient, and is already used for other cases where we have _real_ deadlocks (inode semaphore for reads into shared memory mappigns). That's okay, provided the data you're trying to access isn't lurking on a disk somewhere. Auditing other read-takers of mmap_sem, I found one more exposure that could be solved by just moving code around. DavidH - what's the word on nested read-semaphores like this? Are they supposed to work (like nested read-spinlocks), or do we need to do the things Olof does? Nesting rwsems like this is definitely on the not-recommended list. For the special case of the mmap semaphore, I'd advocate following something like Jamie Lokier's solution, and note when a task holds its own mmap_sem semaphore read-locked such that page-fault can avoid taking it again. Something like: struct task_struct { ... unsigned mmsem_nest; ... }; static inline void lock_mm_for_read(struct mm_struct *mm) { if (current-mm == mm current-mmsem_nest++ != 0) ; else down_read(mm-mmap_sem); } static inline void unlock_mm_for_read(struct mm_struct *mm) { if (current-mm == mm --current-mmsem_nest != 0) ; else up_read(mm-mmap_sem); } static inline void lock_mm_for_write(struct mm_struct *mm) { down_write(mm-mmap_sem); } static inline void unlock_mm_for_write(struct mm_struct *mm) { up_write(mm-mmap_sem); } Though I'd be tempted to say that faulting is the only case in which recursion is permitted, and change the first two functions to reflect this and add an extra pair specially for page-fault: static inline void lock_mm_for_read(struct mm_struct *mm) { down_read(mm-mmap_sem); if (current-mm == mm current-flags |= PF_MMAP_SEM) } static inline void unlock_mm_for_read(struct mm_struct *mm) { if (current-mm == mm current-flags = ~PF_MMAP_SEM) up_read(mm-mmap_sem); } static inline void lock_mm_for_fault(struct mm_struct *mm) { if (!(current-flags PF_MMAP_SEM)) down_read(mm-mmap_sem); } static inline void unlock_mm_for_fault(struct mm_struct *mm) { if (!(current-flags PF_MMAP_SEM)) up_read(mm-mmap_sem); } If current-mm is passed as the argument, the compiler's optimiser should discard the first comparison in the if-statement. Then futex.c would hold: lock_mm_for_read(current-mm); get_futex_key(...) etc. queue_me(...) etc. ret = get_user(...); /* the rest */ unlock_mm_for_read(current-mm); And do_page_fault() would hold: lock_mm_for_fault(current-mm); ... unlock_mm_for_fault(current-mm); David - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds [EMAIL PROTECTED] wrote: That is uglee. True. You could just wrap it up in inline functions and hide it in a header file as I suggested in the email I've just sent. We really have this already, and it's called current-preempt. It handles any lock at all, and doesn't add yet another special case to all the architectures. Just do repeat: Alternately, you could just have do_page_fault() do: while (!down_read_trylock(current-mm-mmap_sem)) continue; However, note that this can suffer from starvation due to a never ending flow of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not impossible. There is yet another way, now that I think about it... Another rwsem could be added to mm_struct, one that you have to get immediately before mmap_sem and release immediately after; one that page fault doesn't ever touch... but that isn't very pleasant either:-/ A yet further way would be to make a second kind of rwsem; one that's unfair. But if you do that, a few threads that're swapping a lot, combined with a few tops could starve out another thread that's trying to do an mmap... Actually, there's an added benefit to your suggested method... It doesn't involve changing what's in task_struct or mm_struct and doesn't involve changing the semantics of the mmap_sem lock. That's assuming we can't just make rwsem's nest nicely. rwsems themselves? No, not really. It'd either mean keeing track of which task holds what sort of lock on every rwsem (which'd be nice for debugging, granted; but not something you really want in a normal system), or it'd mean making rwsems unfair - something I think will be a really bad idea. David - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Andrew Morton [EMAIL PROTECTED] wrote: wrt this down_read/down_write/down_read deadlock: iirc, the reason why down_write() takes precedence over down_read() is to avoid the permanent writer starvation which would occur if there is heavy down_read() traffic. down_write() doesn't actually take precedence over down_read() as such. The rwsems try to implement a strict FIFO queue (the fairness thing). David - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 01:30:27PM -0800, Linus Torvalds wrote: We really have this already, and it's called current-preempt. It handles any lock at all, and doesn't add yet another special case to all the architectures. Just do repeat: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. inc_preempt_count(); ret = get_user(...); dec_preempt_count(); Perhaps this should be preempt_disable preempt_enable. Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. Regards, Joe -- Money can buy bandwidth, but latency is forever -- John Mashey - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Joe Korty wrote: Perhaps this should be preempt_disable preempt_enable. No, the problem with preempt_disable/enable is that they go away if preemption is not enabled. So you really do have to do it by hand with the inc_preempt_count. Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. True. I guess we should have a preempt_check_resched() there too. That's what kunmap_atomic() does too (which is what we rely on in the other case we do this..) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 11:39:08AM +, David Howells wrote: Alternately, you could just have do_page_fault() do: while (!down_read_trylock(current-mm-mmap_sem)) continue; However, note that this can suffer from starvation due to a never ending flow of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not impossible. How can this help? The semaphore is held for reading by the thread that faulted in futex_wait() - get_user(), so no writers will be let through. Until the writer has been let through, the down_read_trylock will never succeed either. No forward progress can be made even with the above loop. -Olof - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. True. I guess we should have a preempt_check_resched() there too. That's what kunmap_atomic() does too (which is what we rely on in the other case we do this..) Ok, this is getting complex enough to warrant get_user_inatomic(), which means adding it to every arch's uaccess.h. Below patch does so. Unfortunately I don't have a Viro setup with cross compilers for nearly every arch, so I can't make sure it doesn't break anything. But since I pasted the same code everywhere it shouldn't. - Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson [EMAIL PROTECTED] Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-23 10:55:37.0 -0600 @@ -329,6 +329,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr1, key1); @@ -355,9 +356,17 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + ret = get_user_inatomic(curval, (int __user *)uaddr1); + + if (unlikely(ret)) { + up_read(current-mm-mmap_sem); + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +489,7 @@ int ret, curval; struct futex_q q; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr, q.key); @@ -508,9 +518,21 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + + ret = get_user_inatomic(curval, (int __user *)uaddr); + + if (unlikely(ret)) { + up_read(current-mm-mmap_sem); + + if (!unqueue_me(q)) /* There's a chance we got woken already */ + return 0; + + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-23 10:16:46.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol-policy; - err = -EFAULT; + if (vma) { + up_read(current-mm-mmap_sem); + vma = NULL; + } + if (policy put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { Index: linux-2.5/include/asm-ppc64/uaccess.h === --- linux-2.5.orig/include/asm-ppc64/uaccess.h 2005-02-04 00:27:13.0 -0600 +++ linux-2.5/include/asm-ppc64/uaccess.h 2005-02-23 11:01:44.0 -0600 @@ -217,6 +217,17 @@ : =r(err), =r(x)\ : b(addr), i(errret), 0(err)) + +#define get_user_inatomic(x,ptr) \ +({ \ + int __ret; \ + inc_preempt_count();\ + __ret = __copy_from_user_inatomic((x),(ptr),sizeof(*(ptr))); \ + dec_preempt_count();\ + preempt_check_resched();\ + __ret; \ +}) + /* more
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 2005-02-23 at 11:10 -0600, Olof Johansson wrote: On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. True. I guess we should have a preempt_check_resched() there too. That's what kunmap_atomic() does too (which is what we rely on in the other case we do this..) Ok, this is getting complex enough to warrant get_user_inatomic(), which means adding it to every arch's uaccess.h. Below patch does so. Unfortunately I don't have a Viro setup with cross compilers for nearly every arch, so I can't make sure it doesn't break anything. But since I pasted the same code everywhere it shouldn't. I hate to do this to you, but get_user is a bit horrible in that it is an untyped interface. Fixing it is hard (ugh) but when adding new variants should/could we consider to please make it a typed (eg inline and not a define) interface please ? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. True. I guess we should have a preempt_check_resched() there too. That's what kunmap_atomic() does too (which is what we rely on in the other case we do this..) Ok, this is getting complex enough to warrant get_user_inatomic(), which means adding it to every arch's uaccess.h. Below patch does so. Unfortunately I don't have a Viro setup with cross compilers for nearly every arch, so I can't make sure it doesn't break anything. But since I pasted the same code everywhere it shouldn't. My turn to say uglee. Firstly, get_user_inatomic is the wrong name. inatomic in __copy_from_user_inatomic means it's called inside a non-premptable region (in atomic...). Your macro get_user_inatomic is _not_ called inside a non-preemptable region, so it shouldn't be called inatomic. (A better name is get_user_no_paging). Secondly, does this _one_ use (it's not likely to be used elsewhere) justify copying pasting the same code into every asm-*/uaccess, especially when the code is not in any way arch-specific? I suggest putting it into futex.c, and make it an inline function which takes u32 __user *. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Jamie Lokier wrote: I suggest putting it into futex.c, and make it an inline function which takes u32 __user *. Agreed, except we've traditionally just made it int __user *. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 06:22:04PM +, Jamie Lokier wrote: Olof Johansson wrote: On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. True. I guess we should have a preempt_check_resched() there too. That's what kunmap_atomic() does too (which is what we rely on in the other case we do this..) Ok, this is getting complex enough to warrant get_user_inatomic(), which means adding it to every arch's uaccess.h. Below patch does so. Unfortunately I don't have a Viro setup with cross compilers for nearly every arch, so I can't make sure it doesn't break anything. But since I pasted the same code everywhere it shouldn't. My turn to say uglee. Yeah, I wasn't entirely happy about it, but it seems that suggestions are coming in on how to do it better. :-) Firstly, get_user_inatomic is the wrong name. inatomic in __copy_from_user_inatomic means it's called inside a non-premptable region (in atomic...). Your macro get_user_inatomic is _not_ called inside a non-preemptable region, so it shouldn't be called inatomic. (A better name is get_user_no_paging). Secondly, does this _one_ use (it's not likely to be used elsewhere) justify copying pasting the same code into every asm-*/uaccess, especially when the code is not in any way arch-specific? Arjan suggested creating a linux/uaccess.h that includes asm/uaccess.h, and start moving the users over since that's where the trend is moving anyway (avoiding including asm/* from common code). futex.c would be the first user, and could be followed by more later as a janitorial patch. That'd mean only one addition of the common function instead of having to add it to every arch. I suggest putting it into futex.c, and make it an inline function which takes u32 __user *. Sure, for now that's good enough. Above janitorial work could be done later, if more users get introduced. -Olof - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson [EMAIL PROTECTED] wrote: Alternately, you could just have do_page_fault() do: while (!down_read_trylock(current-mm-mmap_sem)) continue; However, note that this can suffer from starvation due to a never ending flow of mixed write-locks and read-locks on other CPUs. Unlikely, true, but not impossible. How can this help? The semaphore is held for reading by the thread that faulted in futex_wait() - get_user(), so no writers will be let through. Until the writer has been let through, the down_read_trylock will never succeed either. No forward progress can be made even with the above loop. You're right. The writers would have to spin instead. David - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds wrote: I suggest putting it into futex.c, and make it an inline function which takes u32 __user *. Agreed, except we've traditionally just made it int __user *. The type signatures in futex.c are a bit mixed up - most places say int __user * but sys_futex() says u32 __user *. get_futex_key uses sizeof(u32) to check the address. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, Feb 23, 2005 at 06:49:46PM +, Jamie Lokier wrote: Linus Torvalds wrote: I suggest putting it into futex.c, and make it an inline function which takes u32 __user *. Agreed, except we've traditionally just made it int __user *. The type signatures in futex.c are a bit mixed up - most places say int __user * but sys_futex() says u32 __user *. get_futex_key uses sizeof(u32) to check the address. How's this? I went with get_val_no_fault(), since it isn't really a get_user.*() any more (ptr being passed in), and no_paging is a little misleading (not all faults are due to paging). - Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson [EMAIL PROTECTED] Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-23 13:10:16.0 -0600 @@ -258,6 +258,18 @@ } } +static inline int get_val_no_fault(int *dest, int __user *from) +{ + int ret; + + inc_preempt_count(); + ret = __copy_from_user_inatomic(dest, from, sizeof(int)); + dec_preempt_count(); + preempt_check_resched(); + + return ret; +} + /* * The hash bucket lock must be held when this is called. * Afterwards, the futex_q must not be accessed. @@ -329,6 +341,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr1, key1); @@ -355,9 +368,20 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + ret = get_val_no_fault(curval, (int __user *)uaddr1); + + if (unlikely(ret)) { + /* If we would have faulted, release mmap_sem, fault +* it in and start all over again. +*/ + up_read(current-mm-mmap_sem); + + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +504,7 @@ int ret, curval; struct futex_q q; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr, q.key); @@ -508,9 +533,23 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + + ret = get_val_no_fault(curval, (int __user *)uaddr); + + if (unlikely(ret)) { + /* If we would have faulted, release mmap_sem, fault it in and +* start all over again. +*/ + up_read(current-mm-mmap_sem); + + if (!unqueue_me(q)) /* There's a chance we got woken already */ + return 0; + + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-23 12:53:22.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol-policy; - err = -EFAULT; + if (vma) { + up_read(current-mm-mmap_sem); + vma = NULL; + } + if (policy put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Olof Johansson wrote: How's this? I went with get_val_no_fault(), since it isn't really a get_user.*() any more (ptr being passed in), and no_paging is a little misleading (not all faults are due to paging). Applied with minor cosmetic changes. I'm like a dog who likes to pee on things to show his territory, so I changed get_val_no_fault to get_futex_value_locked, and I made sure that the return value is sensible (return 0 or -EFAULT rather than the __memcpy_from_user() return value which is how many bytes we couldn't copy). Not that we care (we just check the return value against zero anyway, which is success in both cases), but the compiler should be able to optimize it away, and it might avoid some confusion down the line.. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: How's this? I went with get_val_no_fault(), since it isn't really a get_user.*() any more (ptr being passed in), and no_paging is a little misleading (not all faults are due to paging). How ironic: I deliberately didn't choose no_fault because that function *does* take page faults. It only disables paging operations! :) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 03:20:27PM -0800, Andrew Morton wrote: > [EMAIL PROTECTED] (Olof Johansson) wrote: > > > > + inc_preempt_count(); > > + ret = get_user(curval, (int __user *)uaddr1); > > + dec_preempt_count(); > > That _should_ generate a might_sleep() warning, except it looks like we > forgot to add a check to get_user(). > > It would be better to use __copy_from_user_inatomic() here, I think. Thanks for catching that. New rev below. - Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson <[EMAIL PROTECTED]> Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-22 17:20:22.0 -0600 @@ -329,6 +329,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr1, ); @@ -355,9 +356,19 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + inc_preempt_count(); + ret = __copy_from_user_inatomic(, (int __user *)uaddr1, sizeof(int)); + dec_preempt_count(); + + if (unlikely(ret)) { + up_read(>mm->mmap_sem); + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +491,7 @@ int ret, curval; struct futex_q q; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr, ); @@ -508,9 +520,21 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + inc_preempt_count(); + ret = __copy_from_user_inatomic(, (int __user *)uaddr, sizeof(int)); + dec_preempt_count(); + if (unlikely(ret)) { + up_read(>mm->mmap_sem); + + if (!unqueue_me()) /* There's a chance we got woken already */ + return 0; + + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol->policy; - err = -EFAULT; + if (vma) { + up_read(>mm->mmap_sem); + vma = NULL; + } + if (policy && put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
[EMAIL PROTECTED] (Olof Johansson) wrote: > > + inc_preempt_count(); > + ret = get_user(curval, (int __user *)uaddr1); > + dec_preempt_count(); That _should_ generate a might_sleep() warning, except it looks like we forgot to add a check to get_user(). It would be better to use __copy_from_user_inatomic() here, I think. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 02:10:58PM -0800, Linus Torvalds wrote: > > Oh, well. The reason I hate the rwsem behaviour is exactly because it > results in this very subtle class of deadlocks. This one case is certainly > solvable several ways, but do we have other issues somewhere else? Things > like kobject might be ripe with things like this. The mm semaphore tends > to be pretty well-behaved - and I'm not sure the same is true of the > kobject one. I'm trying to get rid of the kobject (actually the subsystem) rwsem right now, so it should be gone completly within a few kernel versions. thanks, greg k-h - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 10:34:57PM +, Jamie Lokier wrote: > There is one small but important error: the "return ret" mustn't just > return. It must call unqueue_me() just like the code at out_unqueue, > _including_ the conditional "ret = 0", but _excluding_ the up_read(). Not only that, but someone might already have dequeued us, right? It's probably a pathological case though, i.e. someone did a wake on the same (bad) address. How's this patch? It's closer to Linus' pseudo-code than Andrew's, to avoid the extra get_user() at function entry and keep the common case path short. It also includes the feedback from Andrew on the sys_get_mempolicy(), making the patch even simpler there. Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson <[EMAIL PROTECTED]> Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-22 16:38:24.0 -0600 @@ -329,6 +329,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr1, ); @@ -355,9 +356,19 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + inc_preempt_count(); + ret = get_user(curval, (int __user *)uaddr1); + dec_preempt_count(); + + if (unlikely(ret)) { + up_read(>mm->mmap_sem); + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +491,7 @@ int ret, curval; struct futex_q q; + retry: down_read(>mm->mmap_sem); ret = get_futex_key(uaddr, ); @@ -508,9 +520,21 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + inc_preempt_count(); + ret = get_user(curval, (int __user *)uaddr); + dec_preempt_count(); + if (unlikely(ret)) { + up_read(>mm->mmap_sem); + + if (!unqueue_me()) /* There's a chance we got woken already */ + return 0; + + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol->policy; - err = -EFAULT; + if (vma) { + up_read(>mm->mmap_sem); + vma = NULL; + } + if (policy && put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds wrote: > > queue_me(...) etc. > > current->flags |= PF_MMAP_SEM; <- new > > ret = get_user(...); > > current->flags &= PF_MMAP_SEM; <- new > > /* the rest */ > > That is uglee. > > We really have this already, and it's called "current->preempt". It > handles any lock at all, and doesn't add yet another special case to all > the architectures. Ooh, I didn't know current->preempt did that (been away). > repeat: > down_read(>mm->mmap_sem); > get_futex_key(...) etc. > queue_me(...) etc. > inc_preempt_count(); > ret = get_user(...); > dec_preempt_count(); > if (unlikely(ret)) { > up_read(>mm->mmap_sem); > /* Re-do the access outside the lock */ > ret = get_user(...); > if (!ret) > goto repeat; > return ret; > } That would work. I like it. :) Page faults will enter the fault handler twice (i.e. slower), but that's not really a disadvantage, because a program always references the memory just before calling futex_wait anyway. A fault is rare. There is one small but important error: the "return ret" mustn't just return. It must call unqueue_me() just like the code at out_unqueue, _including_ the conditional "ret = 0", but _excluding_ the up_read(). Alternatively, since it's a rare case, just shuffle the loop around: down_read(>mm->mmap_sem); repeat: get_futex_key(...) etc. queue_me(...) etc. inc_preempt_count(); ret = get_user(...); dec_preempt_count(); if (unlikely(ret)) { up_read(>mm->mmap_sem); /* Re-do the access outside the lock */ ret = get_user(...); down_read(>mm->mmap_sem); if (!ret) goto repeat; goto out_unqueue; } -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 2005-02-22 at 14:10 -0800, Linus Torvalds wrote: > Oh, well. The reason I hate the rwsem behaviour is exactly because it > results in this very subtle class of deadlocks. This one case is certainly > solvable several ways, but do we have other issues somewhere else? Things > like kobject might be ripe with things like this. The mm semaphore tends > to be pretty well-behaved - and I'm not sure the same is true of the > kobject one. We could detect those tho. When the appropriate DEBUG option is set, by storing a cpumask in the semaphore we could detect if it's already taken on this cpu... > Normal recursive deadlocks are wonderful - most of them show up > immediately, so assuming you just have enough coverage, you're fine. This > fairness-related deadlock requires a race to happen. Unless you consider that taking the read semaphore twice on the same CPU is always a bug, thus the above stuff would work for catching them at least more often... > Maybe it would be sufficient to have a debugging version of rwsems that > just notice recursion? > > Linus -- Benjamin Herrenschmidt <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: > > That won't work because the vma lock must be help between key > > calculation and get_user() - otherwise futex is not reliable. It > > would work if the futex key calculation was inside the loop. > > Sure, but that's still true: It's just that the get_user() is done twice > instead. The semaphore is never released between the key calculation and > the "real" get_user(). Ah, I didn't look at where the loop is used and didn't think there'd be _two_ get_user() calls in the fast case. Not my instinct. > > A much simpler solution (and sorry for not offering it earlier, > > because Andrew Morton pointed out this bug long ago, but I was busy), is: > > Either way works for me. Andrew/Linus, got a preference? I'll either > post my refresh based on Andrews comments, or code up Jamie's > suggestion. Yours has a couple of problems. 1. It'll make futex waits somewhat slower. One of the nicer features of 2.6 futexes is that we got rid of the explicit page table lookup. 2. It's broken because a page can be paged out by another thread after you've forced it in and before the get_user(). We only take mmap_sem, not the page table lock. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote: > > Yours is probably the most efficient too. Note sure what is best for > rwsems tho, there seem to be some interest preventing readers from > starving writers for ever, this has been debated endlessly iirc, > though I have no personal opinion there. Yes, the starvation issue is potentially real. And thinking about it, we've even had that in real life, with /proc and lots of page faults. So I guess that's a strong argument for the fairness thing. Oh, well. The reason I hate the rwsem behaviour is exactly because it results in this very subtle class of deadlocks. This one case is certainly solvable several ways, but do we have other issues somewhere else? Things like kobject might be ripe with things like this. The mm semaphore tends to be pretty well-behaved - and I'm not sure the same is true of the kobject one. Normal recursive deadlocks are wonderful - most of them show up immediately, so assuming you just have enough coverage, you're fine. This fairness-related deadlock requires a race to happen. Maybe it would be sufficient to have a debugging version of rwsems that just notice recursion? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 22 Feb 2005, Andrew Morton wrote: > > However the pte can get unmapped by memory reclaim so we could still take a > minor fault, and hit the same deadlock, yes? You _could_ fix that by getting the pagetable spinlock, I guess. Which check_user_page_readable() assumes you'd be holding anyway (not holding it would appear to be a bug). At which point you might as well just walk the tables by hand and just do the read that way. Of course, then you have virtual aliasing issues etc. Insane, but possible. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 2005-02-22 at 13:31 -0800, Linus Torvalds wrote: > > On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote: > > > > Isn't Olof scheme racy ? Can't the stuff get swapped out between the > > first get_user() and the "real" one ? > > Yes. But see my suggested modification (which I still think is "the thing > that Olof does", except it's more efficient and avoids the race). > > If rwsems acted like rwlocks, we wouldn't have this issue at all. Yours is probably the most efficient too. Note sure what is best for rwsems tho, there seem to be some interest preventing readers from starving writers for ever, this has been debated endlessly iirc, though I have no personal opinion there. Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Jamie Lokier <[EMAIL PROTECTED]> wrote: > > ... > > > > One attempt to fix this is included below. It works, but I'm not entirely > > > happy with the fact that it's a bit messy solution. If anyone has a > > > better idea for how to solve it I'd be all ears. > > > > It's fairly sane. Style-wise I'd be inclined to turn this: > > > > down_read(>mm->mmap_sem); > > while (!check_user_page_readable(current->mm, uaddr1)) { > > up_read(>mm->mmap_sem); > > /* Fault in the page through get_user() but discard result */ > > if (get_user(curval, (int __user *)uaddr1) != 0) > > return -EFAULT; > > down_read(>mm->mmap_sem); > > } > > That won't work because the vma lock must be help between key > calculation and get_user() - otherwise futex is not reliable. It > would work if the futex key calculation was inside the loop. All the above is trying to do is to convert the initial down_read(mmap_sem) into a function which, on exit, guarantees that a) down_read(mmap_sem) is held and b) the subsequent get_user() of that address will not generate a pagefault. So it shouldn't affect the futex code's atomicity at all. However the pte can get unmapped by memory reclaim so we could still take a minor fault, and hit the same deadlock, yes? > A much simpler solution (and sorry for not offering it earlier, > because Andrew Morton pointed out this bug long ago, but I was busy), is: > > In futex.c: > > down_read(>mm->mmap_sem); > get_futex_key(...) etc. > queue_me(...) etc. > current->flags |= PF_MMAP_SEM; <- new > ret = get_user(...); > current->flags &= PF_MMAP_SEM; <- new > /* the rest */ > > And in arch/*/mm/fault.c, replace every one of these: > > down_read(>mmap_sem); > > up_read(>mmap_sem); > > with these: > > if (!(current & PF_MMAP_SEM)) > down_read(>mmap_sem); > > if (!(current & PF_MMAP_SEM)) > up_read(>mmap_sem); > Yes, that will work. However I do feel that it's cleaner to localise this nastiness into a single function which the futex code calls, rather than spreading it all around and adding overhead to every pagefault. If we can work out how. wrt this down_read/down_write/down_read deadlock: iirc, the reason why down_write() takes precedence over down_read() is to avoid the permanent writer starvation which would occur if there is heavy down_read() traffic. As Linus points out, an alternative would be to do an inc_preempt_count() around the offending get_user(), then use __copy_from_user_inatomic(), then take some sort of remedial action if __copy_from_user_inatomic() returns a fault. Something like: retry: if (get_user(uaddr) == -EFAULT) return -EFAULT; down_read(mmap_sem); inc_preempt_count(); if (__copy_from_user_inatomic(..., uaddr)) { up_read(mmap_sem); dec_preempt_count(); goto retry; } dec_preempt_count(); up_read(mmap_sem); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 22 Feb 2005, Jamie Lokier wrote: > > A much simpler solution (and sorry for not offering it earlier, > because Andrew Morton pointed out this bug long ago, but I was busy), is: > > In futex.c: > > down_read(>mm->mmap_sem); > get_futex_key(...) etc. > queue_me(...) etc. > current->flags |= PF_MMAP_SEM; <- new > ret = get_user(...); > current->flags &= PF_MMAP_SEM; <- new > /* the rest */ That is uglee. We really have this already, and it's called "current->preempt". It handles any lock at all, and doesn't add yet another special case to all the architectures. Just do repeat: down_read(>mm->mmap_sem); get_futex_key(...) etc. queue_me(...) etc. inc_preempt_count(); ret = get_user(...); dec_preempt_count(); if (unlikely(ret)) { up_read(>mm->mmap_sem); /* Re-do the access outside the lock */ ret = get_user(...); if (!ret) goto repeat; return ret; } ... and you should be ok. No new special cases, no new abstractions. At most, we should probably create a "get_user_inatomic()", to - make it damn obvious what we're doing, and match the explicit "inatomic" in the other place where we depend on this (fs/filemap.c) - allow the regular "get_user()" to continue to do the normal "might_sleep()" checks. That's assuming we can't just make rwsem's nest nicely. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote: > > Isn't Olof scheme racy ? Can't the stuff get swapped out between the > first get_user() and the "real" one ? Yes. But see my suggested modification (which I still think is "the thing that Olof does", except it's more efficient and avoids the race). If rwsems acted like rwlocks, we wouldn't have this issue at all. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Chris Friesen wrote: > > down_read(>mm->mmap_sem); > > get_futex_key(...) etc. > > queue_me(...) etc. > > current->flags |= PF_MMAP_SEM; <- new > > ret = get_user(...); > > current->flags &= PF_MMAP_SEM; <- new > > /* the rest */ > > Should the second new line be this (with the inverse)? > > current->flags &= ~PF_MMAP_SEM; Quiet! I was trying to sneak in a security hole! :) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 09:07:52PM +, Jamie Lokier wrote: > > That won't work because the vma lock must be help between key > calculation and get_user() - otherwise futex is not reliable. It > would work if the futex key calculation was inside the loop. Sure, but that's still true: It's just that the get_user() is done twice instead. The semaphore is never released between the key calculation and the "real" get_user(). > A much simpler solution (and sorry for not offering it earlier, > because Andrew Morton pointed out this bug long ago, but I was busy), is: Either way works for me. Andrew/Linus, got a preference? I'll either post my refresh based on Andrews comments, or code up Jamie's suggestion. -Olof - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 2005-02-23 at 08:16 +1100, Benjamin Herrenschmidt wrote: > On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote: > > > DavidH - what's the word on nested read-semaphores like this? Are they > > supposed to work (like nested read-spinlocks), or do we need to do the > > things Olof does? > > Isn't Olof scheme racy ? Can't the stuff get swapped out between the > first get_user() and the "real" one ? Forget it, I missed the check_user_page_readable() guy within the semaphore protection. I didn't know that function ;) Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Jamie Lokier wrote: In futex.c: down_read(>mm->mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current->flags |= PF_MMAP_SEM; <- new ret = get_user(...); current->flags &= PF_MMAP_SEM; <- new /* the rest */ Should the second new line be this (with the inverse)? current->flags &= ~PF_MMAP_SEM; Chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote: > DavidH - what's the word on nested read-semaphores like this? Are they > supposed to work (like nested read-spinlocks), or do we need to do the > things Olof does? Isn't Olof scheme racy ? Can't the stuff get swapped out between the first get_user() and the "real" one ? Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Andrew Morton wrote: > > This will quickly lock up, since the futex_wait code dows a > > down_read(mmap_sem), then a get_user(). > > > > The do_page_fault code on ppc64 (as well as other architectures) needs > > to take the same semaphore for reading. This is all good until the > > second thread comes into play: Its mmap call tries to take the same > > semaphore for writing which causes in the do_page_fault down_read() > > to get stuck. Classic deadlock. > > Yup. Jamie says that the futex code _has_ to hold mmap_sem across the > get_user(). I forget (but could probably locate) the details. It does - the "key" which identifies a futex depends on a vma calculation, and the vma must not change between the calculation and the get_user(). > > One attempt to fix this is included below. It works, but I'm not entirely > > happy with the fact that it's a bit messy solution. If anyone has a > > better idea for how to solve it I'd be all ears. > > It's fairly sane. Style-wise I'd be inclined to turn this: > > down_read(>mm->mmap_sem); > while (!check_user_page_readable(current->mm, uaddr1)) { > up_read(>mm->mmap_sem); > /* Fault in the page through get_user() but discard result */ > if (get_user(curval, (int __user *)uaddr1) != 0) > return -EFAULT; > down_read(>mm->mmap_sem); > } That won't work because the vma lock must be help between key calculation and get_user() - otherwise futex is not reliable. It would work if the futex key calculation was inside the loop. A much simpler solution (and sorry for not offering it earlier, because Andrew Morton pointed out this bug long ago, but I was busy), is: In futex.c: down_read(>mm->mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current->flags |= PF_MMAP_SEM; <- new ret = get_user(...); current->flags &= PF_MMAP_SEM; <- new /* the rest */ And in arch/*/mm/fault.c, replace every one of these: down_read(>mmap_sem); up_read(>mmap_sem); with these: if (!(current & PF_MMAP_SEM)) down_read(>mmap_sem); if (!(current & PF_MMAP_SEM)) up_read(>mmap_sem); -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
[EMAIL PROTECTED] (Olof Johansson) wrote: > > Hi, > > Consider a small testcase that spawns off two threads, either thread > doing a loop of: > > buf = mmap /dev/zero MAP_SHARED for 0x10 bytes > call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in > said mmap > munmap(buf) > repeat > > This will quickly lock up, since the futex_wait code dows a > down_read(mmap_sem), then a get_user(). > > The do_page_fault code on ppc64 (as well as other architectures) needs > to take the same semaphore for reading. This is all good until the > second thread comes into play: Its mmap call tries to take the same > semaphore for writing which causes in the do_page_fault down_read() > to get stuck. Classic deadlock. Yup. Jamie says that the futex code _has_ to hold mmap_sem across the get_user(). I forget (but could probably locate) the details. > > One attempt to fix this is included below. It works, but I'm not entirely > happy with the fact that it's a bit messy solution. If anyone has a > better idea for how to solve it I'd be all ears. It's fairly sane. Style-wise I'd be inclined to turn this: down_read(>mm->mmap_sem); while (!check_user_page_readable(current->mm, uaddr1)) { up_read(>mm->mmap_sem); /* Fault in the page through get_user() but discard result */ if (get_user(curval, (int __user *)uaddr1) != 0) return -EFAULT; down_read(>mm->mmap_sem); } into a standalone helper function. > --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 > +++ linux-2.5/mm/mempolicy.c 2005-02-21 16:43:08.0 -0600 > @@ -486,6 +486,7 @@ > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma = NULL; > struct mempolicy *pol = current->mempolicy; > + DECLARE_BITMAP(nodes, MAX_NUMNODES); > > if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR)) > return -EINVAL; > @@ -524,16 +525,21 @@ > } else > pval = pol->policy; > > + if (nmask) > + get_zonemask(pol, nodes); > + > + if (vma) { > + up_read(>mm->mmap_sem); > + vma = NULL; > + } > + OK. > err = -EFAULT; > if (policy && put_user(pval, policy)) > goto out; > > err = 0; > - if (nmask) { > - DECLARE_BITMAP(nodes, MAX_NUMNODES); > - get_zonemask(pol, nodes); > + if (nmask) > err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes)); > - } > > out: > if (vma) I don't think we need to hold mmap_sem while running get_zonemask(). `pol' is a copy of current->mempolicy, and it won't be going away. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 22 Feb 2005, Olof Johansson wrote: > > Consider a small testcase that spawns off two threads, either thread > doing a loop of: > > buf = mmap /dev/zero MAP_SHARED for 0x10 bytes > call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in > said mmap > munmap(buf) > repeat > > This will quickly lock up, since the futex_wait code dows a > down_read(mmap_sem), then a get_user(). > > The do_page_fault code on ppc64 (as well as other architectures) needs > to take the same semaphore for reading. This is all good until the > second thread comes into play: Its mmap call tries to take the same > semaphore for writing which causes in the do_page_fault down_read() > to get stuck. Classic deadlock. It shouldn't be. If one read writer is active, another should be able to come in, regardless of any pending writer trying to access it. At least that's always been the rule for the rw-spinlocks _exactly_ for this reaseon. The rwsem code tries to be fairer, maybe it has broken this case in the name of fairness. I personally think fairness is overrated, and would rather have the rwsem implementation let readers come in. But if we really don't want that, then: > One attempt to fix this is included below. It works, but I'm not entirely > happy with the fact that it's a bit messy solution. If anyone has a > better idea for how to solve it I'd be all ears. We have a notion of "atomic get_user()" calls, which is a lot more efficient, and is already used for other cases where we have _real_ deadlocks (inode semaphore for reads into shared memory mappigns). See "__copy_to_user_inatomic()", and in particular note that it effectively _is_ legal to do user accesses with the preempt-count elevated or interrupts disabled to tell the page fault handler not to delay (then you have to check the error return, and if it returned an error you do it the slow way - drop all locks and do a non-atomic access). > Auditing other read-takers of mmap_sem, I found one more exposure that > could be solved by just moving code around. DavidH - what's the word on nested read-semaphores like this? Are they supposed to work (like nested read-spinlocks), or do we need to do the things Olof does? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 22 Feb 2005, Olof Johansson wrote: Consider a small testcase that spawns off two threads, either thread doing a loop of: buf = mmap /dev/zero MAP_SHARED for 0x10 bytes call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in said mmap munmap(buf) repeat This will quickly lock up, since the futex_wait code dows a down_read(mmap_sem), then a get_user(). The do_page_fault code on ppc64 (as well as other architectures) needs to take the same semaphore for reading. This is all good until the second thread comes into play: Its mmap call tries to take the same semaphore for writing which causes in the do_page_fault down_read() to get stuck. Classic deadlock. It shouldn't be. If one read writer is active, another should be able to come in, regardless of any pending writer trying to access it. At least that's always been the rule for the rw-spinlocks _exactly_ for this reaseon. The rwsem code tries to be fairer, maybe it has broken this case in the name of fairness. I personally think fairness is overrated, and would rather have the rwsem implementation let readers come in. But if we really don't want that, then: One attempt to fix this is included below. It works, but I'm not entirely happy with the fact that it's a bit messy solution. If anyone has a better idea for how to solve it I'd be all ears. We have a notion of atomic get_user() calls, which is a lot more efficient, and is already used for other cases where we have _real_ deadlocks (inode semaphore for reads into shared memory mappigns). See __copy_to_user_inatomic(), and in particular note that it effectively _is_ legal to do user accesses with the preempt-count elevated or interrupts disabled to tell the page fault handler not to delay (then you have to check the error return, and if it returned an error you do it the slow way - drop all locks and do a non-atomic access). Auditing other read-takers of mmap_sem, I found one more exposure that could be solved by just moving code around. DavidH - what's the word on nested read-semaphores like this? Are they supposed to work (like nested read-spinlocks), or do we need to do the things Olof does? Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
[EMAIL PROTECTED] (Olof Johansson) wrote: Hi, Consider a small testcase that spawns off two threads, either thread doing a loop of: buf = mmap /dev/zero MAP_SHARED for 0x10 bytes call sys_futex (buf+page, FUTEX_WAIT, 1, NULL, NULL) for each page in said mmap munmap(buf) repeat This will quickly lock up, since the futex_wait code dows a down_read(mmap_sem), then a get_user(). The do_page_fault code on ppc64 (as well as other architectures) needs to take the same semaphore for reading. This is all good until the second thread comes into play: Its mmap call tries to take the same semaphore for writing which causes in the do_page_fault down_read() to get stuck. Classic deadlock. Yup. Jamie says that the futex code _has_ to hold mmap_sem across the get_user(). I forget (but could probably locate) the details. One attempt to fix this is included below. It works, but I'm not entirely happy with the fact that it's a bit messy solution. If anyone has a better idea for how to solve it I'd be all ears. It's fairly sane. Style-wise I'd be inclined to turn this: down_read(current-mm-mmap_sem); while (!check_user_page_readable(current-mm, uaddr1)) { up_read(current-mm-mmap_sem); /* Fault in the page through get_user() but discard result */ if (get_user(curval, (int __user *)uaddr1) != 0) return -EFAULT; down_read(current-mm-mmap_sem); } into a standalone helper function. --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c 2005-02-21 16:43:08.0 -0600 @@ -486,6 +486,7 @@ struct mm_struct *mm = current-mm; struct vm_area_struct *vma = NULL; struct mempolicy *pol = current-mempolicy; + DECLARE_BITMAP(nodes, MAX_NUMNODES); if (flags ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR)) return -EINVAL; @@ -524,16 +525,21 @@ } else pval = pol-policy; + if (nmask) + get_zonemask(pol, nodes); + + if (vma) { + up_read(current-mm-mmap_sem); + vma = NULL; + } + OK. err = -EFAULT; if (policy put_user(pval, policy)) goto out; err = 0; - if (nmask) { - DECLARE_BITMAP(nodes, MAX_NUMNODES); - get_zonemask(pol, nodes); + if (nmask) err = copy_nodes_to_user(nmask, maxnode, nodes, sizeof(nodes)); - } out: if (vma) I don't think we need to hold mmap_sem while running get_zonemask(). `pol' is a copy of current-mempolicy, and it won't be going away. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Andrew Morton wrote: This will quickly lock up, since the futex_wait code dows a down_read(mmap_sem), then a get_user(). The do_page_fault code on ppc64 (as well as other architectures) needs to take the same semaphore for reading. This is all good until the second thread comes into play: Its mmap call tries to take the same semaphore for writing which causes in the do_page_fault down_read() to get stuck. Classic deadlock. Yup. Jamie says that the futex code _has_ to hold mmap_sem across the get_user(). I forget (but could probably locate) the details. It does - the key which identifies a futex depends on a vma calculation, and the vma must not change between the calculation and the get_user(). One attempt to fix this is included below. It works, but I'm not entirely happy with the fact that it's a bit messy solution. If anyone has a better idea for how to solve it I'd be all ears. It's fairly sane. Style-wise I'd be inclined to turn this: down_read(current-mm-mmap_sem); while (!check_user_page_readable(current-mm, uaddr1)) { up_read(current-mm-mmap_sem); /* Fault in the page through get_user() but discard result */ if (get_user(curval, (int __user *)uaddr1) != 0) return -EFAULT; down_read(current-mm-mmap_sem); } That won't work because the vma lock must be help between key calculation and get_user() - otherwise futex is not reliable. It would work if the futex key calculation was inside the loop. A much simpler solution (and sorry for not offering it earlier, because Andrew Morton pointed out this bug long ago, but I was busy), is: In futex.c: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current-flags |= PF_MMAP_SEM; - new ret = get_user(...); current-flags = PF_MMAP_SEM; - new /* the rest */ And in arch/*/mm/fault.c, replace every one of these: down_read(mm-mmap_sem); up_read(mm-mmap_sem); with these: if (!(current PF_MMAP_SEM)) down_read(mm-mmap_sem); if (!(current PF_MMAP_SEM)) up_read(mm-mmap_sem); -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote: DavidH - what's the word on nested read-semaphores like this? Are they supposed to work (like nested read-spinlocks), or do we need to do the things Olof does? Isn't Olof scheme racy ? Can't the stuff get swapped out between the first get_user() and the real one ? Ben. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Jamie Lokier wrote: In futex.c: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current-flags |= PF_MMAP_SEM; - new ret = get_user(...); current-flags = PF_MMAP_SEM; - new /* the rest */ Should the second new line be this (with the inverse)? current-flags = ~PF_MMAP_SEM; Chris - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 2005-02-23 at 08:16 +1100, Benjamin Herrenschmidt wrote: On Tue, 2005-02-22 at 11:36 -0800, Linus Torvalds wrote: DavidH - what's the word on nested read-semaphores like this? Are they supposed to work (like nested read-spinlocks), or do we need to do the things Olof does? Isn't Olof scheme racy ? Can't the stuff get swapped out between the first get_user() and the real one ? Forget it, I missed the check_user_page_readable() guy within the semaphore protection. I didn't know that function ;) Ben. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 09:07:52PM +, Jamie Lokier wrote: That won't work because the vma lock must be help between key calculation and get_user() - otherwise futex is not reliable. It would work if the futex key calculation was inside the loop. Sure, but that's still true: It's just that the get_user() is done twice instead. The semaphore is never released between the key calculation and the real get_user(). A much simpler solution (and sorry for not offering it earlier, because Andrew Morton pointed out this bug long ago, but I was busy), is: Either way works for me. Andrew/Linus, got a preference? I'll either post my refresh based on Andrews comments, or code up Jamie's suggestion. -Olof - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Chris Friesen wrote: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current-flags |= PF_MMAP_SEM; - new ret = get_user(...); current-flags = PF_MMAP_SEM; - new /* the rest */ Should the second new line be this (with the inverse)? current-flags = ~PF_MMAP_SEM; Quiet! I was trying to sneak in a security hole! :) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote: Isn't Olof scheme racy ? Can't the stuff get swapped out between the first get_user() and the real one ? Yes. But see my suggested modification (which I still think is the thing that Olof does, except it's more efficient and avoids the race). If rwsems acted like rwlocks, we wouldn't have this issue at all. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 22 Feb 2005, Jamie Lokier wrote: A much simpler solution (and sorry for not offering it earlier, because Andrew Morton pointed out this bug long ago, but I was busy), is: In futex.c: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current-flags |= PF_MMAP_SEM; - new ret = get_user(...); current-flags = PF_MMAP_SEM; - new /* the rest */ That is uglee. We really have this already, and it's called current-preempt. It handles any lock at all, and doesn't add yet another special case to all the architectures. Just do repeat: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. inc_preempt_count(); ret = get_user(...); dec_preempt_count(); if (unlikely(ret)) { up_read(current-mm-mmap_sem); /* Re-do the access outside the lock */ ret = get_user(...); if (!ret) goto repeat; return ret; } ... and you should be ok. No new special cases, no new abstractions. At most, we should probably create a get_user_inatomic(), to - make it damn obvious what we're doing, and match the explicit inatomic in the other place where we depend on this (fs/filemap.c) - allow the regular get_user() to continue to do the normal might_sleep() checks. That's assuming we can't just make rwsem's nest nicely. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Jamie Lokier [EMAIL PROTECTED] wrote: ... One attempt to fix this is included below. It works, but I'm not entirely happy with the fact that it's a bit messy solution. If anyone has a better idea for how to solve it I'd be all ears. It's fairly sane. Style-wise I'd be inclined to turn this: down_read(current-mm-mmap_sem); while (!check_user_page_readable(current-mm, uaddr1)) { up_read(current-mm-mmap_sem); /* Fault in the page through get_user() but discard result */ if (get_user(curval, (int __user *)uaddr1) != 0) return -EFAULT; down_read(current-mm-mmap_sem); } That won't work because the vma lock must be help between key calculation and get_user() - otherwise futex is not reliable. It would work if the futex key calculation was inside the loop. All the above is trying to do is to convert the initial down_read(mmap_sem) into a function which, on exit, guarantees that a) down_read(mmap_sem) is held and b) the subsequent get_user() of that address will not generate a pagefault. So it shouldn't affect the futex code's atomicity at all. However the pte can get unmapped by memory reclaim so we could still take a minor fault, and hit the same deadlock, yes? A much simpler solution (and sorry for not offering it earlier, because Andrew Morton pointed out this bug long ago, but I was busy), is: In futex.c: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current-flags |= PF_MMAP_SEM; - new ret = get_user(...); current-flags = PF_MMAP_SEM; - new /* the rest */ And in arch/*/mm/fault.c, replace every one of these: down_read(mm-mmap_sem); up_read(mm-mmap_sem); with these: if (!(current PF_MMAP_SEM)) down_read(mm-mmap_sem); if (!(current PF_MMAP_SEM)) up_read(mm-mmap_sem); Yes, that will work. However I do feel that it's cleaner to localise this nastiness into a single function which the futex code calls, rather than spreading it all around and adding overhead to every pagefault. If we can work out how. wrt this down_read/down_write/down_read deadlock: iirc, the reason why down_write() takes precedence over down_read() is to avoid the permanent writer starvation which would occur if there is heavy down_read() traffic. As Linus points out, an alternative would be to do an inc_preempt_count() around the offending get_user(), then use __copy_from_user_inatomic(), then take some sort of remedial action if __copy_from_user_inatomic() returns a fault. Something like: retry: if (get_user(uaddr) == -EFAULT) return -EFAULT; down_read(mmap_sem); inc_preempt_count(); if (__copy_from_user_inatomic(..., uaddr)) { up_read(mmap_sem); dec_preempt_count(); goto retry; } dec_preempt_count(); up_read(mmap_sem); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 2005-02-22 at 13:31 -0800, Linus Torvalds wrote: On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote: Isn't Olof scheme racy ? Can't the stuff get swapped out between the first get_user() and the real one ? Yes. But see my suggested modification (which I still think is the thing that Olof does, except it's more efficient and avoids the race). If rwsems acted like rwlocks, we wouldn't have this issue at all. Yours is probably the most efficient too. Note sure what is best for rwsems tho, there seem to be some interest preventing readers from starving writers for ever, this has been debated endlessly iirc, though I have no personal opinion there. Ben. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 22 Feb 2005, Andrew Morton wrote: However the pte can get unmapped by memory reclaim so we could still take a minor fault, and hit the same deadlock, yes? You _could_ fix that by getting the pagetable spinlock, I guess. Which check_user_page_readable() assumes you'd be holding anyway (not holding it would appear to be a bug). At which point you might as well just walk the tables by hand and just do the read that way. Of course, then you have virtual aliasing issues etc. Insane, but possible. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: That won't work because the vma lock must be help between key calculation and get_user() - otherwise futex is not reliable. It would work if the futex key calculation was inside the loop. Sure, but that's still true: It's just that the get_user() is done twice instead. The semaphore is never released between the key calculation and the real get_user(). Ah, I didn't look at where the loop is used and didn't think there'd be _two_ get_user() calls in the fast case. Not my instinct. A much simpler solution (and sorry for not offering it earlier, because Andrew Morton pointed out this bug long ago, but I was busy), is: Either way works for me. Andrew/Linus, got a preference? I'll either post my refresh based on Andrews comments, or code up Jamie's suggestion. Yours has a couple of problems. 1. It'll make futex waits somewhat slower. One of the nicer features of 2.6 futexes is that we got rid of the explicit page table lookup. 2. It's broken because a page can be paged out by another thread after you've forced it in and before the get_user(). We only take mmap_sem, not the page table lock. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Wed, 23 Feb 2005, Benjamin Herrenschmidt wrote: Yours is probably the most efficient too. Note sure what is best for rwsems tho, there seem to be some interest preventing readers from starving writers for ever, this has been debated endlessly iirc, though I have no personal opinion there. Yes, the starvation issue is potentially real. And thinking about it, we've even had that in real life, with /proc and lots of page faults. So I guess that's a strong argument for the fairness thing. Oh, well. The reason I hate the rwsem behaviour is exactly because it results in this very subtle class of deadlocks. This one case is certainly solvable several ways, but do we have other issues somewhere else? Things like kobject might be ripe with things like this. The mm semaphore tends to be pretty well-behaved - and I'm not sure the same is true of the kobject one. Normal recursive deadlocks are wonderful - most of them show up immediately, so assuming you just have enough coverage, you're fine. This fairness-related deadlock requires a race to happen. Maybe it would be sufficient to have a debugging version of rwsems that just notice recursion? Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, 2005-02-22 at 14:10 -0800, Linus Torvalds wrote: Oh, well. The reason I hate the rwsem behaviour is exactly because it results in this very subtle class of deadlocks. This one case is certainly solvable several ways, but do we have other issues somewhere else? Things like kobject might be ripe with things like this. The mm semaphore tends to be pretty well-behaved - and I'm not sure the same is true of the kobject one. We could detect those tho. When the appropriate DEBUG option is set, by storing a cpumask in the semaphore we could detect if it's already taken on this cpu... Normal recursive deadlocks are wonderful - most of them show up immediately, so assuming you just have enough coverage, you're fine. This fairness-related deadlock requires a race to happen. Unless you consider that taking the read semaphore twice on the same CPU is always a bug, thus the above stuff would work for catching them at least more often... Maybe it would be sufficient to have a debugging version of rwsems that just notice recursion? Linus -- Benjamin Herrenschmidt [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds wrote: queue_me(...) etc. current-flags |= PF_MMAP_SEM; - new ret = get_user(...); current-flags = PF_MMAP_SEM; - new /* the rest */ That is uglee. We really have this already, and it's called current-preempt. It handles any lock at all, and doesn't add yet another special case to all the architectures. Ooh, I didn't know current-preempt did that (been away). repeat: down_read(current-mm-mmap_sem); get_futex_key(...) etc. queue_me(...) etc. inc_preempt_count(); ret = get_user(...); dec_preempt_count(); if (unlikely(ret)) { up_read(current-mm-mmap_sem); /* Re-do the access outside the lock */ ret = get_user(...); if (!ret) goto repeat; return ret; } That would work. I like it. :) Page faults will enter the fault handler twice (i.e. slower), but that's not really a disadvantage, because a program always references the memory just before calling futex_wait anyway. A fault is rare. There is one small but important error: the return ret mustn't just return. It must call unqueue_me(q) just like the code at out_unqueue, _including_ the conditional ret = 0, but _excluding_ the up_read(). Alternatively, since it's a rare case, just shuffle the loop around: down_read(current-mm-mmap_sem); repeat: get_futex_key(...) etc. queue_me(...) etc. inc_preempt_count(); ret = get_user(...); dec_preempt_count(); if (unlikely(ret)) { up_read(current-mm-mmap_sem); /* Re-do the access outside the lock */ ret = get_user(...); down_read(current-mm-mmap_sem); if (!ret) goto repeat; goto out_unqueue; } -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 10:34:57PM +, Jamie Lokier wrote: There is one small but important error: the return ret mustn't just return. It must call unqueue_me(q) just like the code at out_unqueue, _including_ the conditional ret = 0, but _excluding_ the up_read(). Not only that, but someone might already have dequeued us, right? It's probably a pathological case though, i.e. someone did a wake on the same (bad) address. How's this patch? It's closer to Linus' pseudo-code than Andrew's, to avoid the extra get_user() at function entry and keep the common case path short. It also includes the feedback from Andrew on the sys_get_mempolicy(), making the patch even simpler there. Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson [EMAIL PROTECTED] Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-22 16:38:24.0 -0600 @@ -329,6 +329,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr1, key1); @@ -355,9 +356,19 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + inc_preempt_count(); + ret = get_user(curval, (int __user *)uaddr1); + dec_preempt_count(); + + if (unlikely(ret)) { + up_read(current-mm-mmap_sem); + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +491,7 @@ int ret, curval; struct futex_q q; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr, q.key); @@ -508,9 +520,21 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + inc_preempt_count(); + ret = get_user(curval, (int __user *)uaddr); + dec_preempt_count(); + if (unlikely(ret)) { + up_read(current-mm-mmap_sem); + + if (!unqueue_me(q)) /* There's a chance we got woken already */ + return 0; + + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol-policy; - err = -EFAULT; + if (vma) { + up_read(current-mm-mmap_sem); + vma = NULL; + } + if (policy put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 02:10:58PM -0800, Linus Torvalds wrote: Oh, well. The reason I hate the rwsem behaviour is exactly because it results in this very subtle class of deadlocks. This one case is certainly solvable several ways, but do we have other issues somewhere else? Things like kobject might be ripe with things like this. The mm semaphore tends to be pretty well-behaved - and I'm not sure the same is true of the kobject one. I'm trying to get rid of the kobject (actually the subsystem) rwsem right now, so it should be gone completly within a few kernel versions. thanks, greg k-h - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
[EMAIL PROTECTED] (Olof Johansson) wrote: + inc_preempt_count(); + ret = get_user(curval, (int __user *)uaddr1); + dec_preempt_count(); That _should_ generate a might_sleep() warning, except it looks like we forgot to add a check to get_user(). It would be better to use __copy_from_user_inatomic() here, I think. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
On Tue, Feb 22, 2005 at 03:20:27PM -0800, Andrew Morton wrote: [EMAIL PROTECTED] (Olof Johansson) wrote: + inc_preempt_count(); + ret = get_user(curval, (int __user *)uaddr1); + dec_preempt_count(); That _should_ generate a might_sleep() warning, except it looks like we forgot to add a check to get_user(). It would be better to use __copy_from_user_inatomic() here, I think. Thanks for catching that. New rev below. - Some futex functions do get_user calls while holding mmap_sem for reading. If get_user() faults, and another thread happens to be in mmap (or somewhere else holding waiting on down_write for the same semaphore), then do_page_fault will deadlock. Most architectures seem to be exposed to this. To avoid it, make sure the page is available. If not, release the semaphore, fault it in and retry. I also found another exposure by inspection, moving some of the code around avoids the possible deadlock there. Signed-off-by: Olof Johansson [EMAIL PROTECTED] Index: linux-2.5/kernel/futex.c === --- linux-2.5.orig/kernel/futex.c 2005-02-21 16:09:38.0 -0600 +++ linux-2.5/kernel/futex.c2005-02-22 17:20:22.0 -0600 @@ -329,6 +329,7 @@ int ret, drop_count = 0; unsigned int nqueued; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr1, key1); @@ -355,9 +356,19 @@ before *uaddr1. */ smp_mb(); - if (get_user(curval, (int __user *)uaddr1) != 0) { - ret = -EFAULT; - goto out; + inc_preempt_count(); + ret = __copy_from_user_inatomic(curval, (int __user *)uaddr1, sizeof(int)); + dec_preempt_count(); + + if (unlikely(ret)) { + up_read(current-mm-mmap_sem); + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr1); + + if (!ret) + goto retry; + + return ret; } if (curval != *valp) { ret = -EAGAIN; @@ -480,6 +491,7 @@ int ret, curval; struct futex_q q; + retry: down_read(current-mm-mmap_sem); ret = get_futex_key(uaddr, q.key); @@ -508,9 +520,21 @@ * We hold the mmap semaphore, so the mapping cannot have changed * since we looked it up in get_futex_key. */ - if (get_user(curval, (int __user *)uaddr) != 0) { - ret = -EFAULT; - goto out_unqueue; + inc_preempt_count(); + ret = __copy_from_user_inatomic(curval, (int __user *)uaddr, sizeof(int)); + dec_preempt_count(); + if (unlikely(ret)) { + up_read(current-mm-mmap_sem); + + if (!unqueue_me(q)) /* There's a chance we got woken already */ + return 0; + + /* Re-do the access outside the lock */ + ret = get_user(curval, (int __user *)uaddr); + + if (!ret) + goto retry; + return ret; } if (curval != val) { ret = -EWOULDBLOCK; Index: linux-2.5/mm/mempolicy.c === --- linux-2.5.orig/mm/mempolicy.c 2005-02-04 00:27:40.0 -0600 +++ linux-2.5/mm/mempolicy.c2005-02-22 14:34:19.0 -0600 @@ -524,9 +524,13 @@ } else pval = pol-policy; - err = -EFAULT; + if (vma) { + up_read(current-mm-mmap_sem); + vma = NULL; + } + if (policy put_user(pval, policy)) - goto out; + return -EFAULT; err = 0; if (nmask) { - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/