Re: Memory reserves or lack thereof
On 11/13/2012 05:54, Konstantin Belousov wrote:
On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:

This patch still doesn't address the issue of M_NOWAIT calls driving the memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.

This is out of scope of the change. But it is required for any further adjustments.

I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify "dig deep". Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, "This is out of scope ...", I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change.

I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, what I meant by 'out of scope' is tweaking the 'two pages reserve' in some way.

Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed man page update to malloc(9):

Index: share/man/man9/malloc.9
===================================================================
--- share/man/man9/malloc.9	(revision 243091)
+++ share/man/man9/malloc.9	(working copy)
@@ -29,7 +29,7 @@
 .\" $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $
 .\" $FreeBSD$
 .\"
-.Dd January 28, 2012
+.Dd November 15, 2012
 .Dt MALLOC 9
 .Os
 .Sh NAME
@@ -153,13 +153,12 @@ if
 .Dv M_WAITOK
 is specified.
 .It Dv M_USE_RESERVE
-Indicates that the system can dig into its reserve in order to obtain the
-requested memory.
-This option used to be called
-.Dv M_KERNEL
-but has been renamed to something more obvious.
-This option has been deprecated and is slowly being removed from the kernel,
-and so should not be used with any new programming.
+Indicates that the system can use its reserve of memory to satisfy the
+request.
+This option should only be used in combination with
+.Dv M_NOWAIT
+when an allocation failure cannot be tolerated by the caller without
+catastrophic effects on the system.
 .El
 .Pp
 Exactly one of either

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Memory reserves or lack thereof
On Thu, Nov 15, 2012 at 11:32:18AM -0600, Alan Cox wrote:
On 11/13/2012 05:54, Konstantin Belousov wrote:
On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:

This patch still doesn't address the issue of M_NOWAIT calls driving the memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.

This is out of scope of the change. But it is required for any further adjustments.

I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify "dig deep". Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, "This is out of scope ...", I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change.

I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, what I meant by 'out of scope' is tweaking the 'two pages reserve' in some way.

Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed man page update to malloc(9):

Index: share/man/man9/malloc.9
===================================================================
--- share/man/man9/malloc.9	(revision 243091)
+++ share/man/man9/malloc.9	(working copy)
@@ -29,7 +29,7 @@
 .\" $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $
 .\" $FreeBSD$
 .\"
-.Dd January 28, 2012
+.Dd November 15, 2012
 .Dt MALLOC 9
 .Os
 .Sh NAME
@@ -153,13 +153,12 @@ if
 .Dv M_WAITOK
 is specified.
 .It Dv M_USE_RESERVE
-Indicates that the system can dig into its reserve in order to obtain the
-requested memory.
-This option used to be called
-.Dv M_KERNEL
-but has been renamed to something more obvious.
-This option has been deprecated and is slowly being removed from the kernel,
-and so should not be used with any new programming.
+Indicates that the system can use its reserve of memory to satisfy the
+request.
+This option should only be used in combination with
+.Dv M_NOWAIT
+when an allocation failure cannot be tolerated by the caller without
+catastrophic effects on the system.
 .El
 .Pp
 Exactly one of either

The text looks fine. Shouldn't the requirement for M_USE_RESERVE also be expressed in a KASSERT, like this:

diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index d9e4692..f8a4f70 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -353,6 +351,9 @@ malloc2vm_flags(int malloc_flags)
 {
 	int pflags;
 
+	KASSERT((malloc_flags & M_USE_RESERVE) == 0 ||
+	    (malloc_flags & M_NOWAIT) != 0,
+	    ("M_USE_RESERVE requires M_NOWAIT"));
 	pflags = (malloc_flags & M_USE_RESERVE) != 0 ? VM_ALLOC_INTERRUPT :
 	    VM_ALLOC_SYSTEM;
 	if ((malloc_flags & M_ZERO) != 0)

I understand that this could be added at the allocator's entry points, but I think that the page allocations are fine too.
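The rule that the proposed KASSERT encodes can be tried out in userland. What follows is only a minimal sketch, not the kernel code: the numeric flag values and the VM_ALLOC_* constants are placeholders chosen for illustration, flags_valid() is a hypothetical helper standing in for the KASSERT condition, and sketch_malloc2vm_class() is a hypothetical reduction of malloc2vm_flags() to the one decision discussed in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values for illustration only; not the kernel's definitions. */
#define M_NOWAIT	0x0001
#define M_WAITOK	0x0002
#define M_USE_RESERVE	0x0400

enum { VM_ALLOC_SYSTEM = 1, VM_ALLOC_INTERRUPT = 2 };

/* Hypothetical helper: the condition the proposed KASSERT checks. */
static bool
flags_valid(int malloc_flags)
{
	return ((malloc_flags & M_USE_RESERVE) == 0 ||
	    (malloc_flags & M_NOWAIT) != 0);
}

/*
 * Hypothetical reduction of the mapping in the patch: only
 * M_USE_RESERVE selects the class that may dig into the reserve.
 */
static int
sketch_malloc2vm_class(int malloc_flags)
{
	assert(flags_valid(malloc_flags));
	return ((malloc_flags & M_USE_RESERVE) != 0 ?
	    VM_ALLOC_INTERRUPT : VM_ALLOC_SYSTEM);
}
```

Under these assumptions, a plain M_NOWAIT request maps to VM_ALLOC_SYSTEM, M_NOWAIT | M_USE_RESERVE maps to VM_ALLOC_INTERRUPT, and M_WAITOK | M_USE_RESERVE trips the assertion.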
Re: Memory reserves or lack thereof
On 11/15/2012 12:21, Konstantin Belousov wrote:
On Thu, Nov 15, 2012 at 11:32:18AM -0600, Alan Cox wrote:
On 11/13/2012 05:54, Konstantin Belousov wrote:
On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:

This patch still doesn't address the issue of M_NOWAIT calls driving the memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.

This is out of scope of the change. But it is required for any further adjustments.

I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify "dig deep". Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, "This is out of scope ...", I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change.

I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, what I meant by 'out of scope' is tweaking the 'two pages reserve' in some way.

Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed man page update to malloc(9):

Index: share/man/man9/malloc.9
===================================================================
--- share/man/man9/malloc.9	(revision 243091)
+++ share/man/man9/malloc.9	(working copy)
@@ -29,7 +29,7 @@
 .\" $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $
 .\" $FreeBSD$
 .\"
-.Dd January 28, 2012
+.Dd November 15, 2012
 .Dt MALLOC 9
 .Os
 .Sh NAME
@@ -153,13 +153,12 @@ if
 .Dv M_WAITOK
 is specified.
 .It Dv M_USE_RESERVE
-Indicates that the system can dig into its reserve in order to obtain the
-requested memory.
-This option used to be called
-.Dv M_KERNEL
-but has been renamed to something more obvious.
-This option has been deprecated and is slowly being removed from the kernel,
-and so should not be used with any new programming.
+Indicates that the system can use its reserve of memory to satisfy the
+request.
+This option should only be used in combination with
+.Dv M_NOWAIT
+when an allocation failure cannot be tolerated by the caller without
+catastrophic effects on the system.
 .El
 .Pp
 Exactly one of either

The text looks fine. Shouldn't the requirement for M_USE_RESERVE also be expressed in a KASSERT, like this:

diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index d9e4692..f8a4f70 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -353,6 +351,9 @@ malloc2vm_flags(int malloc_flags)
 {
 	int pflags;
 
+	KASSERT((malloc_flags & M_USE_RESERVE) == 0 ||
+	    (malloc_flags & M_NOWAIT) != 0,
+	    ("M_USE_RESERVE requires M_NOWAIT"));
 	pflags = (malloc_flags & M_USE_RESERVE) != 0 ? VM_ALLOC_INTERRUPT :
 	    VM_ALLOC_SYSTEM;
 	if ((malloc_flags & M_ZERO) != 0)

I understand that this could be added at the allocator's entry points, but I think that the page allocations are fine too.

Yes, please do that.

Alan
Re: Memory reserves or lack thereof
On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:

This patch still doesn't address the issue of M_NOWAIT calls driving the memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.

This is out of scope of the change. But it is required for any further adjustments.

I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify "dig deep". Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, "This is out of scope ...", I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change.

I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, what I meant by 'out of scope' is tweaking the 'two pages reserve' in some way.
Re: Memory reserves or lack thereof
On 11/12/2012 11:35, Alan Cox wrote:
On 11/12/2012 07:36, Konstantin Belousov wrote:
On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:
On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:

I have a memory subsystem design question that I'm hoping someone can answer.

I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.

Most allocators come down to vm_page_alloc(), which has this guard:

    if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
            page_req = VM_ALLOC_SYSTEM;
    };

    if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
        (page_req == VM_ALLOC_SYSTEM &&
        cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
        (page_req == VM_ALLOC_INTERRUPT &&
        cnt.v_free_count + cnt.v_cache_count > 0)) {

The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.

From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:

    if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
            pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
    else
            pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;

Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.

What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.

There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.

It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.

Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.

Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it is ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.

In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says "dig deep into the cache/free page queues". The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.

Alan

P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?

I am almost sure that most people who put the M_NOWAIT flag do not know about the 'allow the deeper drain of free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectation of problematic places would be in the swapout path.

I found a single explicit use of M_USE_RESERVE in the kernel, so the flip is relatively simple.
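As a rough illustration of the guard and the flag mapping quoted in this message, here is a userland sketch in C. It is not the kernel code: struct page_counts, may_allocate() and old_page_class() are hypothetical names, the flag values are placeholders, and each comparison in the guard is assumed to be a strict greater-than.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values for illustration only; not the kernel's definitions. */
#define M_NOWAIT	0x0001
#define M_USE_RESERVE	0x0400

enum page_req { VM_ALLOC_NORMAL, VM_ALLOC_SYSTEM, VM_ALLOC_INTERRUPT };

/* Hypothetical stand-in for the counters the guard reads. */
struct page_counts {
	unsigned free_count;
	unsigned cache_count;
	unsigned free_reserved;		/* floor for ordinary requests */
	unsigned interrupt_free_min;	/* floor for VM_ALLOC_SYSTEM */
};

/* True when the guard would let a request of this class proceed. */
static bool
may_allocate(const struct page_counts *c, enum page_req req)
{
	unsigned avail;

	assert(c != NULL);
	avail = c->free_count + c->cache_count;
	return (avail > c->free_reserved ||
	    (req == VM_ALLOC_SYSTEM && avail > c->interrupt_free_min) ||
	    (req == VM_ALLOC_INTERRUPT && avail > 0));
}

/* The mapping described in the original post: bare M_NOWAIT digs deepest. */
static enum page_req
old_page_class(int flags)
{
	return ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT ?
	    VM_ALLOC_INTERRUPT : VM_ALLOC_SYSTEM);
}
```

With one page left and an interrupt_free_min of two, only a VM_ALLOC_INTERRUPT request succeeds in this sketch, and under the mapping above that is the class every bare M_NOWAIT allocation receives.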
Re: Memory reserves or lack thereof
Hey, great catch!

adrian

On 13 November 2012 12:04, Alan Cox a...@rice.edu wrote:
On 11/12/2012 11:35, Alan Cox wrote:
On 11/12/2012 07:36, Konstantin Belousov wrote:
On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:
On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:

I have a memory subsystem design question that I'm hoping someone can answer.

I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.

Most allocators come down to vm_page_alloc(), which has this guard:

    if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
            page_req = VM_ALLOC_SYSTEM;
    };

    if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
        (page_req == VM_ALLOC_SYSTEM &&
        cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
        (page_req == VM_ALLOC_INTERRUPT &&
        cnt.v_free_count + cnt.v_cache_count > 0)) {

The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.

From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:

    if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
            pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
    else
            pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;

Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.

What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.

There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.

It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.

Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.

Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it is ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.

In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says "dig deep into the cache/free page queues". The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.

Alan

P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?

I am almost sure that most people who put the M_NOWAIT flag do not know about the 'allow the deeper drain of free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectation of problematic places would be in the swapout path.

I found a single
Re: Memory reserves or lack thereof
On 11 November 2012 20:24, Alfred Perlstein bri...@mu.org wrote:

I think very few of the m_nowaits actually need the reserve behavior. We should probably switch away from it digging that deep by default and introduce a flag and/or a per thread flag to set the behavior.

There's already a perfectly fine flag - M_WAITOK. Just don't hold any locks, right? :)

Adrian
Re: Memory reserves or lack thereof
On 11.11.2012 22:40, Alan Cox wrote:
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:

Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

Yes, we have many places where we don't want to sleep, for example in the network code. There we simply want to be told that we've run out of memory and handle the failure. It's expected to happen from time to time. We don't need or want to dig deep or into reserves. Packets are expected to get lost from time to time and upper layer protocols will handle retransmits just fine. What we *don't* want normally is to get blocked on a failing memory allocation. We'd rather drop this one and go on with the next packet to avoid the head of line blocking problem where everything cascades to a total halt.

As a side note we don't do many, if any, true interrupt time allocations anymore. Usually the interrupt is just acknowledged in interrupt context and a taskqueue or ithread is scheduled to do all the hard work. Neither runs in interrupt context.

This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it is ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.

In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says "dig deep into the cache/free page queues". The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.

I don't think many places depend on M_NOWAIT digging deep. I'm perfectly happy with having M_NOWAIT give up on first try. Only together with M_TRY_REALLY_HARD would it dig into reserves.

PS: We have a really nasty namespace collision with the mbuf flags which use the M_* prefix as well.

--
Andre
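The drop-and-continue policy described in this message can be sketched in userland C. This is only an illustration under stated assumptions, not network stack code: struct budget, try_alloc(), struct rx_stats and receive_packets() are hypothetical names, and a fixed byte budget stands in for an M_NOWAIT allocation that fails at once instead of sleeping or touching a reserve.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed budget standing in for memory that may run out. */
struct budget {
	size_t remaining;
};

/* Non-sleeping allocation attempt: returns NULL at once on shortage. */
static void *
try_alloc(struct budget *b, size_t size)
{
	void *p;

	if (size > b->remaining)
		return (NULL);
	p = malloc(size);
	if (p != NULL)
		b->remaining -= size;
	return (p);
}

struct rx_stats {
	unsigned delivered;
	unsigned dropped;
};

/* A failed allocation drops that packet; later packets still proceed. */
static void
receive_packets(struct budget *b, const size_t *sizes, size_t n,
    struct rx_stats *st)
{
	size_t i;

	assert(b != NULL && st != NULL);
	for (i = 0; i < n; i++) {
		void *buf = try_alloc(b, sizes[i]);

		if (buf == NULL) {
			st->dropped++;	/* the upper layer retransmits */
			continue;
		}
		st->delivered++;
		free(buf);
		b->remaining += sizes[i];
	}
}
```

The point of the design is that a packet too large for the current budget is counted as dropped while the packets behind it are still processed, so one failed allocation does not stall the queue.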
Re: Memory reserves or lack thereof
On 12.11.2012 03:02, Adrian Chadd wrote:
On 11 November 2012 13:40, Alan Cox alan.l@gmail.com wrote:

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

Well, what's the current set of best practices for allocating mbufs?

If an allocation is driven by user space then you can use M_WAITOK. If an allocation is driven by the driver or kernel (callout and so on) you do M_NOWAIT and handle a failure by trying again later, either directly by rescheduling the callout or by the upper layer retransmit logic.

On top of that, individual mbuf allocation or stitching mbufs and clusters together manually is deprecated. If at all possible you should use m_getm2().

I don't mind going through ath(4) and net80211(4), looking to make it behave better with mbuf allocations. There's 49 M_NOWAIT's in net80211 and 10 in ath(4). I wonder how many of them are synonyms with "don't fail allocating", too. Hm.

Mbuf allocations are normally allowed to fail without serious after effects other than retransmits and some overall recovery pain. Only non-mbuf memory allocations for important structures or state that can't be recreated on retransmit should dig into reserves. Normally this is a very rare case in network related code.

--
Andre
Re: Memory reserves or lack thereof
On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:
On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:

I have a memory subsystem design question that I'm hoping someone can answer.

I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.

Most allocators come down to vm_page_alloc(), which has this guard:

    if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
            page_req = VM_ALLOC_SYSTEM;
    };

    if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
        (page_req == VM_ALLOC_SYSTEM &&
        cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
        (page_req == VM_ALLOC_INTERRUPT &&
        cnt.v_free_count + cnt.v_cache_count > 0)) {

The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.

From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:

    if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
            pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
    else
            pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;

Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.

What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.

There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.

It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.

Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.

Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it is ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.

In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says "dig deep into the cache/free page queues". The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.

Alan

P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?

I am almost sure that most people who put the M_NOWAIT flag do not know about the 'allow the deeper drain of free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectation of problematic places would be in the swapout path.

I found a single explicit use of M_USE_RESERVE in the kernel, so the flip
Re: Memory reserves or lack thereof
On Mon, Nov 12, 2012 at 03:36:38PM +0200, Konstantin Belousov wrote:
On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:
On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:

I have a memory subsystem design question that I'm hoping someone can answer.

I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.

Most allocators come down to vm_page_alloc(), which has this guard:

    if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
            page_req = VM_ALLOC_SYSTEM;
    };

    if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
        (page_req == VM_ALLOC_SYSTEM &&
        cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
        (page_req == VM_ALLOC_INTERRUPT &&
        cnt.v_free_count + cnt.v_cache_count > 0)) {

The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.

From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:

    if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
            pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
    else
            pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;

Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.

What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.

There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.

It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.

Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.

Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it is ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.

In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says "dig deep into the cache/free page queues". The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.

Alan

P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?

I am almost sure that most people who put the M_NOWAIT flag do not know about the 'allow the deeper drain of free queue' effect. As such, I believe we should flip the meaning of
Re: Memory reserves or lack thereof
On Mon, 2012-11-12 at 13:18 +0100, Andre Oppermann wrote:
> > Well, what's the current set of best practices for allocating mbufs?
>
> If an allocation is driven by user space then you can use M_WAITOK.
>
> If an allocation is driven by the driver or kernel (callout and so on) you do M_NOWAIT and handle a failure by trying again later, either directly by rescheduling the callout or by the upper layer retransmit logic.
>
> On top of that individual mbuf allocation or stitching mbufs and clusters together manually is deprecated. If ever possible you should use m_getm2().

    root@pico:/root # man m_getm2
    No manual entry for m_getm2

So when you say manually stitching mbufs together is deprecated, I take it you mean in the case where you're letting the mbuf routines allocate the actual buffer space for you?

I've got an ethernet driver on an ARM SoC in which the hardware receives into a series of buffers fixed at 128 bytes. Right now the code is allocating a cluster and then looping using m_append() to reassemble these buffers back into a full contiguous frame in a cluster. I was going to have a shot at using MEXTADD() to manually string the series of hardware/dma buffers together without copying the data. Is that sort of usage still a good idea? (And would it actually be a performance win? If I hand it off to the net stack and an m_pullup() or similar is going to happen along the way anyway, I might as well do it at driver level.)

-- Ian
Re: Memory reserves or lack thereof
On 12.11.2012 15:47, Ian Lepore wrote:
> On Mon, 2012-11-12 at 13:18 +0100, Andre Oppermann wrote:
> > > Well, what's the current set of best practices for allocating mbufs?
> >
> > If an allocation is driven by user space then you can use M_WAITOK.
> >
> > If an allocation is driven by the driver or kernel (callout and so on) you do M_NOWAIT and handle a failure by trying again later, either directly by rescheduling the callout or by the upper layer retransmit logic.
> >
> > On top of that individual mbuf allocation or stitching mbufs and clusters together manually is deprecated. If ever possible you should use m_getm2().
>
>     root@pico:/root # man m_getm2
>     No manual entry for m_getm2

Oops... Have to fix that.

> So when you say manually stitching mbufs together is deprecated, I take it you mean in the case where you're letting the mbuf routines allocate the actual buffer space for you?

I mean allocating an mbuf, a cluster and then stitching them together. You can do it in one with m_getcl().

> I've got an ethernet driver on an ARM SoC in which the hardware receives into a series of buffers fixed at 128 bytes. Right now the code is allocating a cluster and then looping using m_append() to reassemble these buffers back into a full contiguous frame in a cluster. I was going to have a shot at using MEXTADD() to manually string the series of hardware/dma buffers together without copying the data. Is that sort of usage still a good idea? (And would it actually be a performance win?

That really depends on the particular usage. Attaching the 128 byte buffers to mbufs probably isn't much of a win considering an mbuf is 256 bytes in size. You could just as well copy each 128-byte buffer into the data section. Allocating a 2K cluster and copying into it is more efficient on the overall system.

> If I hand it off to the net stack and an m_pullup() or similar is going to happen along the way anyway, I might as well do it at driver level.)

If you properly m_align() the mbuf cluster before you copy into it there shouldn't be any m_pullup's happening.

-- Andre
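A rough userland model of the copy-into-one-cluster approach described above, for illustration only. It does not use the mbuf API: struct frame, frame_init() and frame_append() are made-up names, and the 2-byte pad mirrors the usual ETHER_ALIGN convention, which is an assumption here rather than something stated in the thread.

```c
#include <stddef.h>
#include <string.h>

#define CLUSTER_SIZE 2048  /* 2K cluster, as in the reply above */
#define FRAG_SIZE    128   /* hardware receive buffer size from Ian's message */
#define ALIGN_PAD    2     /* assumed: 14-byte Ethernet header + 2 puts the IP header on a 4-byte boundary */

/* Hypothetical stand-in for an mbuf cluster holding one received frame. */
struct frame {
	unsigned char storage[CLUSTER_SIZE];
	size_t off;   /* where the frame data starts inside storage */
	size_t len;   /* bytes of frame data copied so far */
};

/* Reserve the alignment pad before any data is copied in. */
static void
frame_init(struct frame *f)
{
	f->off = ALIGN_PAD;
	f->len = 0;
}

/* Append one hardware fragment; returns 0 on success, -1 if it would not fit. */
static int
frame_append(struct frame *f, const unsigned char *frag, size_t n)
{
	if (n > CLUSTER_SIZE - f->off - f->len)
		return (-1);
	memcpy(f->storage + f->off + f->len, frag, n);
	f->len += n;
	return (0);
}
```

The point of the pad is that the data is placed correctly once, at copy time, so nothing further up the stack has to move it again.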
Re: Memory reserves or lack thereof
On 11/12/2012 07:36, Konstantin Belousov wrote:
> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
> > On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:
> > > On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:
> > > > I have a memory subsystem design question that I'm hoping someone can answer.
> > > >
> > > > I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.
> > > >
> > > > Most allocators come down to vm_page_alloc(), which has this guard:
> > > >
> > > >     if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> > > >         page_req = VM_ALLOC_SYSTEM;
> > > >     };
> > > >
> > > >     if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> > > >         (page_req == VM_ALLOC_SYSTEM &&
> > > >         cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
> > > >         (page_req == VM_ALLOC_INTERRUPT &&
> > > >         cnt.v_free_count + cnt.v_cache_count > 0)) {
> > > >
> > > > The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.
> > > >
> > > > From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:
> > > >
> > > >     if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> > > >         pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> > > >     else
> > > >         pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> > > >
> > > > Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.
> > > >
> > > > What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.
> > > >
> > > > There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.
> > > >
> > > > It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.
> > > >
> > > > Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.
> > >
> > > Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.
> >
> > Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.
> >
> > The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.
> >
> > When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.
> >
> > This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.
> >
> > Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE" allocation is less likely to succeed than an "M_NOWAIT" allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was OK to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.
> >
> > In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce an M_USE_RESERVE-like flag that says dig deep into the cache/free page queues.
> >
> > The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.
> >
> > Alan
> >
> > P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?
>
> I am almost sure that most people who use the M_NOWAIT flag do not know about the 'allow the deeper drain of free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. The only problematic places I would expect are in the swapout path.
>
> I found a single explicit use of M_USE_RESERVE in the kernel, so the flip is relatively simple.

Agreed. Most recently I eliminated several
Re: Memory reserves or lack thereof
On Nov 12, 2012, at 4:11 AM, Andre Oppermann an...@freebsd.org wrote:
> I don't think many places depend on M_NOWAIT digging deep. I'm perfectly happy with having M_NOWAIT give up on first try. Only together with M_TRY_REALLY_HARD it would dig into reserves.
>
> PS: We have a really nasty namespace collision with the mbuf flags which use the M_* prefix as well.

Agreed.
Re: Memory reserves or lack thereof
On Mon, Nov 12, 2012 at 11:35:42AM -0600, Alan Cox wrote:
> Agreed. Most recently I eliminated several uses from the arm pmap implementations. There is, however, one other use:
>
> ofed/include/linux/gfp.h:#define GFP_ATOMIC (M_NOWAIT | M_USE_RESERVE)

Yes, I forgot to mention this. I have no idea about the semantics of the GFP_ATOMIC compat flag.

Below is the updated patch with your two notes applied.

diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
index dc9c307..ab1e869 100644
--- a/sys/amd64/amd64/uma_machdep.c
+++ b/sys/amd64/amd64/uma_machdep.c
@@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/param.h>
 #include <sys/lock.h>
+#include <sys/malloc.h>
 #include <sys/mutex.h>
 #include <sys/systm.h>
 #include <vm/vm.h>
@@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	int pflags;
 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags);
 		if (m == NULL) {
diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
index f60cdb1..75366e3 100644
--- a/sys/arm/arm/vm_machdep.c
+++ b/sys/arm/arm/vm_machdep.c
@@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 			ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
 			return (ret);
 	}
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
 		if (m == NULL) {
diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
index 71caa29..2ce1ca6 100644
--- a/sys/fs/devfs/devfs_devs.c
+++ b/sys/fs/devfs/devfs_devs.c
@@ -121,7 +121,7 @@ devfs_alloc(int flags)
 	struct cdev *cdev;
 	struct timespec ts;
-	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
+	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
 	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
 	if (cdp == NULL)
 		return (NULL);
diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
index 37353ff..9f77762 100644
--- a/sys/ia64/ia64/uma_machdep.c
+++ b/sys/ia64/ia64/uma_machdep.c
@@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	int pflags;
 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
index 798e632..24baef0 100644
--- a/sys/mips/mips/uma_machdep.c
+++ b/sys/mips/mips/uma_machdep.c
@@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	void *va;
 	*flags = UMA_SLAB_PRIV;
-
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT;
-	else
-		pflags = VM_ALLOC_SYSTEM;
+	pflags = m2vm_flags(wait, 0);
 	for (;;) {
 		m = pmap_alloc_direct_page(0, pflags);
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index a491680..3e320b9 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	*flags = UMA_SLAB_PRIV;
 	needed_lock = !PMAP_LOCKED(kernel_pmap);
-        if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-                pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-        else
-                pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-        if (wait & M_ZERO)
-                pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
index 162c7fb..3882bfa 100644
--- a/sys/powerpc/aim/slb.c
+++ b/sys/powerpc/aim/slb.c
@@ -483,12 +483,7 @@
Re: Memory reserves or lack thereof
This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.

Sushanth

--- On Mon, 11/12/12, Konstantin Belousov kostik...@gmail.com wrote:

> From: Konstantin Belousov kostik...@gmail.com
> Subject: Re: Memory reserves or lack thereof
> To: a...@freebsd.org
> Cc: p...@freebsd.org, Sears, Steven steven.se...@netapp.com, freebsd-hackers@freebsd.org freebsd-hackers@freebsd.org
> Date: Monday, November 12, 2012, 5:36 AM
>
> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
> > On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:
> > > On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:
> > > > I have a memory subsystem design question that I'm hoping someone can answer.
> > > >
> > > > I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.
> > > >
> > > > Most allocators come down to vm_page_alloc(), which has this guard:
> > > >
> > > >     if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> > > >         page_req = VM_ALLOC_SYSTEM;
> > > >     };
> > > >
> > > >     if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> > > >         (page_req == VM_ALLOC_SYSTEM &&
> > > >         cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
> > > >         (page_req == VM_ALLOC_INTERRUPT &&
> > > >         cnt.v_free_count + cnt.v_cache_count > 0)) {
> > > >
> > > > The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.
> > > >
> > > > From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:
> > > >
> > > >     if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> > > >         pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> > > >     else
> > > >         pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> > > >
> > > > Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.
> > > >
> > > > What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.
> > > >
> > > > There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.
> > > >
> > > > It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.
> > > >
> > > > Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.
> > >
> > > Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.
> >
> > Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.
> >
> > The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.
> >
> > When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.
> >
> > This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.
> >
> > Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE" allocation is less likely to succeed than an "M_NOWAIT" allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was OK to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.
> >
> > In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce an M_USE_RESERVE-like flag that says dig deep into the cache/free page queues.
> >
> > The trouble is that we then need to identify all of those places that are implicitly depending on the current
Re: Memory reserves or lack thereof
On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
> This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.

This is out of scope of the change. But it is required for any further adjustments.
Re: Memory reserves or lack thereof
On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
> On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
> > This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.
>
> This is out of scope of the change. But it is required for any further adjustments.

I would suggest a somewhat different response:

The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify "dig deep".

Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages.

When you say, "This is out of scope ...", I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change.

Alan
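The mapping described in this message could look roughly like the sketch below. The body of the patch's m2vm_flags() helper is not visible in the patch text quoted in this thread, so this is a guess at its shape, not the committed code. The name m2vm_flags_sketch() and every numeric value are placeholders picked so the example compiles outside the kernel.

```c
/* Placeholder malloc(9)-style flags; the values are assumptions. */
#define M_NOWAIT            0x0001
#define M_WAITOK            0x0002
#define M_ZERO              0x0100
#define M_USE_RESERVE       0x0400

/* Placeholder page-allocation classes and modifiers; values are assumptions. */
#define VM_ALLOC_INTERRUPT  1
#define VM_ALLOC_SYSTEM     2
#define VM_ALLOC_CLASS_MASK 3
#define VM_ALLOC_WIRED      0x0020
#define VM_ALLOC_ZERO       0x0040

/*
 * Translate malloc flags into page-allocation flags.  Sleeping or not
 * no longer affects the class: only an explicit M_USE_RESERVE selects
 * the class that may take the last pages.
 */
static int
m2vm_flags_sketch(int malloc_flags, int extra)
{
	int pflags;

	if ((malloc_flags & M_USE_RESERVE) != 0)
		pflags = VM_ALLOC_INTERRUPT;
	else
		pflags = VM_ALLOC_SYSTEM;
	if ((malloc_flags & M_ZERO) != 0)
		pflags |= VM_ALLOC_ZERO;
	return (pflags | extra);
}
```

With this shape, M_NOWAIT and M_WAITOK land in the same class and differ only in whether the caller may sleep.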
Re: Memory reserves or lack thereof
.. wait, so what exactly would the difference be between M_NOWAIT and M_WAITOK?

adrian
Re: Memory reserves or lack thereof
On 11/12/2012 5:24 PM, Adrian Chadd wrote:
> .. wait, so what exactly would the difference be between M_NOWAIT and M_WAITOK?

Whether or not the allocation can sleep until memory becomes available.
Re: Memory reserves or lack thereof
On 12 November 2012 15:26, Alan Cox a...@rice.edu wrote:
> On 11/12/2012 5:24 PM, Adrian Chadd wrote:
> > .. wait, so what exactly would the difference be between M_NOWAIT and M_WAITOK?
>
> Whether or not the allocation can sleep until memory becomes available.

Ok, so we're still maintaining that particular behaviour. Cool.

Adrian
Re: Memory reserves or lack thereof
--- On Mon, 11/12/12, Alan Cox a...@rice.edu wrote:

> From: Alan Cox a...@rice.edu
> Subject: Re: Memory reserves or lack thereof
> To: Konstantin Belousov kostik...@gmail.com
> Cc: Sushanth Rai sushanth_...@yahoo.com, a...@freebsd.org, p...@freebsd.org, StevenSears steven.se...@netapp.com, freebsd-hackers@freebsd.org freebsd-hackers@freebsd.org
> Date: Monday, November 12, 2012, 3:10 PM
>
> On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
> > On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
> > > This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep.
> >
> > This is out of scope of the change. But it is required for any further adjustments.
>
> I would suggest a somewhat different response:
>
> The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify "dig deep".
>
> Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages.

Thanks for the correction. I was associating VM_ALLOC_SYSTEM with just M_NOWAIT, as it seemed in the first version of the patch.

Sushanth
Re: Memory reserves or lack thereof
On 11/12/12 3:49 PM, Adrian Chadd wrote:
> On 12 November 2012 15:26, Alan Cox a...@rice.edu wrote:
> > On 11/12/2012 5:24 PM, Adrian Chadd wrote:
> > > .. wait, so what exactly would the difference be between M_NOWAIT and M_WAITOK?
> >
> > Whether or not the allocation can sleep until memory becomes available.
>
> Ok, so we're still maintaining that particular behaviour. Cool.

             | no mem             | mem avail
    ---------+--------------------+-----------
    M_WAITOK | wait, then success | success
    ---------+--------------------+-----------
    M_NOWAIT | returns failure    | success
    ---------+--------------------+-----------

the question is whether the top left can ever fail for any other reason.

> Adrian
Re: Memory reserves or lack thereof
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:
> On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:
> > I have a memory subsystem design question that I'm hoping someone can answer.
> >
> > I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.
> >
> > Most allocators come down to vm_page_alloc(), which has this guard:
> >
> >     if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> >         page_req = VM_ALLOC_SYSTEM;
> >     };
> >
> >     if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> >         (page_req == VM_ALLOC_SYSTEM &&
> >         cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
> >         (page_req == VM_ALLOC_INTERRUPT &&
> >         cnt.v_free_count + cnt.v_cache_count > 0)) {
> >
> > The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.
> >
> > From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:
> >
> >     if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> >         pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> >     else
> >         pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> >
> > Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.
> >
> > What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.
> >
> > There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.
> >
> > It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.
> >
> > Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.
>
> Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE" allocation is less likely to succeed than an "M_NOWAIT" allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was OK to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.

In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce an M_USE_RESERVE-like flag that says dig deep into the cache/free page queues.

The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.

Alan

P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?
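For anyone who wants to poke at the guard discussed in this message, here is a small userland model of it. It is only a sketch: struct vm_counts, enum page_req and page_alloc_allowed() are invented names standing in for the kernel's cnt fields and VM_ALLOC_* classes, the pagedaemon promotion to VM_ALLOC_SYSTEM is left out, and any thresholds fed into it are illustrative rather than real defaults.

```c
#include <stdbool.h>

/* Request classes, named after the kernel's VM_ALLOC_* classes. */
enum page_req { REQ_NORMAL, REQ_SYSTEM, REQ_INTERRUPT };

/* Stand-in for the counters the quoted guard reads from "cnt". */
struct vm_counts {
	unsigned free_count;         /* v_free_count */
	unsigned cache_count;        /* v_cache_count */
	unsigned free_reserved;      /* v_free_reserved */
	unsigned interrupt_free_min; /* v_interrupt_free_min */
};

/* Returns true when the quoted guard would let the allocation proceed. */
static bool
page_alloc_allowed(const struct vm_counts *c, enum page_req req)
{
	unsigned avail = c->free_count + c->cache_count;

	return (avail > c->free_reserved ||
	    (req == REQ_SYSTEM && avail > c->interrupt_free_min) ||
	    (req == REQ_INTERRUPT && avail > 0));
}
```

The last clause is the one at issue in the thread: the interrupt class is refused only when nothing at all is left.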
Re: Memory reserves or lack thereof
Alan writes:
> In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says dig deep into the cache/free page queues.
>
> The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.

    find /usr/src/sys | xargs grep M_NOWAIT | wc -l
    2101

Sounds like a lot of work that would need to happen atomically.

Would this work:

    M_NO_WAIT      do not sleep, do not dig deep unless M_USE_RESERVE also set
    M_USE_RESERVE  dig deep
    M_NOWAIT       M_NO_WAIT | M_USE_RESERVE  (deprecated)

New code avoids using M_NOWAIT. Existing code continues working the same way. As time permits, old code is converted to new flags. Eventually M_NOWAIT goes away.

Pro: the amount of code that needs to change atomically is much smaller.

Con: (1) Have to remember (or look up) difference between M_NOWAIT and M_NO_WAIT. Maybe calling the new flag M_NO_SLEEP would help? (2) Would M_NOWAIT really ever go away? The spl() calls haven't, even after some cage rattling.
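Spelled out as C, the transitional scheme proposed above might look like the sketch below. The bit values are placeholders, and may_sleep() and may_use_reserve() are hypothetical helpers added only to show how an allocator could read the flags; none of this is existing kernel code.

```c
/* Placeholder bit values, for illustration only. */
#define M_NO_WAIT      0x0001  /* proposed: never sleep, leave the reserve alone */
#define M_USE_RESERVE  0x0400  /* may dig into the reserve */

/* Legacy spelling: keeps the current "no sleep and dig deep" behavior. */
#define M_NOWAIT       (M_NO_WAIT | M_USE_RESERVE)

/* Hypothetical helper: may the allocation block? */
static int
may_sleep(int flags)
{
	return ((flags & M_NO_WAIT) == 0);
}

/* Hypothetical helper: may the allocation dig into the reserve? */
static int
may_use_reserve(int flags)
{
	return ((flags & M_USE_RESERVE) != 0);
}
```

One thing the sketch makes visible: once M_NOWAIT covers two bits, a test written as (flags & M_NOWAIT) would also be true for a bare M_USE_RESERVE, so existing flag checks would need a look as well.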
Re: Memory reserves or lack thereof
On 11 November 2012 13:40, Alan Cox alan.l@gmail.com wrote:
> Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.
>
> The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues.
>
> When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.

Well, what's the current set of best practices for allocating mbufs?

I don't mind going through ath(4) and net80211(4), looking to make it behave better with mbuf allocations. There's 49 M_NOWAIT's in net80211 and 10 in ath(4). I wonder how many of them are synonyms with "don't fail allocating", too. Hm.

Adrian
Re: Memory reserves or lack thereof
I think very few of the M_NOWAITs actually need the reserve behavior. We should probably switch away from it digging that deep by default and introduce a flag and/or a per-thread flag to set the behavior.

On Nov 11, 2012, at 4:32 PM, Dieter BSD dieter...@engineer.com wrote:

> Alan writes:
>
> > In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says dig deep into the cache/free page queues. The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE.
>
>     find /usr/src/sys | xargs grep M_NOWAIT | wc -l
>     2101
>
> Sounds like a lot of work that would need to happen atomically.
>
> Would this work:
>
>     M_NO_WAIT       do not sleep, do not dig deep unless M_USE_RESERVE also set
>     M_USE_RESERVE   dig deep
>     M_NOWAIT        M_NO_WAIT | M_USE_RESERVE (deprecated)
>
> New code avoids using M_NOWAIT. Existing code continues working the same way. As time permits, old code is converted to new flags. Eventually M_NOWAIT goes away.
>
> Pro: the amount of code that needs to change atomically is much smaller.
>
> Con:
> (1) Have to remember (or look up) difference between M_NOWAIT and M_NO_WAIT. Maybe calling the new flag M_NO_SLEEP would help?
> (2) Would M_NOWAIT really ever go away? The spl() calls haven't, even after some cage rattling.
Re: Memory reserves or lack thereof
On Fri, Nov 09, 2012 at 07:10:04PM +, Sears, Steven wrote:

> I have a memory subsystem design question that I'm hoping someone can answer.
>
> I've been looking at a machine that is completely out of memory, as in
>
>     v_free_count = 0,
>     v_cache_count = 0,
>
> I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.
>
> Most allocators come down to vm_page_alloc(), which has this guard:
>
>     if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
>             page_req = VM_ALLOC_SYSTEM;
>     };
>
>     if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
>         (page_req == VM_ALLOC_SYSTEM &&
>         cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
>         (page_req == VM_ALLOC_INTERRUPT &&
>         cnt.v_free_count + cnt.v_cache_count > 0)) {
>
> The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page.
>
> From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping:
>
>     if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>             pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>     else
>             pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>
> Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.
>
> What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.
>
> There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.
>
> It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover.
>
> Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.

Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK.
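The behavior analyzed in this message can be reproduced with a small userland sketch. It is a model under stated assumptions, not the kernel implementation: `struct vm_counts`, the helper names `page_available()`, `req_from_flags()` and `drain()`, and the numeric flag values are hypothetical. Only the three comparisons and the flag mapping are taken from the code quoted in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Request classes named in the quoted guard; VM_ALLOC_NORMAL is the default. */
enum page_req { VM_ALLOC_NORMAL, VM_ALLOC_SYSTEM, VM_ALLOC_INTERRUPT };

/* Hypothetical flag values for this sketch, not those in sys/malloc.h. */
#define M_NOWAIT      0x0001
#define M_USE_RESERVE 0x0002

/* Stand-in for the kernel's global counters (an assumption of this sketch). */
struct vm_counts {
    unsigned v_free_count;
    unsigned v_cache_count;
    unsigned v_free_reserved;
    unsigned v_interrupt_free_min;
};

/* Same comparisons as the quoted vm_page_alloc() guard. */
static bool page_available(const struct vm_counts *c, enum page_req req)
{
    unsigned avail = c->v_free_count + c->v_cache_count;

    return avail > c->v_free_reserved ||
        (req == VM_ALLOC_SYSTEM && avail > c->v_interrupt_free_min) ||
        (req == VM_ALLOC_INTERRUPT && avail > 0);
}

/* Same flag mapping as the quoted kmem_malloc()/uma_small_alloc() code. */
static enum page_req req_from_flags(int flags)
{
    if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
        return VM_ALLOC_INTERRUPT;
    return VM_ALLOC_SYSTEM;
}

/* Hypothetical helper: take pages until the guard refuses; return pages left. */
static unsigned drain(struct vm_counts *c, enum page_req req)
{
    assert(c != 0);
    while (page_available(c, req)) {
        /* The guard only passes when at least one page remains. */
        if (c->v_free_count > 0)
            c->v_free_count--;
        else
            c->v_cache_count--;
    }
    return c->v_free_count + c->v_cache_count;
}
```

With illustrative counters of 10 free pages, a v_free_reserved of 5 and a v_interrupt_free_min of 2, draining with VM_ALLOC_NORMAL stops at 5 pages, VM_ALLOC_SYSTEM stops at 2, and the VM_ALLOC_INTERRUPT class that plain M_NOWAIT maps to stops at 0, which matches the exhausted machine described in the message.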