Re: [PATCH] let the mbufs use more then 4gb of memory
> Date: Fri, 12 Aug 2016 14:26:34 +0200
> From: Claudio Jeker
>
> On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:
> > > On 1 Aug 2016, at 21:07, Simon Mages wrote:
> > >
> > > I sent this message to dlg@ directly to discuss my modification of his
> > > diff to make the bigger mbuf clusters work. I got no response so far,
> > > that's why I decided to post it on tech@ directly. Maybe this way I get
> > > some feedback faster :)
> >
> > hey simon,
> >
> > i was travelling when you sent your mail to me and then it fell out of my
> > head. sorry about that.
> >
> > if this is working correctly then i would like to put it in the tree. from
> > the light testing i have done, it is working correctly. would anyone object?
> >
> > some performance measurement would also be interesting :)
>
> I would prefer we take the diff I started at n2k16. I need to dig it out
> though.

I think the subject of the thread has become misleading. At least the diff I
think David and Simon are talking about is about using the larger mbuf pools
for socket buffers, and no longer about using memory >4G for them.

David, Simon, best to start all over again and repost the diff with a proper
subject and explanation. You shouldn't be forcing other developers to read
through several pages of private conversation.
Re: [PATCH] let the mbufs use more then 4gb of memory
On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:
> > On 1 Aug 2016, at 21:07, Simon Mages wrote:
> >
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the bigger mbuf clusters work. I got no response so far,
> > that's why I decided to post it on tech@ directly. Maybe this way I get
> > some feedback faster :)
>
> hey simon,
>
> i was travelling when you sent your mail to me and then it fell out of my
> head. sorry about that.
>
> if this is working correctly then i would like to put it in the tree. from
> the light testing i have done, it is working correctly. would anyone object?
>
> some performance measurement would also be interesting :)

I would prefer we take the diff I started at n2k16. I need to dig it out
though.

-- 
:wq Claudio
Re: [PATCH] let the mbufs use more then 4gb of memory
> From: David Gwynne
> Date: Fri, 12 Aug 2016 16:38:45 +1000
>
> > On 1 Aug 2016, at 21:07, Simon Mages wrote:
> >
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the bigger mbuf clusters work. I got no response so far,
> > that's why I decided to post it on tech@ directly. Maybe this way I get
> > some feedback faster :)
>
> hey simon,
>
> i was travelling when you sent your mail to me and then it fell out
> of my head. sorry about that.
>
> if this is working correctly then i would like to put it in the tree. from
> the light testing i have done, it is working correctly. would anyone object?
>
> some performance measurement would also be interesting :)

Hmm, during debugging I've relied on the fact that only drivers allocate
the larger mbuf clusters for their rx rings.

Anyway, shouldn't the diff be using ulmin()?

> dlg
>
> >
> > BR
> > Simon
> >
> > ### Original Mail:
> >
> > ------ Forwarded message ------
> > From: Simon Mages
> > Date: Fri, 22 Jul 2016 13:24:24 +0200
> > Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> > To: David Gwynne
> >
> > Hi,
> >
> > I think I found the problem with your diff regarding the bigger mbuf
> > clusters.
> >
> > You choose a buffer size based on space and resid, but what happens when
> > resid is larger than space and space is, for example, 2050? The cluster
> > chosen then has size 4096. But this size is too large for the socket
> > buffer. In the past this was never a problem because you only allocated
> > external clusters of size MCLBYTES, and this was only done when space was
> > larger than MCLBYTES.
> >
> > diff:
> > Index: kern/uipc_socket.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> > retrieving revision 1.152
> > diff -u -p -u -p -r1.152 uipc_socket.c
> > --- kern/uipc_socket.c	13 Jun 2016 21:24:43 -0000	1.152
> > +++ kern/uipc_socket.c	22 Jul 2016 10:56:02 -0000
> > @@ -496,15 +496,18 @@ restart:
> >  				mlen = MLEN;
> >  			}
> >  			if (resid >= MINCLSIZE && space >= MCLBYTES) {
> > -				MCLGET(m, M_NOWAIT);
> > +				MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> > +				    lmin(space, MAXMCLBYTES)));
> >  				if ((m->m_flags & M_EXT) == 0)
> >  					goto nopages;
> >  				if (atomic && top == 0) {
> > -					len = ulmin(MCLBYTES - max_hdr,
> > -					    resid);
> > +					len = lmin(lmin(resid, space),
> > +					    m->m_ext.ext_size -
> > +					    max_hdr);
> >  					m->m_data += max_hdr;
> >  				} else
> > -					len = ulmin(MCLBYTES, resid);
> > +					len = lmin(lmin(resid, space),
> > +					    m->m_ext.ext_size);
> >  				space -= len;
> >  			} else {
> > nopages:
> >
> > I'm using this diff now for a while on my notebook and everything works
> > as expected. But I had no time to really test it or test the performance.
> > This will be my next step.
> >
> > I reproduced the unix socket problem you mentioned with the following
> > little program:
> >
> > #include
> > #include
> > #include
> > #include
> > #include
> > #include
> > #include
> >
> > #include
> > #include
> > #include
> >
> > #define FILE "/tmp/afile"
> >
> > int senddesc(int fd, int so);
> > int recvdesc(int so);
> >
> > int
> > main(void)
> > {
> > 	struct stat sb;
> > 	int sockpair[2];
> > 	pid_t pid = 0;
> > 	int status;
> > 	int newfile;
> >
> > 	if (unlink(FILE) < 0)
> > 		warn("unlink: %s", FILE);
> >
> > 	int file = open(FILE, O
Re: [PATCH] let the mbufs use more then 4gb of memory
On 2016-06-23 05:42, Theo de Raadt wrote:
> > secondly, allocating more than 4g at a time to socket buffers is
> > generally a waste of memory.
>
> and there is one further problem.
>
> Eventually, this subsystem will starve the system. Other subsystems which
> also need large amounts of memory then have to scramble.
>
> There have to be backpressure mechanisms in each subsystem to force out
> memory. There is no such mechanism in socket buffers.
>
> The mechanisms in the remaining parts of the kernel have always proven to
> be weak, as in, they don't interact as nicely as we want to create space.
> There has been much work to make them work better. However, in socket
> buffers there is no such mechanism. What are you going to do? Throw data
> away? You can't do that.
>
> Therefore, you are holding the remaining system components hostage, and
> your diff creates deadlock. You probably tested your diff under ideal
> conditions with gobs of memory...

The backpressure mechanism to free up [disk IO] buffer cache content is
really effective though, so 90 is a mostly suitable bufcachepercent sysctl
setting, right?
Re: [PATCH] let the mbufs use more then 4gb of memory
> On 1 Aug 2016, at 21:07, Simon Mages wrote:
>
> I sent this message to dlg@ directly to discuss my modification of his
> diff to make the bigger mbuf clusters work. I got no response so far,
> that's why I decided to post it on tech@ directly. Maybe this way I get
> some feedback faster :)

hey simon,

i was travelling when you sent your mail to me and then it fell out of my
head. sorry about that.

if this is working correctly then i would like to put it in the tree. from
the light testing i have done, it is working correctly. would anyone object?

some performance measurement would also be interesting :)

dlg

> BR
> Simon
>
> ### Original Mail:
>
> -- Forwarded message --
> From: Simon Mages
> Date: Fri, 22 Jul 2016 13:24:24 +0200
> Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> To: David Gwynne
>
> Hi,
>
> I think I found the problem with your diff regarding the bigger mbuf
> clusters.
>
> You choose a buffer size based on space and resid, but what happens when
> resid is larger than space and space is, for example, 2050? The cluster
> chosen then has size 4096. But this size is too large for the socket
> buffer. In the past this was never a problem because you only allocated
> external clusters of size MCLBYTES, and this was only done when space was
> larger than MCLBYTES.
>
> diff:
> Index: kern/uipc_socket.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.152
> diff -u -p -u -p -r1.152 uipc_socket.c
> --- kern/uipc_socket.c	13 Jun 2016 21:24:43 -0000	1.152
> +++ kern/uipc_socket.c	22 Jul 2016 10:56:02 -0000
> @@ -496,15 +496,18 @@ restart:
>  				mlen = MLEN;
>  			}
>  			if (resid >= MINCLSIZE && space >= MCLBYTES) {
> -				MCLGET(m, M_NOWAIT);
> +				MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> +				    lmin(space, MAXMCLBYTES)));
>  				if ((m->m_flags & M_EXT) == 0)
>  					goto nopages;
>  				if (atomic && top == 0) {
> -					len = ulmin(MCLBYTES - max_hdr,
> -					    resid);
> +					len = lmin(lmin(resid, space),
> +					    m->m_ext.ext_size -
> +					    max_hdr);
>  					m->m_data += max_hdr;
>  				} else
> -					len = ulmin(MCLBYTES, resid);
> +					len = lmin(lmin(resid, space),
> +					    m->m_ext.ext_size);
>  				space -= len;
>  			} else {
> nopages:
>
> I'm using this diff now for a while on my notebook and everything works
> as expected. But I had no time to really test it or test the performance.
> This will be my next step.
>
> I reproduced the unix socket problem you mentioned with the following
> little program:
>
> #include
> #include
> #include
> #include
> #include
> #include
> #include
>
> #include
> #include
> #include
>
> #define FILE "/tmp/afile"
>
> int senddesc(int fd, int so);
> int recvdesc(int so);
>
> int
> main(void)
> {
> 	struct stat sb;
> 	int sockpair[2];
> 	pid_t pid = 0;
> 	int status;
> 	int newfile;
>
> 	if (unlink(FILE) < 0)
> 		warn("unlink: %s", FILE);
>
> 	int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC);
>
> 	if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
> 		err(1, "socketpair");
>
> 	if ((pid = fork())) {
> 		senddesc(file, sockpair[0]);
> 		if (waitpid(pid, &status, 0) < 0)
> 			err(1, "waitpid");
> 	} else {
> 		newfile = recvdesc(sockpair[1]);
> 		if (fstat(newfile, &sb) < 0)
> 			err(1, "fstat");
> 	}
>
> 	return 0;
> }
>
> int
> senddesc(int fd, int so)
> {
> 	struct msghdr msg;
> 	struct cmsghdr *cmsg;
> 	union {
> 		struct cmsghdr hdr;
Re: [PATCH] let the mbufs use more then 4gb of memory
On Thu, Jun 23, 2016 at 02:41:53PM +0200, Mark Kettenis wrote:
> > Date: Thu, 23 Jun 2016 13:09:28 +0200
> > From: Alexander Bluhm
> >
> > On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> > > secondly, allocating more than 4g at a time to socket buffers is
> > > generally a waste of memory. in practice you should scale the amount
> > > of memory available to sockets according to the size of the tcp
> > > windows you need to saturate the bandwidth available to the box.
> >
> > Currently OpenBSD limits the socket buffer size to 256k.
> > #define SB_MAX	(256*1024)	/* default for max chars in sockbuf */
> >
> > For downloading large files from the internet this is not sufficient
> > anymore. After customer complaints we have increased the limit to 1MB.
> > This still does not give maximum throughput, but granting more could
> > easily result in running out of mbufs. 16MB would be sufficient.
> >
> > Besides single connections with high throughput we also have a lot of
> > long-running connections, say some 10000. Each connection over a relay
> > needs two sockets and four socket buffers. With a 1MB limit and 10000
> > connections the theoretical maximum is 40GB.
> >
> > It is hard to figure out in advance which connections need socket
> > buffer space. tcp_update_{snd,rcv}space() adjusts it dynamically, and
> > sbchecklowmem() has a first-come, first-served policy. Another
> > challenge is that the peers on both sides of the relay can decide
> > whether they fill our buffers.
> >
> > Besides finding a smarter algorithm to distribute the socket buffer
> > space, increasing the number of mbufs could be a solution. Our server
> > machines mostly relay connection data, so it seems seductive to use
> > much more mbuf memory to speed up TCP connections. Without 64-bit DMA
> > most memory of the machine is unused.
> >
> > Also, modern BIOSes map only 2GB in the low region. All DMA devices
> > must share these. Putting mbufs high should reduce pressure.
> >
> > Of course there are problems with network adapters that support less
> > DMA space and with hotplug configurations. For a general solution we
> > can implement bounce buffers, disable the feature on such machines, or
> > have a knob.
>
> We really don't want to implement bounce-buffers. Adding IOMMU
> support is probably a better approach as it also brings some security
> benefits. Not all amd64 hardware supports an IOMMU. And hardware
> that does support it doesn't always have it enabled. But for modern
> hardware an iommu is pretty much standard, except for the absolute
> low-end. But those low-end machines tend to have only 2GB of memory
> anyway.

Another option is to use m_defrag() to move the mbuf down from high memory
in case it is needed. I think this is much simpler to implement, and the
devices that need it can be identified fairly easily. This only solves the
TX side; on the RX side the bouncing would need to be done in the socket
buffers (it would make sense to use large mbuf clusters in socket buffers
and copy the data over).

-- 
:wq Claudio
Re: [PATCH] let the mbufs use more then 4gb of memory
On Thursday 23 June 2016 14:41:53, Mark Kettenis wrote:
> We really don't want to implement bounce-buffers. Adding IOMMU
> support is probably a better approach as it also brings some
> security benefits. Not all amd64 hardware supports an IOMMU. And
> hardware that does support it doesn't always have it enabled. But
> for modern hardware an iommu is pretty much standard, except for
> the absolute low-end. But those low-end machines tend to have only
> 2GB of memory anyway.

On amd64, modern would mean Skylake or newer. At least until Haswell (not
sure about Broadwell), Intel considered VT-d to be a high-end feature, and
many desktop CPUs don't have it enabled. It is easy to find systems with
16GB of RAM or more without an IOMMU.

Stefan
Re: [PATCH] let the mbufs use more then 4gb of memory
Mark Kettenis [mark.kette...@xs4all.nl] wrote:
>
> We really don't want to implement bounce-buffers. Adding IOMMU
> support is probably a better approach as it also brings some security
> benefits. Not all amd64 hardware supports an IOMMU. And hardware
> that does support it doesn't always have it enabled. But for modern
> hardware an iommu is pretty much standard, except for the absolute
> low-end. But those low-end machines tend to have only 2GB of memory
> anyway.

Is the sparc64 iommu code port usable for this purpose?

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/Attic/sg_dma.c
Re: [PATCH] let the mbufs use more then 4gb of memory
> Date: Thu, 23 Jun 2016 13:09:28 +0200
> From: Alexander Bluhm
>
> On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> > secondly, allocating more than 4g at a time to socket buffers is
> > generally a waste of memory. in practice you should scale the amount
> > of memory available to sockets according to the size of the tcp
> > windows you need to saturate the bandwidth available to the box.
>
> Currently OpenBSD limits the socket buffer size to 256k.
> #define SB_MAX	(256*1024)	/* default for max chars in sockbuf */
>
> For downloading large files from the internet this is not sufficient
> anymore. After customer complaints we have increased the limit to 1MB.
> This still does not give maximum throughput, but granting more could
> easily result in running out of mbufs. 16MB would be sufficient.
>
> Besides single connections with high throughput we also have a lot of
> long-running connections, say some 10000. Each connection over a relay
> needs two sockets and four socket buffers. With a 1MB limit and 10000
> connections the theoretical maximum is 40GB.
>
> It is hard to figure out in advance which connections need socket buffer
> space. tcp_update_{snd,rcv}space() adjusts it dynamically, and
> sbchecklowmem() has a first-come, first-served policy. Another challenge
> is that the peers on both sides of the relay can decide whether they
> fill our buffers.
>
> Besides finding a smarter algorithm to distribute the socket buffer
> space, increasing the number of mbufs could be a solution. Our server
> machines mostly relay connection data, so it seems seductive to use much
> more mbuf memory to speed up TCP connections. Without 64-bit DMA most
> memory of the machine is unused.
>
> Also, modern BIOSes map only 2GB in the low region. All DMA devices must
> share these. Putting mbufs high should reduce pressure.
>
> Of course there are problems with network adapters that support less DMA
> space and with hotplug configurations. For a general solution we can
> implement bounce buffers, disable the feature on such machines, or have
> a knob.

We really don't want to implement bounce-buffers. Adding IOMMU
support is probably a better approach as it also brings some security
benefits. Not all amd64 hardware supports an IOMMU. And hardware
that does support it doesn't always have it enabled. But for modern
hardware an iommu is pretty much standard, except for the absolute
low-end. But those low-end machines tend to have only 2GB of memory
anyway.
Re: [PATCH] let the mbufs use more then 4gb of memory
On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> secondly, allocating more than 4g at a time to socket buffers is
> generally a waste of memory. in practice you should scale the amount
> of memory available to sockets according to the size of the tcp
> windows you need to saturate the bandwidth available to the box.

Currently OpenBSD limits the socket buffer size to 256k.
#define SB_MAX	(256*1024)	/* default for max chars in sockbuf */

For downloading large files from the internet this is not sufficient
anymore. After customer complaints we have increased the limit to 1MB. This
still does not give maximum throughput, but granting more could easily
result in running out of mbufs. 16MB would be sufficient.

Besides single connections with high throughput we also have a lot of
long-running connections, say some 10000. Each connection over a relay needs
two sockets and four socket buffers. With a 1MB limit and 10000 connections
the theoretical maximum is 40GB.

It is hard to figure out in advance which connections need socket buffer
space. tcp_update_{snd,rcv}space() adjusts it dynamically, and
sbchecklowmem() has a first-come, first-served policy. Another challenge is
that the peers on both sides of the relay can decide whether they fill our
buffers.

Besides finding a smarter algorithm to distribute the socket buffer space,
increasing the number of mbufs could be a solution. Our server machines
mostly relay connection data, so it seems seductive to use much more mbuf
memory to speed up TCP connections. Without 64-bit DMA most memory of the
machine is unused.

Also, modern BIOSes map only 2GB in the low region. All DMA devices must
share these. Putting mbufs high should reduce pressure.

Of course there are problems with network adapters that support less DMA
space and with hotplug configurations. For a general solution we can
implement bounce buffers, disable the feature on such machines, or have a
knob.

bluhm
Re: [PATCH] let the mbufs use more then 4gb of memory
> secondly, allocating more than 4g at a time to socket buffers is
> generally a waste of memory.

and there is one further problem.

Eventually, this subsystem will starve the system. Other subsystems which
also need large amounts of memory then have to scramble.

There have to be backpressure mechanisms in each subsystem to force out
memory. There is no such mechanism in socket buffers.

The mechanisms in the remaining parts of the kernel have always proven to be
weak, as in, they don't interact as nicely as we want to create space. There
has been much work to make them work better. However, in socket buffers
there is no such mechanism. What are you going to do? Throw data away? You
can't do that.

Therefore, you are holding the remaining system components hostage, and your
diff creates deadlock. You probably tested your diff under ideal conditions
with gobs of memory...
Re: [PATCH] let the mbufs use more then 4gb of memory
On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> On a system where you use the maximum socket buffer size of 256kbyte you
> can run out of memory after less than 9k open sockets.
>
> My patch adds a new uvm_constraint for the mbufs with a bigger memory
> area. I chose this area after reading the comments in
> sys/arch/amd64/include/pmap.h. This patch further changes the maximum
> socket buffer size from 256k to 1gb as it is described in RFC 1323 S2.3.

You read that RFC wrong. I see no reason to increase the socket buffer size
to such a huge value. A change like this is currently not acceptable.

> I tested this diff with the ix, em and urndis driver. I know that this
> diff only works for amd64 right now, but I wanted to send this diff as a
> proposal of what could be done. Maybe somebody has a different solution
> for this problem or can tell me why this is a bad idea.

Are you sure that all drivers are able to handle memory with physical
addresses that are more than 32 bits long? I doubt this. I think a lot more
is needed than this diff to make this work even just for amd64.

> Index: arch/amd64/amd64/bus_dma.c
> ===================================================================
> RCS file: /openbsd/src/sys/arch/amd64/amd64/bus_dma.c,v
> retrieving revision 1.49
> diff -u -p -u -p -r1.49 bus_dma.c
> --- arch/amd64/amd64/bus_dma.c	17 Dec 2015 17:16:04 -0000	1.49
> +++ arch/amd64/amd64/bus_dma.c	22 Jun 2016 11:33:17 -0000
> @@ -584,7 +584,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t,
>  	 */
>  	pmap_extract(pmap, vaddr, (paddr_t *)&curaddr);
>
> -	if (curaddr > dma_constraint.ucr_high)
> +	if (curaddr > mbuf_constraint.ucr_high)
>  		panic("Non dma-reachable buffer at curaddr %#lx(raw)",
>  		    curaddr);
>
> Index: arch/amd64/amd64/machdep.c
> ===================================================================
> RCS file: /openbsd/src/sys/arch/amd64/amd64/machdep.c,v
> retrieving revision 1.221
> diff -u -p -u -p -r1.221 machdep.c
> --- arch/amd64/amd64/machdep.c	21 May 2016 00:56:43 -0000	1.221
> +++ arch/amd64/amd64/machdep.c	22 Jun 2016 11:33:17 -0000
> @@ -202,9 +202,11 @@ struct vm_map *phys_map = NULL;
>  /* UVM constraint ranges. */
>  struct uvm_constraint_range isa_constraint = { 0x0, 0x00ffffffUL };
>  struct uvm_constraint_range dma_constraint = { 0x0, 0xffffffffUL };
> +struct uvm_constraint_range mbuf_constraint = { 0x0, 0xfUL };
>  struct uvm_constraint_range *uvm_md_constraints[] = {
>  	&isa_constraint,
>  	&dma_constraint,
> +	&mbuf_constraint,
>  	NULL,
>  };
>
> Index: kern/uipc_mbuf.c
> ===================================================================
> RCS file: /openbsd/src/sys/kern/uipc_mbuf.c,v
> retrieving revision 1.226
> diff -u -p -u -p -r1.226 uipc_mbuf.c
> --- kern/uipc_mbuf.c	13 Jun 2016 21:24:43 -0000	1.226
> +++ kern/uipc_mbuf.c	22 Jun 2016 11:33:18 -0000
> @@ -153,7 +153,7 @@ mbinit(void)
>
>  	pool_init(&mbpool, MSIZE, 0, 0, 0, "mbufpl", NULL);
>  	pool_setipl(&mbpool, IPL_NET);
> -	pool_set_constraints(&mbpool, &kp_dma_contig);
> +	pool_set_constraints(&mbpool, &kp_mbuf_contig);
>  	pool_setlowat(&mbpool, mblowat);
>
>  	pool_init(&mtagpool, PACKET_TAG_MAXSIZE + sizeof(struct m_tag),
> @@ -166,7 +166,7 @@ mbinit(void)
>  		pool_init(&mclpools[i], mclsizes[i], 0, 0, 0,
>  		    mclnames[i], NULL);
>  		pool_setipl(&mclpools[i], IPL_NET);
> -		pool_set_constraints(&mclpools[i], &kp_dma_contig);
> +		pool_set_constraints(&mclpools[i], &kp_mbuf_contig);
>  		pool_setlowat(&mclpools[i], mcllowat);
>  	}
>
> Index: sys/socketvar.h
> ===================================================================
> RCS file: /openbsd/src/sys/sys/socketvar.h,v
> retrieving revision 1.60
> diff -u -p -u -p -r1.60 socketvar.h
> --- sys/socketvar.h	25 Feb 2016 07:39:09 -0000	1.60
> +++ sys/socketvar.h	22 Jun 2016 11:33:18 -0000
> @@ -112,7 +112,7 @@ struct socket {
>  		short	sb_flags;	/* flags, see below */
>  		u_short	sb_timeo;	/* timeout for read/write */
>  	} so_rcv, so_snd;
> -#define	SB_MAX		(256*1024)	/* default for max chars in sockbuf */
> +#define	SB_MAX		(1024*1024*1024) /* default for max chars in sockbuf */
>  #define	SB_LOCK		0x01	/* lock on data queue */
>  #define	SB_WANT		0x02	/* someone is waiting to lock */
>  #define	SB_WAIT		0x04	/* someone is waiting for data/space */
> Index: uvm/uvm_extern.h
> ===================================================================
> RCS file: /openbsd/src/sys/uvm/uvm_extern.h,v
> retrieving revision 1.139
> diff -u -p -u -p -r1.139 uvm_extern.h
> --- uvm/uvm_extern.h	5 Jun 2016 08:35:57 -0000	1.139
> +
Re: [PATCH] let the mbufs use more then 4gb of memory
On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> On a system where you use the maximum socket buffer size of 256kbyte you
> can run out of memory after less than 9k open sockets.
>
> My patch adds a new uvm_constraint for the mbufs with a bigger memory
> area. I chose this area after reading the comments in
> sys/arch/amd64/include/pmap.h. This patch further changes the maximum
> socket buffer size from 256k to 1gb as it is described in RFC 1323 S2.3.
>
> I tested this diff with the ix, em and urndis driver. I know that this
> diff only works for amd64 right now, but I wanted to send this diff as a
> proposal of what could be done. Maybe somebody has a different solution
> for this problem or can tell me why this is a bad idea.

hey simon,

first, some background.

the 4G watermark is less about limiting the amount of memory used by the
network stack and more about making the memory addressable by as many
devices, including network cards, as possible. we support older chips that
only deal with 32 bit addresses (and one or two stupid ones with an
inability to address over 1G), so we took the conservative option and made
the memory generally usable without developers having to think about it
much.

you could argue that you should be able to give big addresses to modern
cards, but that falls down if you are forwarding packets between a modern
and an old card, cos the old card will want to dma the packet the modern
card rxed, but it needs it below the 4g line. even if you dont have an old
card, in todays hotplug world you might plug an old device in. either way,
the future of an mbuf is very hard for the kernel to predict.

secondly, allocating more than 4g at a time to socket buffers is generally
a waste of memory. in practice you should scale the amount of memory
available to sockets according to the size of the tcp windows you need to
saturate the bandwidth available to the box. this means if you want to
sustain a gigabit of traffic with a 300ms round trip time for packets,
you'd "only" need ~37.5 megabytes of buffers. to sustain 40 gigabit you'd
need 1.5 gigabytes, which is still below 4G. allowing more use of memory
for buffers would likely induce latency.

the above means that if you want to sustain a single 40G tcp connection to
that host you'd need to be able to place 1.5G on the socket buffer, which
is above the 1G you mention above. however, if you want to sustain 2
connections, you ideally want to fairly share the 1.5G between both
sockets. they should get 750M each. fairly sharing buffers between the
sockets may already be in place in openbsd. when i reworked the pools
subsystem i set it up so things sleeping on memory were woken up in order.

it occurs to me that perhaps we should limit mbufs by the bytes they can
use rather than the number of them. that would also work well if we moved
to per cpu caches for mbufs and clusters, cos the number of active mbufs in
the system becomes hard to limit accurately if we want cpus to run
independently.

if you want something to work on in this area, could you look at letting
sockets use the "jumbo" clusters instead of assuming everything has to be
in 2k clusters? i started on this with the diff below, but it broke ospfd
and i never got back to it. if you get it working, it would be interesting
to test creating even bigger cluster pools, eg, a 1M or 4M mbuf cluster.

cheers,
dlg

Index: uipc_socket.c
===================================================================
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.135
diff -u -p -r1.135 uipc_socket.c
--- uipc_socket.c	11 Dec 2014 19:21:57 -0000	1.135
+++ uipc_socket.c	22 Dec 2014 01:11:03 -0000
@@ -493,15 +493,18 @@ restart:
 				mlen = MLEN;
 			}
 			if (resid >= MINCLSIZE && space >= MCLBYTES) {
-				MCLGET(m, M_NOWAIT);
+				MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+				    lmin(space, MAXMCLBYTES)));
 				if ((m->m_flags & M_EXT) == 0)
 					goto nopages;
 				if (atomic && top == 0) {
-					len = lmin(MCLBYTES - max_hdr,
-					    resid);
+					len = lmin(resid,
+					    m->m_ext.ext_size -
+					    max_hdr);
 					m->m_data += max_hdr;
 				} else
-					len = lmin(MCLBYTES, resid);
+					len = lmin(resid,
+					    m->m_ext.ext_size);