Re: Increasing TCP TSO size support
On Fri, Feb 2, 2024 at 6:20 PM Drew Gallatin wrote:
>
> On Fri, Feb 2, 2024, at 9:05 PM, Rick Macklem wrote:
> > > But the page size is only 4K on most platforms. So while an M_EXTPGS
> > > mbuf can hold 5 pages (..from memory, too lazy to do the math right
> > > now) and reduces socket buffer mbuf chain lengths by a factor of 10
> > > or so (2k vs 20k per mbuf), the S/G list that a NIC will need to
> > > consume would likely decrease only by a factor of 2. And even then
> > > only if the busdma code to map mbufs for DMA is not coalescing
> > > adjacent mbufs. I know busdma does some coalescing, but I can't
> > > recall if it coalesces physically adjacent mbufs.
> >
> > I'm guessing the factor of 2 comes from the fact that each page is a
> > contiguous segment?
>
> Actually, no, I'm being dumb. I was thinking that pages would be split
> up, but that's wrong. Without M_EXTPGS, each mbuf generated by sendfile
> (or nfs) would be an M_EXT with a wrapper around a single 4K page. So
> the scatter/gather list would be exactly the same.
>
> The win would be if the pages themselves were contiguous (which they
> often are), and if the bus_dma mbuf mapping code coalesced those
> segments, and if the device could handle DMA across a 4K boundary.
> That's what would get you shorter s/g lists.
>
> I think tcp_m_copy() can handle this now, as if_hw_tsomaxsegsize is set
> by the driver to express the maximum contiguous segment length it can
> handle.

Sounds good. I'll give it a try someday soon (April maybe).

Thanks for all the good info, rick

> BTW, I really hate the mixing of bus dma restrictions with the
> hw_tsomax stuff. It always makes my head explode..
>
> Drew
Re: Increasing TCP TSO size support
On Fri, Feb 2, 2024, at 9:05 PM, Rick Macklem wrote:
> > But the page size is only 4K on most platforms. So while an M_EXTPGS
> > mbuf can hold 5 pages (..from memory, too lazy to do the math right
> > now) and reduces socket buffer mbuf chain lengths by a factor of 10 or
> > so (2k vs 20k per mbuf), the S/G list that a NIC will need to consume
> > would likely decrease only by a factor of 2. And even then only if the
> > busdma code to map mbufs for DMA is not coalescing adjacent mbufs. I
> > know busdma does some coalescing, but I can't recall if it coalesces
> > physically adjacent mbufs.
>
> I'm guessing the factor of 2 comes from the fact that each page is a
> contiguous segment?

Actually, no, I'm being dumb. I was thinking that pages would be split up,
but that's wrong. Without M_EXTPGS, each mbuf generated by sendfile (or
nfs) would be an M_EXT with a wrapper around a single 4K page. So the
scatter/gather list would be exactly the same.

The win would be if the pages themselves were contiguous (which they often
are), and if the bus_dma mbuf mapping code coalesced those segments, and
if the device could handle DMA across a 4K boundary. That's what would get
you shorter s/g lists.

I think tcp_m_copy() can handle this now, as if_hw_tsomaxsegsize is set by
the driver to express the maximum contiguous segment length it can handle.

BTW, I really hate the mixing of bus dma restrictions with the hw_tsomax
stuff. It always makes my head explode..

Drew
Re: Increasing TCP TSO size support
On Fri, Feb 2, 2024 at 4:48 PM Drew Gallatin wrote:
>
> On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
> > A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte
> > NFS write request or read reply will result in a 514 element mbuf
> > chain. Each of these (mostly 2K mbuf clusters) are non-contiguous data
> > segments. (I suspect most NICs do not handle this many segments well,
> > if at all.)
>
> Excellent point
>
> > The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for
> > the ktls), but I do not know what it would take to make these work for
> > non-KTLS TSO?
>
> Sendfile already uses M_EXTPG mbufs... When I was initially doing
> M_EXTPG stuff for kTLS, I added support for using M_EXTPG mbufs in
> sendfile regardless of whether or not kTLS was in use. That reduced CPU
> use marginally on 64-bit platforms (due to reducing socket buffer
> lengths, and hence reducing pointer chasing), and quite a bit more on
> 32-bit platforms (due to also not needing to map memory into the kernel
> map, and by reducing pointer chasing even more, as more pages fit into
> an M_EXTPG mbuf when a paddr_t is 32-bits).
>
> > I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> > Does it assume each M_EXTPG mbuf is one contiguous data segment?
>
> No, it's fully aware of how to handle M_EXTPG mbufs. Look at
> tcp_m_copy(). We added code in the segment counting part of that
> function to count the hdr/trailer parts of an M_EXTPG mbuf, and to deal
> with the start/end page being misaligned.
>
> > I do see that ip_output() will call mb_unmapped_to_ext() when the NIC
> > does not have IFCAP_MEXTPG set.
> > (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it
> > can become a single contiguous data segment for TSO or ???)
>
> No, it just means that a NIC driver has been verified to not call mtod()
> on an M_EXTPGS mbuf and dereference the resulting data pointer (which
> would make it go "boom").
>
> But the page size is only 4K on most platforms. So while an M_EXTPGS
> mbuf can hold 5 pages (..from memory, too lazy to do the math right now)
> and reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k
> vs 20k per mbuf), the S/G list that a NIC will need to consume would
> likely decrease only by a factor of 2. And even then only if the busdma
> code to map mbufs for DMA is not coalescing adjacent mbufs. I know
> busdma does some coalescing, but I can't recall if it coalesces
> physically adjacent mbufs.

I'm guessing the factor of 2 comes from the fact that each page is a
contiguous segment?

The NFS code could easily use 5 contiguous pages, so maybe it would be
worthwhile to try to make some NIC drivers capable of handling contiguous
pages as one segment for TSO output? (It means that tcp_output() would
need to know this case was possible. Maybe a new if_hw_tsoXX that covers
the max number of segments if pages are contig?)

However, given your previous post, it might not matter much, since the
larger TSO segment might not make much difference?

> > If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext()
> > being called) were to all work ok for M_EXTPG mbufs, it would be easy
> > to enable that for NFS (non-TLS case).
>
> It does. You should enable it for at least TCP.

Good work!! I will try it someday relatively soon. Even if it only reduces
the use of mbuf clusters, that sounds like it would be worthwhile.

rick

> Drew
Re: Increasing TCP TSO size support
On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
> A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte
> NFS write request or read reply will result in a 514 element mbuf chain.
> Each of these (mostly 2K mbuf clusters) are non-contiguous data
> segments. (I suspect most NICs do not handle this many segments well, if
> at all.)

Excellent point

> The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for
> the ktls), but I do not know what it would take to make these work for
> non-KTLS TSO?

Sendfile already uses M_EXTPG mbufs... When I was initially doing M_EXTPG
stuff for kTLS, I added support for using M_EXTPG mbufs in sendfile
regardless of whether or not kTLS was in use. That reduced CPU use
marginally on 64-bit platforms (due to reducing socket buffer lengths, and
hence reducing pointer chasing), and quite a bit more on 32-bit platforms
(due to also not needing to map memory into the kernel map, and by
reducing pointer chasing even more, as more pages fit into an M_EXTPG mbuf
when a paddr_t is 32-bits).

> I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> Does it assume each M_EXTPG mbuf is one contiguous data segment?

No, it's fully aware of how to handle M_EXTPG mbufs. Look at tcp_m_copy().
We added code in the segment counting part of that function to count the
hdr/trailer parts of an M_EXTPG mbuf, and to deal with the start/end page
being misaligned.

> I do see that ip_output() will call mb_unmapped_to_ext() when the NIC
> does not have IFCAP_MEXTPG set.
> (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it
> can become a single contiguous data segment for TSO or ???)

No, it just means that a NIC driver has been verified to not call mtod()
on an M_EXTPGS mbuf and dereference the resulting data pointer (which
would make it go "boom").

But the page size is only 4K on most platforms. So while an M_EXTPGS mbuf
can hold 5 pages (..from memory, too lazy to do the math right now) and
reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k vs
20k per mbuf), the S/G list that a NIC will need to consume would likely
decrease only by a factor of 2. And even then only if the busdma code to
map mbufs for DMA is not coalescing adjacent mbufs. I know busdma does
some coalescing, but I can't recall if it coalesces physically adjacent
mbufs.

> If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being
> called) were to all work ok for M_EXTPG mbufs, it would be easy to
> enable that for NFS (non-TLS case).

It does. You should enable it for at least TCP.

Drew
Re: Increasing TCP TSO size support
On Fri, Feb 2, 2024 at 1:21 AM Scheffenegger, Richard wrote:
>
> Hi,
>
> We have run a test for a RPC workload with 1MB IO sizes, and collected
> the tcp_default_output() len(gth) during the first pass in the output
> loop.
>
> In such a scenario, where the application frequently introduces small
> pauses (since the next large IO is only sent after the corresponding
> request from the client has been received and processed) between sending
> additional data, the current TSO limit of 64kB TSO maximum (45*1448 in
> effect) requires multiple passes in the output routine to send all the
> allowable (cwnd limited) data.
>
> I'll try to get a data collection with better granularity above 90 000
> bytes - but even here the average strongly indicates that a majority of
> transmission opportunities are in the 512 kB area - probably also having
> to do with LRO and ACK thinning effects by the client.
>
> In other words, the tcp output has to run about 9 times with TSO, to
> transmit all eligible data - increasing the FreeBSD supported maximum
> TSO size to what current hardware could handle (256kB..1MB) would reduce
> the CPU burden here.
>
> Is increasing the software supported TSO size to allow for what the NICs
> could nowadays do something anyone apart from us would be interested in
> (in particular, those who work with the drivers)?

Reposted after joining freebsd-net@...

A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte NFS
write request or read reply will result in a 514 element mbuf chain. Each
of these (mostly 2K mbuf clusters) are non-contiguous data segments. (I
suspect most NICs do not handle this many segments well, if at all.)

The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for the
ktls), but I do not know what it would take to make these work for
non-KTLS TSO?

I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
Does it assume each M_EXTPG mbuf is one contiguous data segment?

I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does
not have IFCAP_MEXTPG set.
(If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it can
become a single contiguous data segment for TSO or ???)

If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being
called) were to all work ok for M_EXTPG mbufs, it would be easy to enable
that for NFS (non-TLS case).

I do not want to hijack this thread, but do others know how TSO interacts
with M_EXTPG mbufs?

rick

> Best regards,
>
> Richard
>
> tso size (transmissions < 1448 would not be accounted here at all)
>
> #          count
> <1000          0
> <2000         23
> <3000        111
> <4000         40
> <5000         30
> <7000         14
> <8000        134
> <9000        442
> <10000      9396
> <20000     46227
> <30000     25646
> <40000     33060
> <60000     23162
> <70000     24368
> <80000     19772
> <90000     40101
> >=90000 75384169
>
> Average: 578844.44
Re: Increasing TCP TSO size support
What is the link speed that you're working with?

A long time ago, when I worked for a now-defunct 10GbE NIC vendor, I
experimented with the benefits of TSO as we varied the max TSO size. I
cannot recall the platform (it could have been OSX, Solaris, FreeBSD or
Linux). At the time (~2006?) the CPU saving benefits of increasing the max
TSO size from 8k to 64k was fairly minimal. In fact, I seem to recall that
there was almost no benefit to TSO sizes larger than 16K.

I was wondering if you see any difference in your benchmark if you cap max
TSO size to 8k, 16k, 32k, and the default of 64k. Any change in CPU use,
or in your benchmark's performance would be interesting to hear about.
Naively, I'd expect the benchmark performance to remain unchanged until
you'd reduced the TSO size so much as to make the host slower than the
wire, thereby inserting gaps between TSOs. That would be reflected in the
CPU use as well..

Drew

On Fri, Feb 2, 2024, at 4:21 AM, Scheffenegger, Richard wrote:
>
> Hi,
>
> We have run a test for a RPC workload with 1MB IO sizes, and collected
> the tcp_default_output() len(gth) during the first pass in the output
> loop.
>
> In such a scenario, where the application frequently introduces small
> pauses (since the next large IO is only sent after the corresponding
> request from the client has been received and processed) between sending
> additional data, the current TSO limit of 64kB TSO maximum (45*1448 in
> effect) requires multiple passes in the output routine to send all the
> allowable (cwnd limited) data.
>
> I'll try to get a data collection with better granularity above 90 000
> bytes - but even here the average strongly indicates that a majority of
> transmission opportunities are in the 512 kB area - probably also having
> to do with LRO and ACK thinning effects by the client.
>
> In other words, the tcp output has to run about 9 times with TSO, to
> transmit all eligible data - increasing the FreeBSD supported maximum
> TSO size to what current hardware could handle (256kB..1MB) would reduce
> the CPU burden here.
>
> Is increasing the software supported TSO size to allow for what the NICs
> could nowadays do something anyone apart from us would be interested in
> (in particular, those who work with the drivers)?
>
> Best regards,
>
> Richard
>
> tso size (transmissions < 1448 would not be accounted here at all)
>
> #          count
> <1000          0
> <2000         23
> <3000        111
> <4000         40
> <5000         30
> <7000         14
> <8000        134
> <9000        442
> <10000      9396
> <20000     46227
> <30000     25646
> <40000     33060
> <60000     23162
> <70000     24368
> <80000     19772
> <90000     40101
> >=90000 75384169
>
> Average: 578844.44