Re: Increasing TCP TSO size support

2024-02-02 Thread Rick Macklem
On Fri, Feb 2, 2024 at 6:20 PM Drew Gallatin  wrote:
>
>
>
> On Fri, Feb 2, 2024, at 9:05 PM, Rick Macklem wrote:
>
> > But the page size is only 4K on most platforms.  So while an M_EXTPGS mbuf 
> > can hold 5 pages (..from memory, too lazy to do the math right now) and 
> > reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k vs 20k 
> > per mbuf), the S/G list that a NIC will need to consume would likely 
> > decrease only by a factor of 2.  And even then only if the busdma code to 
> > map mbufs for DMA is not coalescing adjacent mbufs.  I know busdma does 
> some coalescing, but I can't recall if it coalesces physically adjacent 
> > mbufs.
>
> I'm guessing the factor of 2 comes from the fact that each page is a
> contiguous segment?
>
>
> Actually, no, I'm being dumb.  I was thinking that pages would be split up, 
> but that's wrong.  Without M_EXTPGS, each mbuf generated by sendfile (or nfs) 
> would be an M_EXT with a wrapper around a single 4K page.  So the 
> scatter/gather list would be exactly the same.
>
> The win would be if the pages themselves were contiguous (which they often 
> are), and if the bus_dma mbuf mapping code coalesced those segments, and if 
> the device could handle DMA across a 4K boundary.  That's what would get you 
> shorter s/g lists.
>
> I think tcp_m_copy() can handle this now, as if_hw_tsomaxsegsize is set by 
> the driver to express the maximum contiguous segment length it can handle.
Sounds good. I'll give it a try someday soon (April maybe).

Thanks for all the good info, rick

>
> BTW, I really hate the mixing of bus dma restrictions with the hw_tsomax 
> stuff.  It always makes my head explode..
>
> Drew
>



Re: Increasing TCP TSO size support

2024-02-02 Thread Drew Gallatin


On Fri, Feb 2, 2024, at 9:05 PM, Rick Macklem wrote:
> > But the page size is only 4K on most platforms.  So while an M_EXTPGS mbuf 
> > can hold 5 pages (..from memory, too lazy to do the math right now) and 
> > reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k vs 20k 
> > per mbuf), the S/G list that a NIC will need to consume would likely 
> > decrease only by a factor of 2.  And even then only if the busdma code to 
> > map mbufs for DMA is not coalescing adjacent mbufs.  I know busdma does 
> > some coalescing, but I can't recall if it coalesces physically adjacent 
> > mbufs.
> 
> I'm guessing the factor of 2 comes from the fact that each page is a
> contiguous segment?

Actually, no, I'm being dumb.  I was thinking that pages would be split up, but 
that's wrong.  Without M_EXTPGS, each mbuf generated by sendfile (or nfs) would 
be an M_EXT with a wrapper around a single 4K page.  So the scatter/gather list 
would be exactly the same.

The win would be if the pages themselves were contiguous (which they often 
are), and if the bus_dma mbuf mapping code coalesced those segments, and if the 
device could handle DMA across a 4K boundary.  That's what would get you 
shorter s/g lists.
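
A tiny userland sketch of just that adjacency check (it is not the real busdma
code, only a model of why coalescing physically contiguous pages shortens the
list):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Count S/G entries for a run of 4K pages, optionally coalescing
 * physically adjacent ones into a single segment. */
static int
sg_entries(const uint64_t *pa, int npages, int coalesce)
{
	int segs = 0;

	for (int i = 0; i < npages; i++)
		if (i == 0 || !coalesce || pa[i] != pa[i - 1] + PAGE_SIZE)
			segs++;		/* start a new segment */
	return (segs);
}

int
main(void)
{
	/* Five physically contiguous pages, a common sendfile/NFS case. */
	uint64_t contig[] = { 0x10000, 0x11000, 0x12000, 0x13000, 0x14000 };

	printf("without coalescing: %d segments\n", sg_entries(contig, 5, 0));
	printf("with coalescing:    %d segments\n", sg_entries(contig, 5, 1));
	return (0);
}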

I think tcp_m_copy() can handle this now, as if_hw_tsomaxsegsize is set by the 
driver to express the maximum contiguous segment length it can handle.
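
To make that concrete, here is a small userland model (not the kernel's actual
tcp_output() logic) of how the two per-driver limits discussed in this thread,
if_hw_tsomaxsegcount and if_hw_tsomaxsegsize, bound a single TSO burst; the
numeric limits below are made up:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical limits a driver might advertise; values are invented. */
#define HW_TSOMAXSEGCOUNT 32	/* max S/G entries per TSO burst */
#define HW_TSOMAXSEGSIZE  65536	/* max bytes in one contiguous segment */

static bool
tso_burst_fits(const int *seglen, int nsegs)
{
	if (nsegs > HW_TSOMAXSEGCOUNT)
		return (false);
	for (int i = 0; i < nsegs; i++) {
		if (seglen[i] > HW_TSOMAXSEGSIZE)
			return (false);
	}
	return (true);
}

int
main(void)
{
	int pages[16];
	int one[1] = { 65536 };

	/* A 64K burst built from 16 discontiguous 4K pages fits easily. */
	for (int i = 0; i < 16; i++)
		pages[i] = 4096;
	printf("16 x 4K segments fit: %s\n",
	    tso_burst_fits(pages, 16) ? "yes" : "no");

	/* One fully coalesced 64K segment also fits, as a single entry. */
	printf("1 x 64K segment fits:  %s\n",
	    tso_burst_fits(one, 1) ? "yes" : "no");
	return (0);
}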

BTW, I really hate the mixing of bus dma restrictions with the hw_tsomax stuff. 
 It always makes my head explode..

Drew


Re: Increasing TCP TSO size support

2024-02-02 Thread Rick Macklem
On Fri, Feb 2, 2024 at 4:48 PM Drew Gallatin  wrote:
>
>
>
> On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
>
>  A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte NFS 
> write request
> or read reply will result in a 514 element mbuf chain. Each of these (mostly 
> 2K mbuf clusters)
> are non-contiguous data segments. (I suspect most NICs do not handle this 
> many segments well,
> if at all.)
>
>
> Excellent point
>
>
> The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for the 
> ktls), but I do not
> know what it would take to make these work for non-KTLS TSO?
>
>
>
> Sendfile already uses M_EXTPG mbufs... When I was initially doing M_EXTPG 
> stuff for kTLS, I added support for using M_EXTPG mbufs in sendfile 
> regardless of whether or not kTLS was in use.  That reduced CPU use 
> marginally on 64-bit platforms (due to reducing socket buffer lengths, and 
> hence reducing pointer chasing), and quite a bit more on 32-bit platforms 
> (due to also not needing to map memory into the kernel map, and by reducing 
> pointer chasing even more, as more pages fit into an M_EXTPG mbuf when a 
> paddr_t is 32 bits).
>
>
> I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> Does it assume each M_EXTPG mbuf is one contiguous data segment?
>
>
> No, it's fully aware of how to handle M_EXTPG mbufs.  Look at tcp_m_copy().  We 
> added code in the segment counting part of that function to count the 
> hdr/trailer parts of an M_EXTPG mbuf, and to deal with the start/end page 
> being misaligned.
>
> I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does 
> not have IFCAP_MEXTPG set.
> (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it can 
> become
> a single contiguous data segment for TSO or ???)
>
>
> No, it just means that a NIC driver has been verified to not call mtod() on an 
> M_EXTPGS mbuf and dereference the resulting data pointer (which would make it 
> go "boom").
>
> But the page size is only 4K on most platforms.  So while an M_EXTPGS mbuf 
> can hold 5 pages (..from memory, too lazy to do the math right now) and 
> reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k vs 20k 
> per mbuf), the S/G list that a NIC will need to consume would likely decrease 
> only by a factor of 2.  And even then only if the busdma code to map mbufs 
> for DMA is not coalescing adjacent mbufs.  I know busdma does some 
> coalescing, but I can't recall if it coalesces physically adjacent mbufs.

I'm guessing the factor of 2 comes from the fact that each page is a
contiguous segment?

The NFS code could easily use 5 contiguous pages, so maybe it would be
worthwhile to try and make some NIC drivers capable of handling contiguous
pages as one segment for TSO output? (It means that tcp_output() would need
to know this case was possible. Maybe a new if_hw_tsoXX that covers the max
number of segments if pages are contig?)

However, given your previous post, it might not matter much, since the larger
TSO segment might not make much difference?

>
> If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being 
> called) were to
> all work ok for M_EXTPG mbufs, it would be easy to enable that for NFS 
> (non-TLS case).
>
>
>
> It does.  You should enable it for at least TCP.
Good work!!

I will try it someday relatively soon. Even if it only reduces the use of
mbuf clusters, that sounds like it would be worthwhile.

rick
>
> Drew



Re: Increasing TCP TSO size support

2024-02-02 Thread Drew Gallatin


On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
>  A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte NFS 
> write request
> or read reply will result in a 514 element mbuf chain. Each of these (mostly 
> 2K mbuf clusters)
> are non-contiguous data segments. (I suspect most NICs do not handle this 
> many segments well,
> if at all.)

Excellent point

> 
> The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for the 
> ktls), but I do not
> know what it would take to make these work for non-KTLS TSO?


Sendfile already uses M_EXTPG mbufs... When I was initially doing M_EXTPG stuff 
for kTLS, I added support for using M_EXTPG mbufs in sendfile regardless of 
whether or not kTLS was in use.  That reduced CPU use marginally on 64-bit 
platforms (due to reducing socket buffer lengths, and hence reducing pointer 
chasing), and quite a bit more on 32-bit platforms (due to also not needing to 
map memory into the kernel map, and by reducing pointer chasing even more, as 
more pages fit into an M_EXTPG mbuf when a paddr_t is 32 bits).
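
Purely illustrative arithmetic for that last point: the page-address array of
an M_EXTPG mbuf lives in the mbuf's own fixed-size storage, so a 32-bit
physical address type fits roughly twice as many pages as a 64-bit one.  The
40 bytes of room below is made up (the real limit is a compile-time constant
in sys/mbuf.h), chosen so the 64-bit case lines up with the "5 pages (from
memory)" figure further down.

#include <stdio.h>

#define EMBEDDED_ROOM 40	/* made-up room for the page-address array */

int
main(void)
{
	printf("64-bit paddr_t: %d pages per M_EXTPG mbuf\n", EMBEDDED_ROOM / 8);
	printf("32-bit paddr_t: %d pages per M_EXTPG mbuf\n", EMBEDDED_ROOM / 4);
	return (0);
}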


> I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> Does it assume each M_EXTPG mbuf is one contiguous data segment?

No, it's fully aware of how to handle M_EXTPG mbufs.  Look at tcp_m_copy().  We 
added code in the segment counting part of that function to count the 
hdr/trailer parts of an M_EXTPG mbuf, and to deal with the start/end page being 
misaligned.

> I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does 
> not have IFCAP_MEXTPG set.
> (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it can 
> become
> a single contiguous data segment for TSO or ???)

No, it just means that a NIC driver has been verified to not call mtod() on an 
M_EXTPGS mbuf and dereference the resulting data pointer (which would make it 
go "boom").

But the page size is only 4K on most platforms.  So while an M_EXTPGS mbuf can 
hold 5 pages (..from memory, too lazy to do the math right now) and reduces 
socket buffer mbuf chain lengths by a factor of 10 or so (2k vs 20k per mbuf), 
the S/G list that a NIC will need to consume would likely decrease only by a 
factor of 2.  And even then only if the busdma code to map mbufs for DMA is not 
coalescing adjacent mbufs.  I know busdma does some coalescing, but I can't 
recall if it coalesces physically adjacent mbufs.
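
Back-of-the-envelope arithmetic for those two ratios (illustrative only, and
ignoring any coalescing of physically adjacent pages):

#include <stdio.h>

int
main(void)
{
	int bytes = 1024 * 1024;	/* 1MB of socket buffer data */
	int cluster = 2 * 1024;		/* plain 2K mbuf cluster */
	int page = 4 * 1024;		/* each 4K page is one S/G entry */
	int extpg = 5 * page;		/* ~20K per M_EXTPGS mbuf (5 pages) */

	printf("2K clusters: %4d mbufs, %3d S/G entries\n",
	    bytes / cluster, bytes / cluster);
	printf("M_EXTPGS:    %4d mbufs, %3d S/G entries\n",
	    bytes / extpg, bytes / page);
	return (0);
}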

> If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being 
> called) were to
> all work ok for M_EXTPG mbufs, it would be easy to enable that for NFS 
> (non-TLS case).


It does.  You should enable it for at least TCP.

Drew

Re: Increasing TCP TSO size support

2024-02-02 Thread Rick Macklem
On Fri, Feb 2, 2024 at 1:21 AM Scheffenegger, Richard wrote:

>
> Hi,
>
> We have run a test for a RPC workload with 1MB IO sizes, and collected the
> tcp_default_output() len(gth) during the first pass in the output loop.
>
> In such a scenario, where the application frequently introduces small
> pauses (since the next large IO is only sent after the corresponding
> request from the client has been received and processed) between sending
> additional data, the current TSO limit of 64kB TSO maximum (45*1448 in
> effect) requires multiple passes in the output routine to send all the
> allowable (cwnd limited) data.
>
> I'll try to get a data collection with better granularity above 90 000
> bytes - but even here the average strongly indicates that a majority of
> transmission opportunities are in the 512 kB area - probably also having to
> do with LRO and ACK thinning effects by the client.
>
> In other words, the tcp output has to run about 9 times with TSO, to
> transmit all eligible data - increasing the FreeBSD supported maximum TSO
> size to what current hardware could handle (256kB..1MB) would reduce the
> CPU burden here.
>
>
> Is increasing the software supported TSO size to allow for what the NICs
> could nowadays do something anyone apart from us would be interested in (in
> particular, those who work with the drivers)?
>
Reposted after joining freebsd-net@...

A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte NFS
write request or read reply will result in a 514 element mbuf chain. Each of
these (mostly 2K mbuf clusters) are non-contiguous data segments. (I suspect
most NICs do not handle this many segments well, if at all.)
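
A quick back-of-the-envelope check of that chain length (the +2 header mbufs
below are an assumption chosen to match the 514 figure):

#include <stdio.h>

int
main(void)
{
	int io = 1024 * 1024;	/* 1Mbyte NFS write request or read reply */
	int cluster = 2 * 1024;	/* typical 2K mbuf cluster */

	/* 512 data clusters plus ~2 mbufs of RPC/NFS header. */
	printf("~%d element mbuf chain\n", io / cluster + 2);
	return (0);
}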

The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for the
ktls), but I do not know what it would take to make these work for non-KTLS
TSO? I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
Does it assume each M_EXTPG mbuf is one contiguous data segment?
I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does
not have IFCAP_MEXTPG set. (If IFCAP_MEXTPG is set, do the pages need to be
contiguous so that it can become a single contiguous data segment for TSO,
or ???)

If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being
called) were to all work ok for M_EXTPG mbufs, it would be easy to enable
that for NFS (non-TLS case).

I do not want to hijack this thread, but do others know how TSO interacts
with M_EXTPG mbufs?

rick


> Best regards,
>
>   Richard
>
>
>
>
> tso size (transmissions < 1448 would not be accounted here at all)
>
> # count
>
> <1000 0
> <2000 23
> <3000 111
> <4000 40
> <5000 30
> <7000 14
> <8000 134
> <9000 442
> <10000 9396
> <20000 46227
> <30000 25646
> <40000 33060
> <60000 23162
> <70000 24368
> <80000 19772
> <90000 40101
> >=90000 75384169
> Average: 578844.44
>


Re: Increasing TCP TSO size support

2024-02-02 Thread Drew Gallatin
What is the link speed that you're working with?

A long time ago, when I worked for a now-defunct 10GbE NIC vendor, I 
experimented with the benefits of TSO as we varied the max TSO size.  I cannot 
recall the platform (it could have been OSX, Solaris, FreeBSD or Linux).  At 
the time (~2006?) the CPU saving benefits of increasing the max TSO size from 
8k to 64k were fairly minimal.  In fact, I seem to recall that there was 
almost no benefit to TSO sizes larger than 16K.

I was wondering if you see any difference in your benchmark if you cap max TSO 
size to 8k, 16k, 32k, and the default of 64k.  Any change in CPU use, or in 
your benchmark's performance, would be interesting to hear about.
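
For what it's worth, here is a toy calculation (not tcp_output() itself) of
how many output passes it takes to drain the ~512kB of eligible data quoted
further down, at each of those caps plus a hypothetical 256kB one:

#include <stdio.h>

int
main(void)
{
	int eligible = 512 * 1024;	/* ~512kB, per the observation below */
	int caps[] = {
		8 * 1024, 16 * 1024, 32 * 1024,
		45 * 1448,		/* current effective ~64kB default */
		256 * 1024		/* hypothetical larger cap */
	};

	for (int i = 0; i < (int)(sizeof(caps) / sizeof(caps[0])); i++)
		printf("cap %6d bytes -> %d output passes\n",
		    caps[i], (eligible + caps[i] - 1) / caps[i]);
	return (0);
}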

Naively, I'd expect the benchmark performance to remain unchanged until you'd 
reduced the TSO size so much as to make the host slower than the wire, thereby 
inserting gaps between TSOs.  That would be reflected in the CPU use as well..

Drew

On Fri, Feb 2, 2024, at 4:21 AM, Scheffenegger, Richard wrote:
> 
> 
> Hi,
> 
> We have run a test for a RPC workload with 1MB IO sizes, and collected the 
> tcp_default_output() len(gth) during the first pass in the output loop.
> 
> In such a scenario, where the application frequently introduces small pauses 
> (since the next large IO is only sent after the corresponding request from 
> the client has been received and processed) between sending additional data, 
> the current TSO limit of 64kB TSO maximum (45*1448 in effect) requires 
> multiple passes in the output routine to send all the allowable (cwnd 
> limited) data.
> 
> I'll try to get a data collection with better granularity above 90 000 bytes - 
> but even here the average strongly indicates that a majority of transmission 
> opportunities are in the 512 kB area - probably also having to do with LRO 
> and ACK thinning effects by the client.
> 
> In other words, the tcp output has to run about 9 times with TSO, to 
> transmit all eligible data - increasing the FreeBSD supported maximum TSO 
> size to what current hardware could handle (256kB..1MB) would reduce the CPU 
> burden here.
> 
> 
> 
> Is increasing the software supported TSO size to allow for what the NICs could 
> nowadays do something anyone apart from us would be interested in (in 
> particular, those who work with the drivers)?
> 
> 
> 
> Best regards,
> 
>   Richard
> 
> 
> 
> 
> 
> 
> 
> tso size (transmissions < 1448 would not be accounted here at all)
> 
> #         count
> 
> <1000     0
> <2000     23
> <3000     111
> <4000     40
> <5000     30
> <7000     14
> <8000     134
> <9000     442
> <10000    9396
> <20000    46227
> <30000    25646
> <40000    33060
> <60000    23162
> <70000    24368
> <80000    19772
> <90000    40101
> >=90000   75384169
> Average:  578844.44
> 