On 10/26/21 8:45 PM, Bob McMahon wrote:
> This is linux. The code flow is burst writes until the burst size, take a 
> timestamp, call select(), take second timestamp and insert time delta into 
> histogram, await clock_nanosleep() to schedule the next burst. (actually, the 
> deltas, inserts into the histogram and user i/o are done in another thread, 
> i.e. iperf 2's reporter thread.)
> 
> I still must be missing something.  Does anything else need to be set to 
> reduce the skb size? Everything seems to be indicating 4K writes even when 
> gso_max_size is 2000 (I assume these are units of bytes?) There are ten 
> writes, ten reads and ten  RTTs for the bursts.  I don't see partial writes 
> at the app level. 
> 
>     [root@localhost iperf2-code]# ip link set dev eth1 gso_max_size 2000

You could check with tcpdump on eth1, that outgoing packets are no longer 
'TSO/GSO', but single MSS ones.

(Note: this device gso_max_size is only taken into account for flows 
established after the change)

> 
>     [root@localhost iperf2-code]# ip -d link sh dev eth1
>     9: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state 
> UNKNOWN mode DEFAULT group default qlen 1000
>         link/ether 00:90:4c:40:04:59 brd ff:ff:ff:ff:ff:ff promiscuity 0 
> minmtu 68 maxmtu 1500 addrgenmode eui64 numtxqueues 1 numrxqueues 1 
> gso_max_size 2000 gso_max_segs 65535
>     [root@localhost iperf2-code]# uname -r
>     5.0.9-301.fc30.x86_64
> 
> 
> It looks like RTT is being driven by WiFi TXOPs as doubling the write size 
> increases the aggregation by two but has no significant effect on the RTTs.
> 
>     4K writes: tot_mpdus 328 tot_ampdus 209 mpduperampdu 2
> 
> 
>     8k writes:  tot_mpdus 317 tot_ampdus 107 mpduperampdu 3
> 
> 
> [root@localhost iperf2-code]# src/iperf -c 192.168.1.1%eth1 --trip-times -i 1 
> -e --tcp-write-prefetch 4 -l 4K --burst-size=40K --histograms
> WARN: option of --burst-size without --burst-period defaults --burst-period 
> to 1 second
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001 with pid 5145 via eth1 (1 
> flows)
> Write buffer size: 4096 Byte
> Bursting: 40.0 KByte every 1.00 seconds
> TCP window size: 85.0 KByte (default)
> Event based writes (pending queue watermark at 4 bytes)
> Enabled select histograms bin-width=0.100 ms, bins=10000
> ------------------------------------------------------------
> [  1] local 192.168.1.4%eth1 port 45680 connected with 192.168.1.1 port 5001 
> (MSS=1448) (prefetch=4) (trip-times) (sock=3) (ct=5.30 ms) on 2021-10-26 
> 20:25:29 (PDT)
> [ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry     
> Cwnd/RTT        NetPwr
> [  1] 0.00-1.00 sec  40.1 KBytes   329 Kbits/sec  11/0          0       
> 14K/10091 us  4
> [  1] 0.00-1.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:1,36:1,40:1,44:1,46:1,48:1,49:1,50:2,52:1 
> (5.00/95.00/99.7%=1/52/52,Outliers=0,obl/obu=0/0) (5.121 ms/1635305129.152339)
> [  1] 1.00-2.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/4990 us  8
> [  1] 1.00-2.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,39:1,45:1,49:5,50:1 
> (5.00/95.00/99.7%=1/50/50,Outliers=0,obl/obu=0/0) (4.991 ms/1635305130.153330)
> [  1] 2.00-3.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/4904 us  8
> [  1] 2.00-3.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,29:1,49:4,50:1,59:1,75:1 
> (5.00/95.00/99.7%=1/75/75,Outliers=0,obl/obu=0/0) (7.455 ms/1635305131.147353)
> [  1] 3.00-4.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/4964 us  8
> [  1] 3.00-4.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,49:4,50:2,59:1,65:1 
> (5.00/95.00/99.7%=1/65/65,Outliers=0,obl/obu=0/0) (6.460 ms/1635305132.146338)
> [  1] 4.00-5.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/4970 us  8
> [  1] 4.00-5.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,49:6,59:1,65:1 
> (5.00/95.00/99.7%=1/65/65,Outliers=0,obl/obu=0/0) (6.404 ms/1635305133.146335)
> [  1] 5.00-6.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/4986 us  8
> [  1] 5.00-6.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,48:1,49:1,50:4,59:1,64:1 
> (5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.395 ms/1635305134.146343)
> [  1] 6.00-7.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/5059 us  8
> [  1] 6.00-7.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,39:1,49:3,50:2,60:1,85:1 
> (5.00/95.00/99.7%=1/85/85,Outliers=0,obl/obu=0/0) (8.417 ms/1635305135.148343)
> [  1] 7.00-8.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/5407 us  8
> [  1] 7.00-8.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,40:1,49:4,50:1,59:1,75:1 
> (5.00/95.00/99.7%=1/75/75,Outliers=0,obl/obu=0/0) (7.428 ms/1635305136.147343)
> [  1] 8.00-9.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/5188 us  8
> [  1] 8.00-9.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,40:1,49:3,50:3,64:1 
> (5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.388 ms/1635305137.146284)
> [  1] 9.00-10.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0       
> 14K/5306 us  8
> [  1] 9.00-10.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,39:1,49:2,50:2,51:1,60:1,65:1 
> (5.00/95.00/99.7%=1/65/65,Outliers=0,obl/obu=0/0) (6.422 ms/1635305138.146316)
> [  1] 0.00-10.01 sec   400 KBytes   327 Kbits/sec  102/0          0       
> 14K/5939 us  7
> [  1] 0.00-10.01 sec S8(f)-PDF: 
> bin(w=100us):cnt(100)=1:19,29:1,36:1,39:3,40:3,44:1,45:1,46:1,48:2,49:33,50:18,51:1,52:1,59:5,60:2,64:2,65:3,75:2,85:1
>  (5.00/95.00/99.7%=1/65/85,Outliers=0,obl/obu=0/0) (8.417 
> ms/1635305135.148343)
> 
> [root@localhost iperf2-code]# src/iperf -s -i 1 -e -B 192.168.1.1%eth1
> ------------------------------------------------------------
> Server listening on TCP port 5001 with pid 6287
> Binding to local address 192.168.1.1 and iface eth1
> Read buffer size:  128 KByte (Dist bin width=16.0 KByte)
> TCP window size:  128 KByte (default)
> ------------------------------------------------------------
> [  1] local 192.168.1.1%eth1 port 5001 connected with 192.168.1.4 port 45680 
> (MSS=1448) (burst-period=1.0000s) (trip-times) (sock=4) (peer 2.1.4-master) 
> on 2021-10-26 20:25:29 (PDT)
> [ ID] Burst (start-end)  Transfer     Bandwidth       XferTime  (DC%)     
> Reads=Dist          NetPwr
> [  1] 0.0001-0.0500 sec  40.1 KBytes  6.59 Mbits/sec  49.848 ms (5%)    
> 12=12:0:0:0:0:0:0:0  0
> [  1] 1.0002-1.0461 sec  40.0 KBytes  7.14 Mbits/sec  45.913 ms (4.6%)    
> 10=10:0:0:0:0:0:0:0  0
> [  1] 2.0002-2.0491 sec  40.0 KBytes  6.70 Mbits/sec  48.876 ms (4.9%)    
> 11=11:0:0:0:0:0:0:0  0
> [  1] 3.0002-3.0501 sec  40.0 KBytes  6.57 Mbits/sec  49.886 ms (5%)    
> 10=10:0:0:0:0:0:0:0  0
> [  1] 4.0002-4.0501 sec  40.0 KBytes  6.57 Mbits/sec  49.887 ms (5%)    
> 10=10:0:0:0:0:0:0:0  0
> [  1] 5.0002-5.0501 sec  40.0 KBytes  6.57 Mbits/sec  49.881 ms (5%)    
> 10=10:0:0:0:0:0:0:0  0
> [  1] 6.0002-6.0511 sec  40.0 KBytes  6.44 Mbits/sec  50.895 ms (5.1%)    
> 10=10:0:0:0:0:0:0:0  0
> [  1] 7.0002-7.0501 sec  40.0 KBytes  6.57 Mbits/sec  49.889 ms (5%)    
> 10=10:0:0:0:0:0:0:0  0
> [  1] 8.0002-8.0481 sec  40.0 KBytes  6.84 Mbits/sec  47.901 ms (4.8%)    
> 11=11:0:0:0:0:0:0:0  0
> [  1] 9.0002-9.0491 sec  40.0 KBytes  6.70 Mbits/sec  48.872 ms (4.9%)    
> 10=10:0:0:0:0:0:0:0  0
> [  1] 0.0000-10.0031 sec   400 KBytes   328 Kbits/sec               
> 104=104:0:0:0:0:0:0:0
> 
> Bob
> 
> On Tue, Oct 26, 2021 at 6:12 PM Eric Dumazet <eric.duma...@gmail.com 
> <mailto:eric.duma...@gmail.com>> wrote:
> 
> 
> 
>     On 10/26/21 4:38 PM, Christoph Paasch wrote:
>     > Hi Bob,
>     >
>     >> On Oct 26, 2021, at 4:23 PM, Bob McMahon <bob.mcma...@broadcom.com 
> <mailto:bob.mcma...@broadcom.com> <mailto:bob.mcma...@broadcom.com 
> <mailto:bob.mcma...@broadcom.com>>> wrote:
>     >> I'm confused. I don't see any blocking nor partial writes per the 
> write() at the app level with TCP_NOTSENT_LOWAT set at 4 bytes. The burst is 
> 40K, the write size is 4K and the watermark is 4 bytes. There are ten writes 
> per burst.
>     >
>     > You are on Linux here, right?
>     >
>     > AFAICS, Linux will still accept whatever fits in an skb. And that is 
> likely more than 4K (with GSO on by default).
> 
>     This (max payload per skb) can be tuned at the driver level, at least for 
> experimental purposes or dedicated devices.
> 
>     ip link set dev eth0 gso_max_size 8000
> 
>     To fetch current values :
> 
>     ip -d link sh dev eth0
> 
> 
>     >
>     > However, do you go back to select() after each write() or do you loop 
> over the write() calls?
>     >
>     >
>     > Christoph
>     >
>     >> The S8 histograms are the times waiting on the select().  The first 
> value is the bin number (multiplied by 100usec bin width) and second the bin 
> count. The worst case time is at the end and is timestamped per unix epoch.
>     >>
>     >> The second run is over a controlled WiFi link where a 99.7% point of 
> 4-8ms for a WiFi TX op arbitration win is in the ballpark. The first is 1G 
> wired and is in the 600 usec range. (No media arbitration there.)
>     >>
>     >>  [root@localhost iperf2-code]# src/iperf -c 10.19.87.9 --trip-times -i 
> 1 -e --tcp-write-prefetch 4 -l 4K --burst-size=40K --histograms
>     >> WARN: option of --burst-size without --burst-period defaults 
> --burst-period to 1 second
>     >> ------------------------------------------------------------
>     >> Client connecting to 10.19.87.9, TCP port 5001 with pid 2124 (1 flows)
>     >> Write buffer size: 4096 Byte
>     >> Bursting: 40.0 KByte every 1.00 seconds
>     >> TCP window size: 85.0 KByte (default)
>     >> Event based writes (pending queue watermark at 4 bytes)
>     >> Enabled select histograms bin-width=0.100 ms, bins=10000
>     >> ------------------------------------------------------------
>     >> [  1] local 10.19.87.10%eth0 port 33166 connected with 10.19.87.9 port 
> 5001 (MSS=1448) (prefetch=4) (trip-times) (sock=3) (ct=0.54 ms) on 2021-10-26 
> 16:07:33 (PDT)
>     >> [ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry     
> Cwnd/RTT        NetPwr
>     >> [  1] 0.00-1.00 sec  40.1 KBytes   329 Kbits/sec  11/0          0      
>  14K/5368 us  8
>     >> [  1] 0.00-1.00 sec S8-PDF: bin(w=100us):cnt(10)=1:1,2:5,3:2,4:1,11:1 
> (5.00/95.00/99.7%=1/11/11,Outliers=0,obl/obu=0/0) (1.089 ms/1635289653.928360)
>     >> [  1] 1.00-2.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/569 us  72
>     >> [  1] 1.00-2.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,2:1,3:4,4:1,7:1,8:1 
> (5.00/95.00/99.7%=1/8/8,Outliers=0,obl/obu=0/0) (0.736 ms/1635289654.928088)
>     >> [  1] 2.00-3.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/312 us  131
>     >> [  1] 2.00-3.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,2:2,3:2,5:2,6:1 
> (5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.548 ms/1635289655.927776)
>     >> [  1] 3.00-4.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/302 us  136
>     >> [  1] 3.00-4.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,2:2,3:5,6:1 
> (5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.584 ms/1635289656.927814)
>     >> [  1] 4.00-5.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/316 us  130
>     >> [  1] 4.00-5.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,3:2,4:2,5:2,6:1 
> (5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.572 ms/1635289657.927810)
>     >> [  1] 5.00-6.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/253 us  162
>     >> [  1] 5.00-6.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,2:2,3:4,5:1 
> (5.00/95.00/99.7%=1/5/5,Outliers=0,obl/obu=0/0) (0.417 ms/1635289658.927630)
>     >> [  1] 6.00-7.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/290 us  141
>     >> [  1] 6.00-7.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,3:3,4:3,6:1 
> (5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.573 ms/1635289659.927771)
>     >> [  1] 7.00-8.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/359 us  114
>     >> [  1] 7.00-8.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,3:4,4:3,6:1 
> (5.00/95.00/99.7%=1/6/6,Outliers=0,obl/obu=0/0) (0.570 ms/1635289660.927753)
>     >> [  1] 8.00-9.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/349 us  117
>     >> [  1] 8.00-9.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,3:5,4:1,7:1 
> (5.00/95.00/99.7%=1/7/7,Outliers=0,obl/obu=0/0) (0.608 ms/1635289661.927843)
>     >> [  1] 9.00-10.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0     
>   14K/347 us  118
>     >> [  1] 9.00-10.00 sec S8-PDF: bin(w=100us):cnt(10)=1:3,2:1,3:5,8:1 
> (5.00/95.00/99.7%=1/8/8,Outliers=0,obl/obu=0/0) (0.725 ms/1635289662.927861)
>     >> [  1] 0.00-10.01 sec   400 KBytes   327 Kbits/sec  102/0          0    
>    14K/1519 us  27
>     >> [  1] 0.00-10.01 sec S8(f)-PDF: 
> bin(w=100us):cnt(100)=1:25,2:13,3:36,4:11,5:5,6:5,7:2,8:2,11:1 
> (5.00/95.00/99.7%=1/7/11,Outliers=0,obl/obu=0/0) (1.089 ms/1635289653.928360)
>     >>
>     >> [root@localhost iperf2-code]# src/iperf -c 192.168.1.1 --trip-times -i 
> 1 -e --tcp-write-prefetch 4 -l 4K --burst-size=40K --histograms
>     >> WARN: option of --burst-size without --burst-period defaults 
> --burst-period to 1 second
>     >> ------------------------------------------------------------
>     >> Client connecting to 192.168.1.1, TCP port 5001 with pid 2131 (1 flows)
>     >> Write buffer size: 4096 Byte
>     >> Bursting: 40.0 KByte every 1.00 seconds
>     >> TCP window size: 85.0 KByte (default)
>     >> Event based writes (pending queue watermark at 4 bytes)
>     >> Enabled select histograms bin-width=0.100 ms, bins=10000
>     >> ------------------------------------------------------------
>     >> [  1] local 192.168.1.4%eth1 port 45518 connected with 192.168.1.1 
> port 5001 (MSS=1448) (prefetch=4) (trip-times) (sock=3) (ct=5.48 ms) on 
> 2021-10-26 16:07:56 (PDT)
>     >> [ ID] Interval        Transfer    Bandwidth       Write/Err  Rtry     
> Cwnd/RTT        NetPwr
>     >> [  1] 0.00-1.00 sec  40.1 KBytes   329 Kbits/sec  11/0          0      
>  14K/10339 us  4
>     >> [  1] 0.00-1.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:1,40:1,47:1,49:2,50:3,51:1,60:1 
> (5.00/95.00/99.7%=1/60/60,Outliers=0,obl/obu=0/0) (5.990 ms/1635289676.802143)
>     >> [  1] 1.00-2.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/4853 us  8
>     >> [  1] 1.00-2.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,38:1,39:1,44:1,45:1,49:1,51:1,52:1,60:1 
> (5.00/95.00/99.7%=1/60/60,Outliers=0,obl/obu=0/0) (5.937 ms/1635289677.802274)
>     >> [  1] 2.00-3.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/4991 us  8
>     >> [  1] 2.00-3.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,48:1,49:2,50:2,51:1,60:1,64:1 
> (5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.307 ms/1635289678.794326)
>     >> [  1] 3.00-4.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/4610 us  9
>     >> [  1] 3.00-4.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,49:3,50:3,56:1,64:1 
> (5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.362 ms/1635289679.794335)
>     >> [  1] 4.00-5.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/5028 us  8
>     >> [  1] 4.00-5.00 sec S8-PDF: bin(w=100us):cnt(10)=1:2,49:6,59:1,64:1 
> (5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.367 ms/1635289680.794399)
>     >> [  1] 5.00-6.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/5113 us  8
>     >> [  1] 5.00-6.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,49:3,50:2,58:1,60:1,65:1 
> (5.00/95.00/99.7%=1/65/65,Outliers=0,obl/obu=0/0) (6.442 ms/1635289681.794392)
>     >> [  1] 6.00-7.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/5054 us  8
>     >> [  1] 6.00-7.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,39:1,49:3,51:1,60:2,64:1 
> (5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.374 ms/1635289682.794335)
>     >> [  1] 7.00-8.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/5138 us  8
>     >> [  1] 7.00-8.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,39:2,40:1,49:2,50:1,60:1,64:1 
> (5.00/95.00/99.7%=1/64/64,Outliers=0,obl/obu=0/0) (6.396 ms/1635289683.794338)
>     >> [  1] 8.00-9.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0      
>  14K/5329 us  8
>     >> [  1] 8.00-9.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,38:1,45:2,49:1,50:3,63:1 
> (5.00/95.00/99.7%=1/63/63,Outliers=0,obl/obu=0/0) (6.292 ms/1635289684.794262)
>     >> [  1] 9.00-10.00 sec  40.0 KBytes   328 Kbits/sec  10/0          0     
>   14K/5329 us  8
>     >> [  1] 9.00-10.00 sec S8-PDF: 
> bin(w=100us):cnt(10)=1:2,39:1,49:3,50:3,84:1 
> (5.00/95.00/99.7%=1/84/84,Outliers=0,obl/obu=0/0) (8.306 ms/1635289685.796315)
>     >> [  1] 0.00-10.01 sec   400 KBytes   327 Kbits/sec  102/0          0    
>    14K/6331 us  6
>     >> [  1] 0.00-10.01 sec S8(f)-PDF: 
> bin(w=100us):cnt(100)=1:19,38:2,39:5,40:2,44:1,45:3,47:1,48:1,49:26,50:17,51:4,52:1,56:1,58:1,59:1,60:7,63:1,64:5,65:1,84:1
>  (5.00/95.00/99.7%=1/64/84,Outliers=0,obl/obu=0/0) (8.306 
> ms/1635289685.796315)
>     >>
>     >> Bob
>     >>
>     >> On Tue, Oct 26, 2021 at 11:45 AM Christoph Paasch <cpaa...@apple.com 
> <mailto:cpaa...@apple.com> <mailto:cpaa...@apple.com 
> <mailto:cpaa...@apple.com>>> wrote:
>     >>
>     >>     Hello,
>     >>
>     >>     > On Oct 25, 2021, at 9:24 PM, Eric Dumazet 
> <eric.duma...@gmail.com <mailto:eric.duma...@gmail.com> 
> <mailto:eric.duma...@gmail.com <mailto:eric.duma...@gmail.com>>> wrote:
>     >>     >
>     >>     >
>     >>     >
>     >>     > On 10/25/21 8:11 PM, Stuart Cheshire via Bloat wrote:
>     >>     >> On 21 Oct 2021, at 17:51, Bob McMahon via Make-wifi-fast 
> <make-wifi-f...@lists.bufferbloat.net 
> <mailto:make-wifi-f...@lists.bufferbloat.net> 
> <mailto:make-wifi-f...@lists.bufferbloat.net 
> <mailto:make-wifi-f...@lists.bufferbloat.net>>> wrote:
>     >>     >>
>     >>     >>> Hi All,
>     >>     >>>
>     >>     >>> Sorry for the spam. I'm trying to support a meaningful TCP 
> message latency w/iperf 2 from the sender side w/o requiring e2e clock 
> synchronization. I thought I'd try to use the TCP_NOTSENT_LOWAT event to help 
> with this. It seems that this event goes off when the bytes are in flight vs 
> have reached the destination network stack. If that's the case, then iperf 2 
> client (sender) may be able to produce the message latency by adding the 
> drain time (write start to TCP_NOTSENT_LOWAT) and the sampled RTT.
>     >>     >>>
>     >>     >>> Does this seem reasonable?
>     >>     >>
>     >>     >> I’m not 100% sure what you’re asking, but I will try to help.
>     >>     >>
>     >>     >> When you set TCP_NOTSENT_LOWAT, the TCP implementation won’t 
> report your endpoint as writable (e.g., via kqueue or epoll) until less than 
> that threshold of data remains unsent. It won’t stop you writing more bytes 
> if you want to, up to the socket send buffer size, but it won’t *ask* you for 
> more data until the TCP_NOTSENT_LOWAT threshold is reached.
>     >>     >
>     >>     >
>     >>     > When I implemented TCP_NOTSENT_LOWAT back in 2013 [1], I made 
> sure that sendmsg() would actually
>     >>     > stop feeding more bytes in TCP transmit queue if the current 
> amount of unsent bytes
>     >>     > was above the threshold.
>     >>     >
>     >>     > So it looks like Apple implementation is different, based on 
> your description ?
>     >>
>     >>     Yes, TCP_NOTSENT_LOWAT only impacts the wakeup on iOS/macOS/...
>     >>
>     >>     An app can still fill the send-buffer if it does a sendmsg() with 
> a large buffer or does repeated calls to sendmsg().
>     >>
>     >>     Fur Apple, the goal of TCP_NOTSENT_LOWAT was to allow an app to 
> quickly change the data it "scheduled" to send. And thus allow the app to 
> write the smallest "logical unit" it has. If that unit is 512KB large, the 
> app is allowed to send that.
>     >>     For example, in case of video-streaming one may want to skip ahead 
> in the video. In that case the app still needs to transmit the remaining 
> parts of the previous frame anyways, before it can send the new video frame.
>     >>     That's the reason why the Apple implementation allows one to write 
> more than just the lowat threshold.
>     >>
>     >>
>     >>     That being said, I do think that Linux's way allows for an easier 
> API because the app does not need to be careful at how much data it sends 
> after an epoll/kqueue wakeup. So, the latency-benefits will be easier to get.
>     >>
>     >>
>     >>     Christoph
>     >>
>     >>
>     >>
>     >>     > [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36
>  
> <https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36>
>  
> <https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36
>  
> <https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36>>
>     >>     >
>     >>     > netperf does not use epoll(), but rather a loop over sendmsg().
>     >>     >
>     >>     > One of the point of TCP_NOTSENT_LOWAT for Google was to be able 
> to considerably increase
>     >>     > max number of bytes in transmit queues (3rd column of 
> /proc/sys/net/ipv4/tcp_wmem)
>     >>     > by 10x, allowing for autotune to increase BDP for big RTT flows, 
> this without
>     >>     > increasing memory needs for flows with small RTT.
>     >>     >
>     >>     > In other words, the TCP implementation attempts to keep BDP 
> bytes in flight + TCP_NOTSENT_LOWAT bytes buffered and ready to go. The BDP 
> of bytes in flight is necessary to fill the network pipe and get good 
> throughput. The TCP_NOTSENT_LOWAT of bytes buffered and ready to go is 
> provided to give the source software some advance notice that the TCP 
> implementation will soon be looking for more bytes to send, so that the 
> buffer doesn’t run dry, thereby lowering throughput. (The old SO_SNDBUF 
> option conflates both “bytes in flight” and “bytes buffered and ready to go” 
> into the same number.)
>     >>     >>
>     >>     >> If you wait for the TCP_NOTSENT_LOWAT notification, write a 
> chunk of n bytes of data, and then wait for the next TCP_NOTSENT_LOWAT 
> notification, that will tell you roughly how long it took n bytes to depart 
> the machine. You won’t know why, though. The bytes could depart the machine 
> in response for acks indicating that the same number of bytes have been 
> accepted at the receiver. But the bytes can also depart the machine because 
> CWND is growing. Of course, both of those things are usually happening at the 
> same time.
>     >>     >>
>     >>     >> How to use TCP_NOTSENT_LOWAT is explained in this video:
>     >>     >>
>     >>     >> 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2199 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2199> 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2199 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2199>>>
>     >>     >>
>     >>     >> Later in the same video is a two-minute demo (time offset 42:00 
> to time offset 44:00) showing a “before and after” demo illustrating the 
> dramatic difference this makes for screen sharing responsiveness.
>     >>     >>
>     >>     >> 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2520 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2520> 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2520 
> <https://developer.apple.com/videos/play/wwdc2015/719/?time=2520>>>
>     >>     >>
>     >>     >> Stuart Cheshire
>     >>     >> _______________________________________________
>     >>     >> Bloat mailing list
>     >>     >> bl...@lists.bufferbloat.net 
> <mailto:bl...@lists.bufferbloat.net> <mailto:bl...@lists.bufferbloat.net 
> <mailto:bl...@lists.bufferbloat.net>>
>     >>     >> https://lists.bufferbloat.net/listinfo/bloat 
> <https://lists.bufferbloat.net/listinfo/bloat> 
> <https://lists.bufferbloat.net/listinfo/bloat 
> <https://lists.bufferbloat.net/listinfo/bloat>>
>     >>     >>
>     >>     > _______________________________________________
>     >>     > Bloat mailing list
>     >>     > bl...@lists.bufferbloat.net <mailto:bl...@lists.bufferbloat.net> 
> <mailto:bl...@lists.bufferbloat.net <mailto:bl...@lists.bufferbloat.net>>
>     >>     > https://lists.bufferbloat.net/listinfo/bloat 
> <https://lists.bufferbloat.net/listinfo/bloat> 
> <https://lists.bufferbloat.net/listinfo/bloat 
> <https://lists.bufferbloat.net/listinfo/bloat>>
>     >>
>     >>
>     >> This electronic communication and the information and any files 
> transmitted with it, or attached to it, are confidential and are intended 
> solely for the use of the individual or entity to whom it is addressed and 
> may contain information that is confidential, legally privileged, protected 
> by privacy laws, or otherwise restricted from disclosure to anyone else. If 
> you are not the intended recipient or the person responsible for delivering 
> the e-mail to the intended recipient, you are hereby notified that any use, 
> copying, distributing, dissemination, forwarding, printing, or copying of 
> this e-mail is strictly prohibited. If you received this e-mail in error, 
> please return the e-mail to the sender, delete it from your computer, and 
> destroy any printed copy of it.
> 
> 
> This electronic communication and the information and any files transmitted 
> with it, or attached to it, are confidential and are intended solely for the 
> use of the individual or entity to whom it is addressed and may contain 
> information that is confidential, legally privileged, protected by privacy 
> laws, or otherwise restricted from disclosure to anyone else. If you are not 
> the intended recipient or the person responsible for delivering the e-mail to 
> the intended recipient, you are hereby notified that any use, copying, 
> distributing, dissemination, forwarding, printing, or copying of this e-mail 
> is strictly prohibited. If you received this e-mail in error, please return 
> the e-mail to the sender, delete it from your computer, and destroy any 
> printed copy of it.
_______________________________________________
Cake mailing list
Cake@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cake

Reply via email to