Hi Hans,

Thanks for the reply and the suggestions.

> Have you tested using FreeBSD main / 14 ?

I tested 14.0-CURRENT built on 2023-04-27; it is indeed much improved.
Now the TCP sender reaches 100Mbps in 4 seconds on a link with 100ms delay.

% uname -a
FreeBSD 14.0-CURRENT #0 main-n262599-60167184abd5: Thu Apr 27 08:09:50 UTC 2023

schen@freebsd14:~/recipes/tpc % bin/tcpperf -c 192.168.0.1 -t 6
Connected 192.168.0.100:59302 -> 192.168.0.1:2009, congestion control: cubic
Time (s)  Throughput   Bitrate    Cwnd    Rwnd  sndbuf  ssthresh  rtt/var
  0.000s   0.00kB/s   0.00kbps  14.1Ki  63.6Ki  32.8Ki    1024Mi  97.8ms/2500
  1.014s    776kB/s   6205kbps   166Ki   992Ki   313Ki    1024Mi  100.0ms/1875
  2.021s   3643kB/s   29.1Mbps   495Ki  1491Ki  1017Ki    1024Mi  100.0ms/1875
  3.029s   7544kB/s   60.3Mbps   932Ki  2096Ki  1817Ki    1024Mi  100.0ms/1875
  4.036s   12.9MB/s    103Mbps  1729Ki  3064Ki  1817Ki    1024Mi  100.0ms/1875
  5.046s   18.2MB/s    145Mbps  2606Ki  3056Ki  1817Ki    1024Mi  96.9ms/6875
  6.090s   17.8MB/s    143Mbps  3074Ki  2974Ki  1817Ki    1024Mi  113.4ms/11250
Sender   transferred 62.0MBytes in 6.090s, throughput: 10.2MBytes/s, 81.4Mbits/s
Receiver transferred 62.0MBytes in 6.191s, throughput: 10.0MBytes/s, 80.1Mbits/s

Cwnd increased much faster than on 13.2-RELEASE.
From the 5th second on, throughput is limited by sndbuf: 1817Ki / 100ms
= 18.2MB/s.
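As a sanity check (plain arithmetic, nothing FreeBSD-specific): a sender can keep at most one sndbuf of data in flight, so steady-state throughput is capped at sndbuf / RTT:

```python
# Back-of-the-envelope check of the sndbuf-limited plateau seen above.
sndbuf_kib = 1817          # KiB, from the tcpperf output above
rtt_s = 0.100              # 100 ms emulated delay
cap_bytes_per_s = sndbuf_kib * 1024 / rtt_s
print(f"{cap_bytes_per_s / 1e6:.1f} MB/s")  # ~18.6 MB/s, matching the observed plateau
```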

Interestingly, it's not due to lro_nsegs, but a side effect of
https://reviews.freebsd.org/D32693.
Namely, this one-line change fixed (or vastly improved) slow-start in 13.x:

--- a/usr/src/sys/conf/files 2023-04-06 17:34:41.000000000 -0700
+++ b/usr/src/sys/conf/files 2023-05-02 23:00:38.000000000 -0700
@@ -4412,6 +4412,7 @@
 netinet/raw_ip.c optional inet | inet6
 netinet/cc/cc.c optional inet | inet6
 netinet/cc/cc_newreno.c optional inet | inet6
+netinet/khelp/h_ertt.c optional inet | inet6
 netinet/sctp_asconf.c optional inet sctp | inet6 sctp
 netinet/sctp_auth.c optional inet sctp | inet6 sctp
 netinet/sctp_bsd_addr.c optional inet sctp | inet6 sctp
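(As an aside: instead of patching sys/conf/files and rebuilding, it may be possible to get the same effect on a stock 13.x kernel by loading the ERTT helper as a Khelp module; see h_ertt(4). I have not verified this on 13.x, so treat it as a sketch:)

```shell
# Untested shortcut: load the ERTT Khelp module at runtime rather than
# compiling it into the kernel.  Requires root.
kldload h_ertt
kldstat | grep ertt   # confirm the module is loaded
```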

Here's the tcpdump output after compiling netinet/khelp/h_ertt.c into a
13.x kernel by default:

 0.000 IP src > sink: Flags [S], seq 392582262, win 65535, options
[mss 1460,nop,wscale 6,sackOK,TS val 840935345 ecr 0], length 0
 0.100 IP sink > src: Flags [S.], seq 3065702766, ack 392582263, win
65160, options [mss 1460,sackOK,TS val 408756323 ecr
840935345,nop,wscale 7], length 0
 0.100 IP src > sink: Flags [.], ack 1, win 1027, options [nop,nop,TS
val 840935450 ecr 408756323], length 0

 // First round-trip: cwnd = 10 * MSS
 0.101 IP src > sink: [.], seq 1:14481, ack 1, win 1027, length 14480
 0.201 IP sink > src: [.], ack 14481, win 445, length 0

 // cwnd += 2 * MSS; the data went out as two segments (TSO disabled
 // by h_ertt) for better RTT measurement
 0.201 IP src > sink: [.], seq 14481:15929, ack 1, win 1027, length 1448
 0.202 IP src > sink: [.], seq 15929:31857, ack 1, win 1027, length 15928
 // cwnd == 12 here

 // Got ACK for the 1448 segment, cwnd += 1 * MSS, sent two more segs.
 0.302 IP sink > src: [.], ack 15929, win 501, length 0
 0.302 IP src > sink: [.], seq 31857:33305, ack 1, win 1027, length 1448
 0.302 IP src > sink: [.], seq 33305:34753, ack 1, win 1027, length 1448
 // cwnd == 13 here

 // Got ACK for the 15928 segment, cwnd += 2 * MSS, sent 13-MSS segment
 0.302 IP sink > src: [.], ack 31857, win 440, length 0
 0.302 IP src > sink: [.], seq 34753:53577, ack 1, win 1027, length 18824
 // cwnd == 15 here, bytes in flight = 15 * MSS

 // ACK of 1448 bytes, sent two more segments, typical slow-start
 0.403 IP sink > src: [.], ack 33305, win 501, length 0
 0.403 IP src > sink: [.], seq 53577:55025, ack 1, win 1027, length 1448
 0.403 IP src > sink: [.], seq 55025:56473, ack 1, win 1027, length 1448
 // ACK of 1448 bytes, sent 2-MSS segment, typical slow-start with TSO
 0.403 IP sink > src: [.], ack 34753, win 496, length 0
 0.403 IP src > sink: [.], seq 56473:59369, ack 1, win 1027, length 2896
 // cwnd == 17 here

 // ACK of 18824, cwnd += 2 * MSS, sent 15-MSS segment
 0.403 IP sink > src: [.], ack 53577, win 795, length 0
 0.403 IP src > sink: [.], seq 59369:81089, ack 1, win 1027, length 21720
 // cwnd == 19 here, bytes in flight = 19 * MSS

marked_packet_rtt() in h_ertt.c sometimes turns off TSO to get a better
RTT measurement.  This results in more segments being sent and more ACKs
being received, so cwnd can increase faster.
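To see how much ACK frequency matters, here is a toy model (my own sketch, not FreeBSD code) of RFC 3465 slow start, cwnd += min(bytes_acked, abc_l_var * SMSS) per ACK, comparing a receiver that ACKs every 2 MSS with one whose ACKs are stretched to 45 MSS by LRO:

```python
MSS = 1448          # bytes
ABC_L_VAR = 2       # FreeBSD default net.inet.tcp.abc_l_var
IW = 10 * MSS       # initial window (RFC 6928)

def rtts_to_reach(target_cwnd, mss_per_ack):
    """RTT rounds for cwnd to reach target_cwnd when the receiver sends
    one ACK per mss_per_ack MSS of data (RFC 3465 byte counting)."""
    cwnd, rtts = IW, 0
    while cwnd < target_cwnd:
        acks = max(1, cwnd // (mss_per_ack * MSS))       # ACKs per window
        incr = min(mss_per_ack * MSS, ABC_L_VAR * MSS)   # per-ACK growth
        cwnd += acks * incr
        rtts += 1
    return rtts

bdp = 1_250_000  # ~100 Mbps x 100 ms
print(rtts_to_reach(bdp, 2))    # 7 RTTs: cwnd doubles every round
print(rtts_to_reach(bdp, 45))   # on the order of 100 RTTs: near-linear growth
```

At 100ms per RTT, that is the difference between under a second and many tens of seconds of ramp-up, which is why a few extra ACKs per window make such a visible difference.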

It really sounds like a butterfly effect to me.

Regards,
Shuo

On Tue, May 2, 2023 at 3:04 AM Hans Petter Selasky <h...@selasky.org> wrote:
>
> On 5/2/23 11:14, Hans Petter Selasky wrote:
> > Hi Chen!
> >
> > The FreeBSD mbufs carry the number of ACKs that have been joined
> > together into the following field:
> >
> > m->m_pkthdr.lro_nsegs
> >
> > Can this value be of any use to cc_newreno ?
> >
> > --HPS
>
> Hi Chen,
>
> Have you tested using FreeBSD main / 14 ?
>
> The "nsegs" are passed along like this:
>
> nsegs = max(1, m->m_pkthdr.lro_nsegs);
>
> ...
>
> cc_ack_received(tp, th, nsegs, CC_ACK);
>
> ...
>
> (Newreno - FreeBSD-14)
>
>                                  incr = min(ccv->bytes_this_ack,
>                                      ccv->nsegs * abc_val *
>                                      CCV(ccv, t_maxseg));
>
> And in FreeBSD-10 being mentioned in your article:
>
> (Newreno - FreeBSD-10)
>
>                                  incr = min(ccv->bytes_this_ack,
>                                      V_tcp_abc_l_var * CCV(ccv, t_maxseg));
>
>
> There is no such thing.
>
> This issue may already have been fixed!
>
> --HPS
> >
> > On 5/2/23 09:46, Chen Shuo wrote:
> >> As per newreno_ack_received() in sys/netinet/cc/cc_newreno.c,
> >> FreeBSD TCP sender strictly follows RFC 5681 with the RFC 3465 extension.
> >> That is, during slow-start, when receiving an ACK of 'bytes_acked'
> >>
> >>      cwnd += min(bytes_acked, abc_l_var * SMSS);  // abc_l_var = 2 dflt
> >>
> >> As discussed in sec3.2 of RFC 3465, L=2*SMSS bytes exactly balances
> >> the negative impact of the delayed ACK algorithm.  RFC 5681 also
> >> requires that a receiver SHOULD generate an ACK for at least every
> >> second full-sized segment, so bytes_acked per ACK is at most 2 * SMSS.
> >> If both sender and receiver follow it, cwnd should grow exponentially
> >> during slow-start:
> >>
> >>      cwnd *= 2    (per RTT)
> >>
> >> However, LRO and TSO are widely used today, so the receiver may generate
> >> far fewer ACKs than it used to.  As I observed, both FreeBSD and
> >> Linux generate at most one ACK per segment assembled by LRO/GRO.
> >> The worst case is one ACK per 45 MSS, as 45 * 1448 = 65160 < 65535.
> >>
> >> Sending 1MB over a link of 100ms delay from FreeBSD 13.2:
> >>
> >>   0.000 IP sender > sink: Flags [S], seq 205083268, win 65535, options
> >> [mss 1460,nop,wscale 10,sackOK,TS val 495212525 ecr 0], length 0
> >>   0.100 IP sink > sender: Flags [S.], seq 708257395, ack 205083269, win
> >> 65160, options [mss 1460,sackOK,TS val 563185696 ecr
> >> 495212525,nop,wscale 7], length 0
> >>   0.100 IP sender > sink: Flags [.], ack 1, win 65, options [nop,nop,TS
> >> val 495212626 ecr 563185696], length 0
> >>   // TSopt omitted below for brevity.
> >>
> >>   // cwnd = 10 * MSS, sent 10 * MSS
> >>   0.101 IP sender > sink: Flags [.], seq 1:14481, ack 1, win 65,
> >> length 14480
> >>
> >>   // got one ACK for 10 * MSS, cwnd += 2 * MSS, sent 12 * MSS
> >>   0.201 IP sink > sender: Flags [.], ack 14481, win 427, length 0
> >>   0.201 IP sender > sink: Flags [.], seq 14481:31857, ack 1, win 65,
> >> length 17376
> >>
> >>   // got ACK of 12*MSS above, cwnd += 2 * MSS, sent 14 * MSS
> >>   0.301 IP sink > sender: Flags [.], ack 31857, win 411, length 0
> >>   0.301 IP sender > sink: Flags [.], seq 31857:52129, ack 1, win 65,
> >> length 20272
> >>
> >>   // got ACK of 14*MSS above, cwnd += 2 * MSS, sent 16 * MSS
> >>   0.402 IP sink > sender: Flags [.], ack 52129, win 395, length 0
> >>   0.402 IP sender > sink: Flags [P.], seq 52129:73629, ack 1, win 65,
> >> length 21500
> >>   0.402 IP sender > sink: Flags [.], seq 73629:75077, ack 1, win 65,
> >> length 1448
> >>
> >> As a consequence, instead of growing exponentially, cwnd grows
> >> more-or-less quadratically during slow-start, unless abc_l_var is
> >> set to a sufficiently large value.
> >>
> >> NewReno took more than 20 seconds to ramp up throughput to 100Mbps
> >> over an emulated 100ms-delay link, while Linux took ~2 seconds.
> >> I can provide the pcap file if anyone is interested.
> >>
> >> Switching to CUBIC won't help, because it uses the logic in NewReno
> >> ack_received() for slow start.
> >>
> >> Is this a well-known issue, and is abc_l_var the only cure for it?
> >> https://calomel.org/freebsd_network_tuning.html
> >>
> >> Thank you!
> >>
> >> Best,
> >> Shuo Chen
> >>
> >
> >
>
