Hi Eric, thank you for the quick review.
On Mon, Mar 09, 2026 at 10:22:39AM +0100, Eric Dumazet wrote:
> On Mon, Mar 9, 2026 at 9:03 AM Simon Baatz via B4 Relay
> <[email protected]> wrote:
> >
> > From: Simon Baatz <[email protected]>
> >
> > By default, the Linux TCP implementation does not shrink the
> > advertised window (RFC 7323 calls this "window retraction") with the
> > following exceptions:
> >
> > - When an incoming segment cannot be added due to the receive buffer
> >   running out of memory. Since commit 8c670bdfa58e ("tcp: correct
> >   handling of extreme memory squeeze") a zero window will be
> >   advertised in this case. It turns out that reaching the required
> >   memory pressure is easy when window scaling is in use. In the
> >   simplest case, sending a sufficient number of segments smaller than
> >   the scale factor to a receiver that does not read data is enough.
> >
> > - Commit b650d953cd39 ("tcp: enforce receive buffer memory limits by
> >   allowing the tcp window to shrink") addressed the "eating memory"
> >   problem by introducing a sysctl knob that allows shrinking the
> >   window before running out of memory.
> >
> > However, RFC 7323 does not only state that shrinking the window is
> > necessary in some cases, it also formulates requirements for TCP
> > implementations when doing so (Section 2.4).
> >
> > This commit addresses the receiver-side requirements: After retracting
> > the window, the peer may have a snd_nxt that lies within a previously
> > advertised window but is now beyond the retracted window. This means
> > that all incoming segments (including pure ACKs) will be rejected
> > until the application happens to read enough data to let the peer's
> > snd_nxt be in window again (which may be never).
> >
> > To comply with RFC 7323, the receiver MUST honor any segment that
> > would have been in window for any ACK sent by the receiver and, when
> > window scaling is in effect, SHOULD track the maximum window sequence
> > number it has advertised. This patch tracks that maximum window
> > sequence number rcv_mwnd_seq throughout the connection and uses it in
> > tcp_sequence() when deciding whether a segment is acceptable.
> >
> > rcv_mwnd_seq is updated together with rcv_wup and rcv_wnd in
> > tcp_select_window(). If we count tcp_sequence() as fast path, it is
> > read in the fast path. Therefore, rcv_mwnd_seq is put into rcv_wnd's
> > cacheline group.
> >
> > The logic for handling received data in tcp_data_queue() is already
> > sufficient and does not need to be updated.
> >
> > Signed-off-by: Simon Baatz <[email protected]>
> >
> > ...
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index f0ebcc7e287173be6198fd100130e7ba1a1dbf03..c86910d147f2394bf414d7691d8f90ed41c1b0e3 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -293,6 +293,7 @@ static u16 tcp_select_window(struct sock *sk)
> >  		tp->pred_flags = 0;
> >  		tp->rcv_wnd = 0;
> >  		tp->rcv_wup = tp->rcv_nxt;
> > +		tcp_update_max_rcv_wnd_seq(tp);
>
> Presumably we do not need tcp_update_max_rcv_wnd_seq() here ?

When we don't update here and are forced to accept a beyond-window
packet because the receive queue is empty, we can reach a state where

	rcv_mwnd_seq < rcv_wup + rcv_wnd == rcv_nxt

I noticed this case when instrumenting the kernel and got violations of
the invariant rcv_wup + rcv_wnd <= rcv_mwnd_seq. So, while not strictly
needed (tcp_max_receive_window() would still be 0 as
rcv_nxt > rcv_mwnd_seq), I opted to include the call here to keep
rcv_mwnd_seq the actual maximum sequence number at all times.

> Otherwise patch looks good, thanks.

-- 
Simon Baatz <[email protected]>

