On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <[email protected]> wrote:
>
> Once a socket enters repair mode (TCP_REPAIR socket option with
> TCP_REPAIR_ON value), it's possible to dump the receive sequence
> number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> (using TCP_REPAIR_QUEUE to select it).
>
> If we receive data after the application fetched the sequence number
> or saved the contents of the queue, though, the application will now
> have outdated information, which defeats the whole functionality,
> because this leads to gaps in sequence and data once they're restored
> by the target instance of the application, resulting in a hanging or
> otherwise non-functional TCP connection.
>
> This type of race condition was discovered in the KubeVirt integration
> of passt(1), using a remote iperf3 client connected to an iperf3
> server running in the guest which is being migrated. The setup allows
> traffic to reach the origin node hosting the guest during the
> migration.
>
> If passt dumps sequence number and contents of the queue *before*
> further data is received and acknowledged to the peer by the kernel,
> once the TCP data connection is migrated to the target node, the
> remote client becomes unable to continue sending, because a portion
> of the data it sent *and received an acknowledgement for* is now lost.
>
> Schematically:
>
> 1. client --seq 1:100--> origin host --> passt --> guest --> server
>
> 2. client <--ACK: 100-- origin host
>
> 3. migration starts,

Here, a netfilter rule or BPF program must be installed to temporarily
drop packets until the migration completes.

We do not want tests for unlikely conditions in the fast path.

You can find a similar issue:
https://lore.kernel.org/netdev/[email protected]/

> passt enables repair mode, dumps the sequence
>    number (101) and sends it to the target node of the guest migration
>
> 4. client --seq 101:201--> origin host (passt not receiving anymore)
>
> 5. client <--ACK: 201-- origin host
>
> 6. migration completes, and passt restores sequence number 101 on the
>    migrated socket
>
> 7. client --seq 201:301--> target host (now seeing a sequence jump)
>
> 8. client <--ACK: 100-- target host
>
> ...and the connection can't recover anymore, because the client can't
> resend data that was already (erroneously) acknowledged. We need to
> avoid step 5. above.
>
> This would equally affect CRIU (the other known user of TCP_REPAIR),
> should data be received while the original container is frozen: the
> sequence dumped and the contents of the saved incoming queue would
> then depend on the timing.
>
> The race condition is also illustrated in the kselftests introduced
> by the next patch.
>
> To prevent this issue, discard data received for a socket in repair
> mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
>
> Fixes: ee9952831cfd ("tcp: Initial repair mode")
> Tested-by: Laurent Vivier <[email protected]>
> Signed-off-by: Stefano Brivio <[email protected]>
> ---
>  include/net/dropreason-core.h |  3 +++
>  net/ipv4/tcp_input.c          | 14 +++++++++++++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> index 2f312d1f67d6..19ab9e6ffc33 100644
> --- a/include/net/dropreason-core.h
> +++ b/include/net/dropreason-core.h
> @@ -9,6 +9,7 @@
>         FN(SOCKET_CLOSE)                \
>         FN(SOCKET_FILTER)               \
>         FN(SOCKET_RCVBUFF)              \
> +       FN(SOCKET_REPAIR)               \
>         FN(UNIX_DISCONNECT)             \
>         FN(UNIX_SKIP_OOB)               \
>         FN(PKT_TOO_SMALL)               \
> @@ -158,6 +159,8 @@ enum skb_drop_reason {
>         SKB_DROP_REASON_SOCKET_FILTER,
>         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
>         SKB_DROP_REASON_SOCKET_RCVBUFF,
> +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> +       SKB_DROP_REASON_SOCKET_REPAIR,
>         /**
>          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
>          * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d5c9e65d9760..6eca34274f97 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>   *       or pure receivers (this means either the sequence number or the ack
>   *       value must stay constant)
>   *     - Unexpected TCP option.
> + *     - Socket is in repair mode.
>   *
>   *     When these conditions are not satisfied it drops into a standard
>   *     receive procedure patterned after RFC793 to handle all cases.
> @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>
>         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
>             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> +           !tp->repair) {
>                 int tcp_header_len = tp->tcp_header_len;
>                 s32 delta = 0;
>                 int flag = 0;
> @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>                 goto discard;
>         }
>
> +       if (tp->repair) {
> +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> +               goto discard;
> +       }
> +
>         /*
>          *      Standard slow path.
>          */
> @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>         int queued = 0;
>         SKB_DR(reason);
>
> +       if (tp->repair) {
> +               SKB_DR_SET(reason, SOCKET_REPAIR);
> +               goto discard;
> +       }
> +
>         switch (sk->sk_state) {
>         case TCP_CLOSE:
>                 SKB_DR_SET(reason, TCP_CLOSE);
> --
> 2.43.0
>
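
To put numbers on steps 1-5 of the schematic in the quoted commit
message, here is a minimal user-space toy model of the race. It is only
a sketch and not kernel code: it assumes a receiver that tracks nothing
but the next expected sequence number, and every name in it (struct
toy_sock, toy_rcv, toy_dump_seq, toy_lost_bytes) is hypothetical. It
does not say which layer should do the dropping (netfilter, BPF, or the
TCP input path); it only shows that data acknowledged after the dump is
lost, while dropped data is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy receiver: only tracks the next expected sequence. */
struct toy_sock {
	uint32_t rcv_nxt;	/* next sequence number expected */
	bool repair;		/* true while "in repair mode" */
	bool drop_in_repair;	/* true: discard input during repair */
};

/*
 * Feed one segment [seq, seq + len) to the toy receiver.
 * Returns the cumulative ACK the peer would see, or 0 if the segment
 * was dropped and nothing is acknowledged (toy convention only).
 */
static uint32_t toy_rcv(struct toy_sock *sk, uint32_t seq, uint32_t len)
{
	assert(len > 0);

	if (sk->repair && sk->drop_in_repair)
		return 0;		/* dropped: peer will retransmit */
	if (seq != sk->rcv_nxt)
		return sk->rcv_nxt;	/* out of order: duplicate ACK */

	sk->rcv_nxt += len;
	return sk->rcv_nxt;
}

/* Enter repair mode and return the sequence number a dumper would save. */
static uint32_t toy_dump_seq(struct toy_sock *sk)
{
	sk->repair = true;
	return sk->rcv_nxt;
}

/*
 * Run steps 1-5 of the schematic. Returns the number of bytes that were
 * acknowledged to the peer but are missing from the dumped state.
 */
static uint32_t toy_lost_bytes(bool drop_in_repair)
{
	struct toy_sock sk = { .rcv_nxt = 1, .drop_in_repair = drop_in_repair };
	uint32_t dumped, acked;

	toy_rcv(&sk, 1, 100);		/* steps 1-2: first 100 bytes acked */
	dumped = toy_dump_seq(&sk);	/* step 3: dump records 101 */
	acked = toy_rcv(&sk, 101, 100);	/* steps 4-5: arrives during repair */

	return acked > dumped ? acked - dumped : 0;
}
```

Calling toy_lost_bytes(false) models the racy behaviour, where the
second segment is acknowledged after the dump; toy_lost_bytes(true)
models any scheme that drops input for the duration of the migration,
so the peer keeps the data and can retransmit it to the target.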

