On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <[email protected]> wrote:
>
> Once a socket enters repair mode (TCP_REPAIR socket option with
> TCP_REPAIR_ON value), it's possible to dump the receive sequence
> number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> (using TCP_REPAIR_QUEUE to select it).
>
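
For reference, the dump sequence described above could look roughly like the sketch below from userspace. This is only a minimal illustration, not passt or CRIU code: the function name and the SKETCH_* constants are hypothetical, the constants mirror the kernel UAPI values of TCP_REPAIR_ON and TCP_RECV_QUEUE, and the caller is assumed to hold CAP_NET_ADMIN.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

/* Fallbacks in case the libc headers predate these options. */
#ifndef TCP_REPAIR
#define TCP_REPAIR 19
#endif
#ifndef TCP_REPAIR_QUEUE
#define TCP_REPAIR_QUEUE 20
#endif
#ifndef TCP_QUEUE_SEQ
#define TCP_QUEUE_SEQ 21
#endif

/* Hypothetical names; values taken from the kernel UAPI (linux/tcp.h). */
#define SKETCH_REPAIR_ON  1	/* TCP_REPAIR_ON */
#define SKETCH_RECV_QUEUE 1	/* TCP_RECV_QUEUE */

/*
 * Put the socket in repair mode, select the receive queue and read its
 * sequence number. Returns 0 on success, -1 on error (errno is set by
 * the failing call). The socket is deliberately left in repair mode so
 * that the caller can go on to save the queue contents.
 */
int sketch_dump_rcv_seq(int fd, uint32_t *seq)
{
	int on = SKETCH_REPAIR_ON;
	int queue = SKETCH_RECV_QUEUE;
	socklen_t len = sizeof(*seq);

	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue,
		       sizeof(queue)) < 0)
		return -1;
	if (getsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, seq, &len) < 0)
		return -1;

	return 0;
}
```

Any segment the kernel accepts between the getsockopt() call and the moment the connection is torn down on the origin node makes the returned value stale.
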
> If we receive data after the application fetched the sequence number
> or saved the contents of the queue, though, the application will now
> have outdated information, which defeats the whole functionality,
> because this leads to gaps in sequence and data once they're restored
> by the target instance of the application, resulting in a hanging or
> otherwise non-functional TCP connection.
>
> This type of race condition was discovered in the KubeVirt integration
> of passt(1), using a remote iperf3 client connected to an iperf3
> server running in the guest which is being migrated. The setup allows
> traffic to reach the origin node hosting the guest during the
> migration.
>
> If passt dumps sequence number and contents of the queue *before*
> further data is received and acknowledged to the peer by the kernel,
> once the TCP data connection is migrated to the target node, the
> remote client becomes unable to continue sending, because a portion
> of the data it sent *and received an acknowledgement for* is now lost.
>
> Schematically:
>
> 1. client --seq 1:100--> origin host --> passt --> guest --> server
>
> 2. client <--ACK: 100-- origin host
>
> 3. migration starts,

Here, a netfilter rule or BPF program must be installed to
drop packets temporarily until migration completes.

We do not want tests for unlikely conditions in the fast path.

You can find a similar issue:
https://lore.kernel.org/netdev/[email protected]/

> passt enables repair mode, dumps the sequence
>    number (101) and sends it to the target node of the guest migration
>
> 4. client --seq 101:201--> origin host (passt not receiving anymore)
>
> 5. client <--ACK: 201-- origin host
>
> 6. migration completes, and passt restores sequence number 101 on the
>    migrated socket
>
> 7. client --seq 201:301--> target host (now seeing a sequence jump)
>
> 8. client <--ACK: 100-- target host
>
> ...and the connection can't recover anymore, because the client can't
> resend data that was already (erroneously) acknowledged. We need to
> avoid step 5. above.
>
> This would equally affect CRIU (the other known user of TCP_REPAIR),
> should data be received while the original container is frozen: the
> sequence dumped and the contents of the saved incoming queue would
> then depend on the timing.
>
> The race condition is also illustrated in the kselftests introduced
> by the next patch.
>
> To prevent this issue, discard data received for a socket in repair
> mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
>
> Fixes: ee9952831cfd ("tcp: Initial repair mode")
> Tested-by: Laurent Vivier <[email protected]>
> Signed-off-by: Stefano Brivio <[email protected]>
> ---
>  include/net/dropreason-core.h |  3 +++
>  net/ipv4/tcp_input.c          | 14 +++++++++++++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> index 2f312d1f67d6..19ab9e6ffc33 100644
> --- a/include/net/dropreason-core.h
> +++ b/include/net/dropreason-core.h
> @@ -9,6 +9,7 @@
>         FN(SOCKET_CLOSE)                \
>         FN(SOCKET_FILTER)               \
>         FN(SOCKET_RCVBUFF)              \
> +       FN(SOCKET_REPAIR)               \
>         FN(UNIX_DISCONNECT)             \
>         FN(UNIX_SKIP_OOB)               \
>         FN(PKT_TOO_SMALL)               \
> @@ -158,6 +159,8 @@ enum skb_drop_reason {
>         SKB_DROP_REASON_SOCKET_FILTER,
>         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
>         SKB_DROP_REASON_SOCKET_RCVBUFF,
> +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> +       SKB_DROP_REASON_SOCKET_REPAIR,
>         /**
>          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
>          * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d5c9e65d9760..6eca34274f97 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>   *       or pure receivers (this means either the sequence number or the ack
>   *       value must stay constant)
>   *     - Unexpected TCP option.
> + *     - Socket is in repair mode.
>   *
>   *     When these conditions are not satisfied it drops into a standard
>   *     receive procedure patterned after RFC793 to handle all cases.
> @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>
>         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
>             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> +           !tp->repair) {
>                 int tcp_header_len = tp->tcp_header_len;
>                 s32 delta = 0;
>                 int flag = 0;
> @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>                 goto discard;
>         }
>
> +       if (tp->repair) {
> +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> +               goto discard;
> +       }
> +
>         /*
>          *      Standard slow path.
>          */
> @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>         int queued = 0;
>         SKB_DR(reason);
>
> +       if (tp->repair) {
> +               SKB_DR_SET(reason, SOCKET_REPAIR);
> +               goto discard;
> +       }
> +
>         switch (sk->sk_state) {
>         case TCP_CLOSE:
>                 SKB_DR_SET(reason, TCP_CLOSE);
> --
> 2.43.0
>
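
To make the race from the schematic concrete, here is a toy model of the receiver side. It is not kernel code and every name in it is hypothetical. It only tracks the next expected sequence number and computes how many bytes were acknowledged to the peer but are missing from the dumped state, with and without dropping segments while the dump is in progress. Sequence numbers follow the usual convention that an ACK names the next expected byte, so the values differ slightly from the schematic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy receiver state (hypothetical, for illustration only). */
struct toy_rx {
	uint32_t rcv_nxt;	/* next sequence number expected */
	bool repair;		/* socket is in repair mode */
	bool drop_in_repair;	/* drop segments while in repair mode */
};

/*
 * Feed one segment covering [seq, seq + len) to the receiver.
 * Returns true and stores the cumulative ACK in *ack if the segment was
 * processed, false if it was dropped without any ACK being sent.
 */
bool toy_rx_segment(struct toy_rx *rx, uint32_t seq, uint32_t len,
		    uint32_t *ack)
{
	assert(len > 0);

	if (rx->repair && rx->drop_in_repair)
		return false;	/* peer gets no ACK and will retransmit */

	if (seq == rx->rcv_nxt)
		rx->rcv_nxt += len;	/* in-order data is queued */

	*ack = rx->rcv_nxt;
	return true;
}

/*
 * Replay steps 1 to 5 of the schematic and return the number of bytes
 * that were acknowledged to the peer but are not covered by the dumped
 * sequence number.
 */
uint32_t toy_migration_gap(bool drop_in_repair)
{
	struct toy_rx rx = {
		.rcv_nxt = 1,
		.repair = false,
		.drop_in_repair = drop_in_repair,
	};
	uint32_t ack = 0, highest_ack, dumped;

	toy_rx_segment(&rx, 1, 100, &ack);	/* steps 1 and 2 */
	highest_ack = ack;

	rx.repair = true;			/* step 3: repair mode on */
	dumped = rx.rcv_nxt;			/* sequence sent to target */

	if (toy_rx_segment(&rx, 101, 100, &ack))	/* steps 4 and 5 */
		highest_ack = ack;

	return highest_ack - dumped;
}
```

In this model, toy_migration_gap(false) returns 100, the bytes the client can never resend, while toy_migration_gap(true) returns 0 because the client simply retransmits to the target once the migration completes. The model says nothing about where the drop should happen, which is the point under discussion.
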
