On Tue, May 19, 2026 at 9:12 AM Stefano Brivio <[email protected]> wrote:
>
> On Tue, 19 May 2026 08:36:27 -0700
> Andrei Vagin <[email protected]> wrote:
>
> > On Mon, May 18, 2026 at 4:34 AM Stefano Brivio <[email protected]> wrote:
> > >
> > > On Mon, 18 May 2026 00:57:16 -0700
> > > Eric Dumazet <[email protected]> wrote:
> > >
> > > > On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <[email protected]> 
> > > > wrote:
> > > > >
> > > > > On Sun, 17 May 2026 12:05:45 -0700
> > > > > Kuniyuki Iwashima <[email protected]> wrote:
> > > > >
> > > > > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio 
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > > > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > > > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue 
> > > > > > > itself
> > > > > > > (using TCP_REPAIR_QUEUE to select it).
> > > > > > >
> > > > > > > If we receive data after the application fetched the sequence 
> > > > > > > number
> > > > > > > or saved the contents of the queue, though, the application will 
> > > > > > > now
> > > > > > > have outdated information, which defeats the whole functionality,
> > > > > > > because this leads to gaps in sequence and data once they're 
> > > > > > > restored
> > > > > > > by the target instance of the application, resulting in a hanging 
> > > > > > > or
> > > > > > > otherwise non-functional TCP connection.
> > > > > > >
> > > > > > > This type of race condition was discovered in the KubeVirt 
> > > > > > > integration
> > > > > > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > > > > > server running in the guest which is being migrated. The setup 
> > > > > > > allows
> > > > > > > traffic to reach the origin node hosting the guest during the
> > > > > > > migration.
> > > > > > >
> > > > > > > If passt dumps sequence number and contents of the queue *before*
> > > > > > > further data is received and acknowledged to the peer by the 
> > > > > > > kernel,
> > > > > > > once the TCP data connection is migrated to the target node, the
> > > > > > > remote client becomes unable to continue sending, because a 
> > > > > > > portion
> > > > > > > of the data it sent *and received an acknowledgement for* is now 
> > > > > > > lost.
> > > > > > >
> > > > > > > Schematically:
> > > > > > >
> > > > > > > 1. client --seq 1:100--> origin host --> passt --> guest --> 
> > > > > > > server
> > > > > > >
> > > > > > > 2. client <--ACK: 100-- origin host
> > > > > > >
> > > > > > > 3. migration starts,
> > > > > >
> > > > > > Here, a netfilter rule or bpf prog must be installed to
> > > > > > drop packets temporarily until migration completes.
> > > > >
> > > > > Thanks for the review.
> > > > >
> > > > > I have to say it's rather unexpected to have to work around obvious
> > > > > kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> > > > > and this is handled correctly on the sending side (see
> > > > > tcp_write_xmit()), but was clearly forgotten on the receiving side.
> > > >
> > > > Have you contacted TCP repair authors for best practices?
> > >
> > > I Cc'ed them here, Pavel is the author (but as far as I understand not
> > > active in kernel development anymore), and I know that Andrei did some
> > > substantial work on it in the past, so he's Cc'ed here as well.
> > >
> > > But we are following what CRIU (the userspace reference implementation)
> > > does, and CRIU would be affected by this issue as well (depending on
> > > usage).
> >
> > Before extracting the socket state, CRIU uses netfilter (iptables or
> > nftables) to block all traffic associated with the specific TCP
> > connection or, in the case of a container, the entire network namespace.
>
> ...by default, yes, by "depending on usage" I actually meant
> '--network-lock skip', but I have to admit I'm not sure how commonly
> that is used.
>
> > This approach provides two main benefits:
> >
> > During the dump, we don't need to leave all sockets in repair mode for
> > the entire duration of the dump.  We enable and disable repair mode just
> > to grab the state. It's simplify a roll back if something goes wrong.
>
> Thanks for clarifying this aspect, I wasn't sure whether it was done to
> make rollback easier.

I would rephrase that: it likely wasn't implemented originally because
Pavel may not have considered it necessary. When checkpointing a
container with a separate network namespace, blocking all traffic within
that namespace is straightforward. If CRIU crashes or is killed during a
dump, we simply unblock the traffic, and everything continues to
function without needing to manage individual connections.

As I mentioned before, I have no objections to the idea itself, moreover
it looks quite rational and will make the code safer regarding undesired
side effects/race conditions. Thanks for working on that.

Thanks,
Andrei

Reply via email to