Re: [PATCH net-next 1/2] tcp: rehash onto different ECMP path on retransmit timeout

Neil Spring Wed, 08 Apr 2026 00:00:38 -0700

Thanks Eric for looking at this; sorry for the duplicate email as I
failed to reply all; comments below.


On Tue, Apr 7, 2026 at 6:09=E2=80=AFPM Eric Dumazet <[email protected]> w=
rote:
> On Tue, Apr 7, 2026 at 5:28=E2=80=AFPM Neil Spring <[email protected]> wr=
ote:
> >
> > Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
> > and spurious-retrans paths so that the next transmit triggers a fresh
> > route lookup.  Propagate sk_txhash into fl6->mp_hash in
> > inet6_csk_route_req() and inet6_csk_route_socket() so
> > fib6_select_path() uses the socket's current hash for ECMP selection.
> >
> > The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
> > because it was cleaner than gating on address family; IPv4 is
> > otherwise unaltered, and not having autoflowlabel in IPv4 means
> > I wouldn't expect a new path on timeout.
> >
> > It is possible that PLB does not need this (that there are other
> > methods of reacting to local congestion); I added the sk_dst_reset
> > for consistency.
> >
> > Signed-off-by: Neil Spring <[email protected]>
> > ---
> >  net/ipv4/tcp_input.c             | 4 +++-
> >  net/ipv4/tcp_minisocks.c         | 9 +++++++++
> >  net/ipv4/tcp_plb.c               | 1 +
> >  net/ipv4/tcp_timer.c             | 1 +
> >  net/ipv6/inet6_connection_sock.c | 8 ++++++++
> >  5 files changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 7171442c3ed7..3d42ab45066c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock=
 *sk,
> >             skb->protocol =3D=3D htons(ETH_P_IPV6) &&
> >             (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=3D
> >              ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
> > -           sk_rethink_txhash(sk))
> > +           sk_rethink_txhash(sk)) {
> >                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAR=
EHASH);
> > +               sk_dst_reset(sk);
> > +       }
> >
> >         /* Save last flowlabel after a spurious retrans. */
> >         tcp_save_lrcv_flowlabel(sk, skb);
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index 199f0b579e89..ef4b3771e9d8 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct=
 sk_buff *skb,
> >                  * Reset timer after retransmitting SYNACK, similar to
> >                  * the idea of fast retransmit in recovery.
> >                  */
> > +
>
> What is the following part doing?
> tcp_v6_init_req() uses something quite different before setting ir_iif
> A comment explaining the rationale would be nice.

Comment added in v2.  The behavior I thought I observed was that the
ir_iif goes into fl6->flowi6_oif in inet6_csk_route_req(), limiting options
to any next hop on that interface, so if we received a retransmitted syn
on a new interface, we should try to use the new interface for the
response.

>
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +               if (sk->sk_family =3D=3D AF_INET6)
> > +                       inet_rsk(req)->ir_iif =3D tcp_v6_iif(skb);
> > +               else
> > +#endif
> > +                       inet_rsk(req)->ir_iif =3D
> > +                               inet_request_bound_dev_if(sk, skb);
> > +
> >                 if (!tcp_oow_rate_limited(sock_net(sk), skb,
> >                                           LINUX_MIB_TCPACKSKIPPEDSYNREC=
V,
> >                                           &tcp_rsk(req)->last_oow_ack_t=
ime)) {
> > diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
> > index 68ccdb9a5412..d7cc00a58e53 100644
> > --- a/net/ipv4/tcp_plb.c
> > +++ b/net/ipv4/tcp_plb.c
> > @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp=
_plb_state *plb)
> >                 return;
> >
> >         sk_rethink_txhash(sk);
> > +       sk_dst_reset(sk);
> >         plb->consec_cong_rounds =3D 0;
> >         tcp_sk(sk)->plb_rehash++;
> >         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index ea99988795e7..acc22fc532c2 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
> >         if (sk_rethink_txhash(sk)) {
> >                 tp->timeout_rehash++;
> >                 __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHAS=
H);
> > +               sk_dst_reset(sk);
> >         }
> >
> >         return 0;
> > diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connecti=
on_sock.c
> > index 37534e116899..2fe753bb38b4 100644
> > --- a/net/ipv6/inet6_connection_sock.c
> > +++ b/net/ipv6/inet6_connection_sock.c
> > @@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct s=
ock *sk,
> >         fl6->flowi6_uid =3D sk_uid(sk);
> >         security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
> >
> > +       if (req->num_retrans)
> > +               fl6->mp_hash =3D jhash_1word(req->num_retrans,
> > +                                          (__force u32)ireq->ir_rmt_po=
rt)
> > +                               >> 1;
>
> Why not setting mp_hash to sk_txhash ?

To be honest, I was trying to avoid adding #include <tcp.h> to the top of
inet6_connection_sock.c, which is needed to get tcp_rsk(req)->txhash, but
that does make better code, using that in v2.

> Why are you using ">> 1" ?

The >> 1 is due to rt6_multipath_hash returning its computed
value shifted right one before going into mp_hash in route.c sites.
I understood mp_hash can't have the high bit set and still work
due to fib6_select_path treating it as a signed integer when comparing
to fib_nh_upper_bound.

> rt6_multipath_hash() seems to be bypassed, it might be time to add a
> comment there
> explaining that mp_hash needs to be 31-bit only...
>
> Perhaps use rt6_multipath_hash() and expand it to use a socket pointer
> to retrieve sk->sk_txhash when/if possible
> instead of yet another flow dissection.

I think thyour comment above (grabbing txhash) avoids the dissection.

>
> > +
> >         if (!dst) {
> >                 dst =3D ip6_dst_lookup_flow(sock_net(sk), sk, fl6, fina=
l_p);
> >                 if (IS_ERR(dst))
> > @@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock =
*sk,
> >         fl6->saddr =3D np->saddr;
> >         fl6->flowlabel =3D np->flow_label;
> >         IP6_ECN_flow_xmit(sk, fl6->flowlabel);
> > +
> > +       if (sk->sk_txhash)
> > +               fl6->mp_hash =3D sk->sk_txhash >> 1;
>
> Seems inconsistent, and same question about the right shift.

Adding comment and standardizing on
* tcp_rsk(req)->txhash >> 1 for request sockets,
* sk->sk_txhash >> 1 for established sockets.

>
>
> >         fl6->flowi6_oif =3D sk->sk_bound_dev_if;
> >         fl6->flowi6_mark =3D sk->sk_mark;
> >         fl6->fl6_sport =3D inet->inet_sport;
> > --
> > 2.52.0
> >

On Tue, Apr 7, 2026 at 6:09 PM Eric Dumazet <[email protected]> wrote:
>
> >
> On Tue, Apr 7, 2026 at 5:28 PM Neil Spring <[email protected]> wrote:
> >
> > Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
> > and spurious-retrans paths so that the next transmit triggers a fresh
> > route lookup.  Propagate sk_txhash into fl6->mp_hash in
> > inet6_csk_route_req() and inet6_csk_route_socket() so
> > fib6_select_path() uses the socket's current hash for ECMP selection.
> >
> > The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
> > because it was cleaner than gating on address family; IPv4 is
> > otherwise unaltered, and not having autoflowlabel in IPv4 means
> > I wouldn't expect a new path on timeout.
> >
> > It is possible that PLB does not need this (that there are other
> > methods of reacting to local congestion); I added the sk_dst_reset
> > for consistency.
> >
> > Signed-off-by: Neil Spring <[email protected]>
> > ---
> >  net/ipv4/tcp_input.c             | 4 +++-
> >  net/ipv4/tcp_minisocks.c         | 9 +++++++++
> >  net/ipv4/tcp_plb.c               | 1 +
> >  net/ipv4/tcp_timer.c             | 1 +
> >  net/ipv6/inet6_connection_sock.c | 8 ++++++++
> >  5 files changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 7171442c3ed7..3d42ab45066c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
> >             skb->protocol == htons(ETH_P_IPV6) &&
> >             (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
> >              ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
> > -           sk_rethink_txhash(sk))
> > +           sk_rethink_txhash(sk)) {
> >                 NET_INC_STATS(sock_net(sk), 
> > LINUX_MIB_TCPDUPLICATEDATAREHASH);
> > +               sk_dst_reset(sk);
> > +       }
> >
> >         /* Save last flowlabel after a spurious retrans. */
> >         tcp_save_lrcv_flowlabel(sk, skb);
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index 199f0b579e89..ef4b3771e9d8 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct 
> > sk_buff *skb,
> >                  * Reset timer after retransmitting SYNACK, similar to
> >                  * the idea of fast retransmit in recovery.
> >                  */
> > +
>
> What is the following part doing?
> tcp_v6_init_req() uses something quite different before setting ir_iif
> A comment explaining the rationale would be nice.
>
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +               if (sk->sk_family == AF_INET6)
> > +                       inet_rsk(req)->ir_iif = tcp_v6_iif(skb);
> > +               else
> > +#endif
> > +                       inet_rsk(req)->ir_iif =
> > +                               inet_request_bound_dev_if(sk, skb);
> > +
> >                 if (!tcp_oow_rate_limited(sock_net(sk), skb,
> >                                           LINUX_MIB_TCPACKSKIPPEDSYNRECV,
> >                                           
> > &tcp_rsk(req)->last_oow_ack_time)) {
> > diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
> > index 68ccdb9a5412..d7cc00a58e53 100644
> > --- a/net/ipv4/tcp_plb.c
> > +++ b/net/ipv4/tcp_plb.c
> > @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct 
> > tcp_plb_state *plb)
> >                 return;
> >
> >         sk_rethink_txhash(sk);
> > +       sk_dst_reset(sk);
> >         plb->consec_cong_rounds = 0;
> >         tcp_sk(sk)->plb_rehash++;
> >         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index ea99988795e7..acc22fc532c2 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
> >         if (sk_rethink_txhash(sk)) {
> >                 tp->timeout_rehash++;
> >                 __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
> > +               sk_dst_reset(sk);
> >         }
> >
> >         return 0;
> > diff --git a/net/ipv6/inet6_connection_sock.c 
> > b/net/ipv6/inet6_connection_sock.c
> > index 37534e116899..2fe753bb38b4 100644
> > --- a/net/ipv6/inet6_connection_sock.c
> > +++ b/net/ipv6/inet6_connection_sock.c
> > @@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct sock 
> > *sk,
> >         fl6->flowi6_uid = sk_uid(sk);
> >         security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
> >
> > +       if (req->num_retrans)
> > +               fl6->mp_hash = jhash_1word(req->num_retrans,
> > +                                          (__force u32)ireq->ir_rmt_port)
> > +                               >> 1;
>
> Why not setting mp_hash to sk_txhash ?
>
> Why are you using ">> 1" ?
>
> rt6_multipath_hash() seems to be bypassed, it might be time to add a
> comment there
> explaining that mp_hash needs to be 31-bit only...
>
> Perhaps use rt6_multipath_hash() and expand it to use a socket pointer
> to retrieve sk->sk_txhash when/if possible
> instead of yet another flow dissection.
>
> > +
> >         if (!dst) {
> >                 dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
> >                 if (IS_ERR(dst))
> > @@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
> >         fl6->saddr = np->saddr;
> >         fl6->flowlabel = np->flow_label;
> >         IP6_ECN_flow_xmit(sk, fl6->flowlabel);
> > +
> > +       if (sk->sk_txhash)
> > +               fl6->mp_hash = sk->sk_txhash >> 1;
>
> Seems inconsistent, and same question about the right shift.
>
>
> >         fl6->flowi6_oif = sk->sk_bound_dev_if;
> >         fl6->flowi6_mark = sk->sk_mark;
> >         fl6->fl6_sport = inet->inet_sport;
> > --
> > 2.52.0
> >

Re: [PATCH net-next 1/2] tcp: rehash onto different ECMP path on retransmit timeout

Reply via email to