Thanks Eric for looking at this; sorry for the duplicate email as I failed to reply all; comments below.
On Tue, Apr 7, 2026 at 6:09=E2=80=AFPM Eric Dumazet <[email protected]> w= rote: > On Tue, Apr 7, 2026 at 5:28=E2=80=AFPM Neil Spring <[email protected]> wr= ote: > > > > Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB, > > and spurious-retrans paths so that the next transmit triggers a fresh > > route lookup. Propagate sk_txhash into fl6->mp_hash in > > inet6_csk_route_req() and inet6_csk_route_socket() so > > fib6_select_path() uses the socket's current hash for ECMP selection. > > > > The ir_iif update in tcp_check_req() covers both IPv4 and IPv6 > > because it was cleaner than gating on address family; IPv4 is > > otherwise unaltered, and not having autoflowlabel in IPv4 means > > I wouldn't expect a new path on timeout. > > > > It is possible that PLB does not need this (that there are other > > methods of reacting to local congestion); I added the sk_dst_reset > > for consistency. > > > > Signed-off-by: Neil Spring <[email protected]> > > --- > > net/ipv4/tcp_input.c | 4 +++- > > net/ipv4/tcp_minisocks.c | 9 +++++++++ > > net/ipv4/tcp_plb.c | 1 + > > net/ipv4/tcp_timer.c | 1 + > > net/ipv6/inet6_connection_sock.c | 8 ++++++++ > > 5 files changed, 22 insertions(+), 1 deletion(-) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 7171442c3ed7..3d42ab45066c 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock= *sk, > > skb->protocol =3D=3D htons(ETH_P_IPV6) && > > (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=3D > > ntohl(ip6_flowlabel(ipv6_hdr(skb)))) && > > - sk_rethink_txhash(sk)) > > + sk_rethink_txhash(sk)) { > > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAR= EHASH); > > + sk_dst_reset(sk); > > + } > > > > /* Save last flowlabel after a spurious retrans. */ > > tcp_save_lrcv_flowlabel(sk, skb); > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > index 199f0b579e89..ef4b3771e9d8 100644 > > --- a/net/ipv4/tcp_minisocks.c > > +++ b/net/ipv4/tcp_minisocks.c > > @@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct= sk_buff *skb, > > * Reset timer after retransmitting SYNACK, similar to > > * the idea of fast retransmit in recovery. > > */ > > + > > What is the following part doing? > tcp_v6_init_req() uses something quite different before setting ir_iif > A comment explaining the rationale would be nice. Comment added in v2. The behavior I thought I observed was that the ir_iif goes into fl6->flowi6_oif in inet6_csk_route_req(), limiting options to any next hop on that interface, so if we received a retransmitted syn on a new interface, we should try to use the new interface for the response. > > > +#if IS_ENABLED(CONFIG_IPV6) > > + if (sk->sk_family =3D=3D AF_INET6) > > + inet_rsk(req)->ir_iif =3D tcp_v6_iif(skb); > > + else > > +#endif > > + inet_rsk(req)->ir_iif =3D > > + inet_request_bound_dev_if(sk, skb); > > + > > if (!tcp_oow_rate_limited(sock_net(sk), skb, > > LINUX_MIB_TCPACKSKIPPEDSYNREC= V, > > &tcp_rsk(req)->last_oow_ack_t= ime)) { > > diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c > > index 68ccdb9a5412..d7cc00a58e53 100644 > > --- a/net/ipv4/tcp_plb.c > > +++ b/net/ipv4/tcp_plb.c > > @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp= _plb_state *plb) > > return; > > > > sk_rethink_txhash(sk); > > + sk_dst_reset(sk); > > plb->consec_cong_rounds =3D 0; > > tcp_sk(sk)->plb_rehash++; > > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH); > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c > > index ea99988795e7..acc22fc532c2 100644 > > --- a/net/ipv4/tcp_timer.c > > +++ b/net/ipv4/tcp_timer.c > > @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk) > > if (sk_rethink_txhash(sk)) { > > tp->timeout_rehash++; > > __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHAS= H); > > + sk_dst_reset(sk); > > } > > > > return 0; > > diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connecti= on_sock.c > > index 37534e116899..2fe753bb38b4 100644 > > --- a/net/ipv6/inet6_connection_sock.c > > +++ b/net/ipv6/inet6_connection_sock.c > > @@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct s= ock *sk, > > fl6->flowi6_uid =3D sk_uid(sk); > > security_req_classify_flow(req, flowi6_to_flowi_common(fl6)); > > > > + if (req->num_retrans) > > + fl6->mp_hash =3D jhash_1word(req->num_retrans, > > + (__force u32)ireq->ir_rmt_po= rt) > > + >> 1; > > Why not setting mp_hash to sk_txhash ? To be honest, I was trying to avoid adding #include <tcp.h> to the top of inet6_connection_sock.c, which is needed to get tcp_rsk(req)->txhash, but that does make better code, using that in v2. > Why are you using ">> 1" ? The >> 1 is due to rt6_multipath_hash returning its computed value shifted right one before going into mp_hash in route.c sites. I understood mp_hash can't have the high bit set and still work due to fib6_select_path treating it as a signed integer when comparing to fib_nh_upper_bound. > rt6_multipath_hash() seems to be bypassed, it might be time to add a > comment there > explaining that mp_hash needs to be 31-bit only... > > Perhaps use rt6_multipath_hash() and expand it to use a socket pointer > to retrieve sk->sk_txhash when/if possible > instead of yet another flow dissection. I think thyour comment above (grabbing txhash) avoids the dissection. > > > + > > if (!dst) { > > dst =3D ip6_dst_lookup_flow(sock_net(sk), sk, fl6, fina= l_p); > > if (IS_ERR(dst)) > > @@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock = *sk, > > fl6->saddr =3D np->saddr; > > fl6->flowlabel =3D np->flow_label; > > IP6_ECN_flow_xmit(sk, fl6->flowlabel); > > + > > + if (sk->sk_txhash) > > + fl6->mp_hash =3D sk->sk_txhash >> 1; > > Seems inconsistent, and same question about the right shift. Adding comment and standardizing on * tcp_rsk(req)->txhash >> 1 for request sockets, * sk->sk_txhash >> 1 for established sockets. > > > > fl6->flowi6_oif =3D sk->sk_bound_dev_if; > > fl6->flowi6_mark =3D sk->sk_mark; > > fl6->fl6_sport =3D inet->inet_sport; > > -- > > 2.52.0 > > On Tue, Apr 7, 2026 at 6:09 PM Eric Dumazet <[email protected]> wrote: > > > > On Tue, Apr 7, 2026 at 5:28 PM Neil Spring <[email protected]> wrote: > > > > Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB, > > and spurious-retrans paths so that the next transmit triggers a fresh > > route lookup. Propagate sk_txhash into fl6->mp_hash in > > inet6_csk_route_req() and inet6_csk_route_socket() so > > fib6_select_path() uses the socket's current hash for ECMP selection. > > > > The ir_iif update in tcp_check_req() covers both IPv4 and IPv6 > > because it was cleaner than gating on address family; IPv4 is > > otherwise unaltered, and not having autoflowlabel in IPv4 means > > I wouldn't expect a new path on timeout. > > > > It is possible that PLB does not need this (that there are other > > methods of reacting to local congestion); I added the sk_dst_reset > > for consistency. > > > > Signed-off-by: Neil Spring <[email protected]> > > --- > > net/ipv4/tcp_input.c | 4 +++- > > net/ipv4/tcp_minisocks.c | 9 +++++++++ > > net/ipv4/tcp_plb.c | 1 + > > net/ipv4/tcp_timer.c | 1 + > > net/ipv6/inet6_connection_sock.c | 8 ++++++++ > > 5 files changed, 22 insertions(+), 1 deletion(-) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 7171442c3ed7..3d42ab45066c 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk, > > skb->protocol == htons(ETH_P_IPV6) && > > (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel != > > ntohl(ip6_flowlabel(ipv6_hdr(skb)))) && > > - sk_rethink_txhash(sk)) > > + sk_rethink_txhash(sk)) { > > NET_INC_STATS(sock_net(sk), > > LINUX_MIB_TCPDUPLICATEDATAREHASH); > > + sk_dst_reset(sk); > > + } > > > > /* Save last flowlabel after a spurious retrans. */ > > tcp_save_lrcv_flowlabel(sk, skb); > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > > index 199f0b579e89..ef4b3771e9d8 100644 > > --- a/net/ipv4/tcp_minisocks.c > > +++ b/net/ipv4/tcp_minisocks.c > > @@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct > > sk_buff *skb, > > * Reset timer after retransmitting SYNACK, similar to > > * the idea of fast retransmit in recovery. > > */ > > + > > What is the following part doing? > tcp_v6_init_req() uses something quite different before setting ir_iif > A comment explaining the rationale would be nice. > > > +#if IS_ENABLED(CONFIG_IPV6) > > + if (sk->sk_family == AF_INET6) > > + inet_rsk(req)->ir_iif = tcp_v6_iif(skb); > > + else > > +#endif > > + inet_rsk(req)->ir_iif = > > + inet_request_bound_dev_if(sk, skb); > > + > > if (!tcp_oow_rate_limited(sock_net(sk), skb, > > LINUX_MIB_TCPACKSKIPPEDSYNRECV, > > > > &tcp_rsk(req)->last_oow_ack_time)) { > > diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c > > index 68ccdb9a5412..d7cc00a58e53 100644 > > --- a/net/ipv4/tcp_plb.c > > +++ b/net/ipv4/tcp_plb.c > > @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct > > tcp_plb_state *plb) > > return; > > > > sk_rethink_txhash(sk); > > + sk_dst_reset(sk); > > plb->consec_cong_rounds = 0; > > tcp_sk(sk)->plb_rehash++; > > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH); > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c > > index ea99988795e7..acc22fc532c2 100644 > > --- a/net/ipv4/tcp_timer.c > > +++ b/net/ipv4/tcp_timer.c > > @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk) > > if (sk_rethink_txhash(sk)) { > > tp->timeout_rehash++; > > __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH); > > + sk_dst_reset(sk); > > } > > > > return 0; > > diff --git a/net/ipv6/inet6_connection_sock.c > > b/net/ipv6/inet6_connection_sock.c > > index 37534e116899..2fe753bb38b4 100644 > > --- a/net/ipv6/inet6_connection_sock.c > > +++ b/net/ipv6/inet6_connection_sock.c > > @@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct sock > > *sk, > > fl6->flowi6_uid = sk_uid(sk); > > security_req_classify_flow(req, flowi6_to_flowi_common(fl6)); > > > > + if (req->num_retrans) > > + fl6->mp_hash = jhash_1word(req->num_retrans, > > + (__force u32)ireq->ir_rmt_port) > > + >> 1; > > Why not setting mp_hash to sk_txhash ? > > Why are you using ">> 1" ? > > rt6_multipath_hash() seems to be bypassed, it might be time to add a > comment there > explaining that mp_hash needs to be 31-bit only... > > Perhaps use rt6_multipath_hash() and expand it to use a socket pointer > to retrieve sk->sk_txhash when/if possible > instead of yet another flow dissection. > > > + > > if (!dst) { > > dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p); > > if (IS_ERR(dst)) > > @@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk, > > fl6->saddr = np->saddr; > > fl6->flowlabel = np->flow_label; > > IP6_ECN_flow_xmit(sk, fl6->flowlabel); > > + > > + if (sk->sk_txhash) > > + fl6->mp_hash = sk->sk_txhash >> 1; > > Seems inconsistent, and same question about the right shift. > > > > fl6->flowi6_oif = sk->sk_bound_dev_if; > > fl6->flowi6_mark = sk->sk_mark; > > fl6->fl6_sport = inet->inet_sport; > > -- > > 2.52.0 > >
