On Sun, Jun 14, 2026 at 6:54 AM Paolo Abeni <[email protected]> wrote: > > > > On 6/12/26 3:00 AM, Neil Spring wrote: > > Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB, > > and spurious-retransmission events, but the cached route is reused and > > the new hash is not propagated into the ECMP path selection logic. Two > > changes are needed to make rehash select a different local ECMP path: > > > > 1. Add __sk_dst_reset() alongside sk_rethink_txhash() in > > tcp_write_timeout(), tcp_rcv_spurious_retrans(), and > > tcp_plb_check_rehash() so the cached dst is invalidated and the > > next transmit triggers a fresh route lookup. > > > > 2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for > > SYN/ACK retransmits and syncookies) in tcp_v6_connect(), > > inet6_sk_rebuild_header(), inet6_csk_route_req(), > > inet6_csk_route_socket(), tcp_v6_send_response(), and > > cookie_v6_check() so fib6_select_path() picks a path based on the > > new hash. > > > > The mp_hash override only applies to fib_multipath_hash_policy 0 (the > > default L3 policy). Its hash includes the flow label, but that is 0 by > > default -- np->flow_label is unset, and auto_flowlabels only computes > > the on-wire label later, per packet -- so flows to the same peer share > > one local path. Keying the hash on sk_txhash makes the local path > > per-connection and lets a rehash re-select it. Policies 1-3 are left > > unchanged. > > > > The mp_hash assignment is factored into a small helper, > > ip6_ecmp_set_mp_hash(), shared by inet6_csk_route_req(), > > inet6_csk_route_socket(), tcp_v6_connect(), inet6_sk_rebuild_header(), > > tcp_v6_send_response(), and cookie_v6_check(). It applies > > (txhash >> 1) ?: 1 for policy 0 (the >> 1 keeps mp_hash in the 31-bit > > range; ?: 1 keeps it non-zero, since 0 would fall back to > > rt6_multipath_hash()). inet6_csk_route_socket() calls it only for > > sk_protocol == IPPROTO_TCP so that non-TCP callers (e.g., L2TP via > > inet6_csk_xmit) fall through to rt6_multipath_hash() and retain their > > existing flow-key-based ECMP behavior. > > > > tcp_v6_send_response() also sets mp_hash from the response txhash so > > that a control packet (a RST from the full socket, or an ACK from a > > time-wait socket) selects the same local ECMP nexthop as the > > connection's txhash rather than falling back to the flow hash. The > > time-wait socket's tw_txhash is copied from sk_txhash when the > > connection enters TIME_WAIT, so it reflects any rehash that occurred. > > > > Setting mp_hash explicitly is necessary because the default ECMP hash > > derives from fl6->flowlabel via np->flow_label, which is not updated > > from sk_txhash (REPFLOW is off by default). ip6_make_flowlabel() > > cannot help either, as it runs after the route lookup. > > > > As a consequence, for policy 0 the local ECMP path of an IPv6 TCP > > flow follows sk_txhash even when fl6->flowlabel is non-zero, e.g. a > > reflected (REPFLOW) or explicitly set (IPV6_FLOWLABEL_MGR) flow > > label. This is intentional: only local path selection changes, so > > rehash can recover from a failed path; the on-wire flow label is > > unchanged. > > > > sk_set_txhash() is moved before ip6_dst_lookup_flow() in > > tcp_v6_connect() so the initial ECMP path is selected by the same > > txhash that subsequent route rebuilds will use. This avoids > > unintended path changes when the cached dst is naturally invalidated > > (e.g., by PMTU discovery or route changes). > > > > The rehash sites (tcp_write_timeout(), tcp_plb_check_rehash(), and > > tcp_rcv_spurious_retrans()) call __sk_rethink_txhash_reset_dst(), > > which re-rolls the txhash and, when it changed, drops the cached dst > > so the next transmit re-runs route selection. The dst reset is > > guarded by sk->sk_family == AF_INET6 since IPv4 ECMP does not > > currently use sk_txhash for path selection. For IPv4-mapped IPv6 > > sockets this produces a redundant dst reset on a cold path > > (RTO/PLB); the subsequent IPv4 route lookup returns the same result. > > The helper is deliberately separate from sk_rethink_txhash() itself: > > dst_negative_advice() calls sk_rethink_txhash() before its own dst op, > > so resetting the dst inside sk_rethink_txhash() would skip that op > > (e.g. rt6_remove_exception_rt()). > > > > For syncookies, cookie_init_sequence() computes the cookie value > > before route_req() and sets txhash so the SYN-ACK selects the same > > ECMP path that cookie_v6_check() will use when the full socket is > > created. cookie_tcp_reqsk_init() derives txhash from the cookie so > > the full socket's ECMP path matches the SYN-ACK. Both the SYN-ACK > > assignment in tcp_conn_request() and the full-socket assignment in > > cookie_tcp_reqsk_init() are keyed on the packet family > > (skb->protocol == ETH_P_IPV6), not sk->sk_family: a dual-stack > > AF_INET6 listener also serves IPv4 connections, and the v4 cookie has > > mssind bits that would bias TX queue distribution if used as txhash. > > IPv4 connections retain net_tx_rndhash(). > > > > cookie_init_sequence() is split from the former version that also > > called tcp_synq_overflow() and incremented SYNCOOKIESSENT; those > > side effects are now in cookie_record_sent(), called after > > route_req() succeeds so they are not bumped when route_req() fails. > > cookie_record_sent() is guarded by CONFIG_SYN_COOKIES to > > match the guard on tcp_synq_overflow(). route_req() receives 0 as > > tw_isn for the syncookie path so that tcp_v6_init_req() still saves > > ireq->pktopts for REPFLOW flowlabel reflection and IPv6 cmsg > > options. The ecn_ok clear for syncookies without timestamps stays > > after tcp_ecn_create_request() so it takes precedence. > > > > Signed-off-by: Neil Spring <[email protected]> > > This looks good to me, with a minor commit below. > > > diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c > > index df479277fb80..cc71d84df42b 100644 > > --- a/net/ipv4/syncookies.c > > +++ b/net/ipv4/syncookies.c > > @@ -280,9 +280,18 @@ static int cookie_tcp_reqsk_init(struct sock *sk, > > struct sk_buff *skb, > > treq->snt_synack = 0; > > treq->snt_tsval_first = 0; > > treq->tfo_listener = false; > > - treq->txhash = net_tx_rndhash(); > > treq->rcv_isn = ntohl(th->seq) - 1; > > treq->snt_isn = ntohl(th->ack_seq) - 1; > > + if (skb->protocol == htons(ETH_P_IPV6)) { > > + /* Use the cookie as txhash so the ECMP path matches > > + * the SYN-ACK, where txhash was also set to the > > + * cookie. The original request socket (and its > > + * txhash) was freed after sending the SYN-ACK. > > + */ > > + treq->txhash = treq->snt_isn; > > + } else { > > + treq->txhash = net_tx_rndhash(); > > I'm wondering if it would make sense always using snt_isn for txhash in > the syn cookie case, regardless of the IP protocol. Beyond reducing the > differences between ipv4 and ipv6 it will make the code a little simpler.
I added the conditional in v11 after the AI review poked at how this wasn't plumbed for IPv4; it seemed safer and clearer about v6-vs-v4 use of txhash, but I wasn't weighing a goal of avoiding protocol-specific paths. I will try no-conditional again in v14 with this feedback, happy to change back if anyone has a strong opinion. -neil > > Not a blocker in any case. > > Still I think this could deserve an explicit ack from Eric. > > /P >
