Hello,On Mon, 24 Nov 2025, Pablo Neira Ayuso wrote: > On Sun, Oct 19, 2025 at 06:57:00PM +0300, Julian Anastasov wrote: > > Some places walk the services under mutex but they can just use RCU: > > > > * ip_vs_dst_event() uses ip_vs_forget_dev() which uses its own lock > > to modify dest > > * ip_vs_genl_dump_services(): ip_vs_genl_fill_service() just fills skb > > * ip_vs_genl_parse_service(): move RCU lock to callers > > ip_vs_genl_set_cmd(), ip_vs_genl_dump_dests() and ip_vs_genl_get_cmd() > > * ip_vs_genl_dump_dests(): just fill skb > > > > Signed-off-by: Julian Anastasov <[email protected]> > > --- > > net/netfilter/ipvs/ip_vs_ctl.c | 47 +++++++++++++++++----------------- > > 1 file changed, 23 insertions(+), 24 deletions(-) > > > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > > index 2fb9034b4f53..b18d08d79bcb 100644 > > --- a/net/netfilter/ipvs/ip_vs_ctl.c > > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > > @@ -1759,23 +1759,21 @@ static int ip_vs_dst_event(struct notifier_block > > *this, unsigned long event, > > if (event != NETDEV_DOWN || !ipvs) > > return NOTIFY_DONE; > > IP_VS_DBG(3, "%s() dev=%s\n", __func__, dev->name); > > - mutex_lock(&ipvs->service_mutex); > > + rcu_read_lock(); > First, thanks for the review! > Control plane can still add destinations to svc->destinations that can > be skipped by the rcu walk. I think it should be possible to trigger > leaks with a sufficiently large list, given concurrent updates can > happen. ip_vs_forget_dev() has its own lock, but this is a per-dest > lock which is taken during the list walk. Even if dest is added, its dest_dst will be allocated on next packet. And this is not the real problem. The race is different: the first flow clears __LINK_STATE_START, IFF_UP and then notifies for NETDEV_DOWN ip_vs_dst_event() and then FIB. During ip_vs_dst_event() device is down but the routes are not dead yet. The second flow can call do_output_route4() and create again dest->dest_dst because the routes still work, see fib_lookup_good_nhc(). Even adding new dest can trigger this. It can happen even immediately after a mutex unlock in ip_vs_dst_event(). And the problem is that during the notification phase there is a gap where we can attach new routes while the device is marked down but it is not yet propagated to FIB. As we hold implicit reference to dev via dst, if due to race we miss to drop a route, the dev will be replaced with blackhole_netdev on NETDEV_UNREGISTER. So, the leak is until UNREGISTER or until next packet that will catch the dead route in __ip_vs_dst_check(). And if dest is deleted we always drop this route in __ip_vs_dst_cache_reset(), the leak is gone. One way to make this rt caching more robust is to add a netif_running() check together with the IP_VS_DEST_F_AVAILABLE check that is under dst_lock after the do_output_route4() call. By this way we synchronize with both dest deletion and device closing. I'll probably extend the patch with such change in the next days, including the same for IPv6: diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index d57af95f1ebd..7b356ab8f439 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -295,6 +295,12 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs, return true; } +/* rt has device that is down */ +static bool rt_dev_is_down(const struct net_device *dev) +{ + return dev && !netif_running(dev); +} + /* Get route to destination or remote server */ static int __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb, @@ -335,7 +341,8 @@ __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, struct sk_buff *skb, * for very short period and it must be checked under * dst_lock. */ - if (dest->flags & IP_VS_DEST_F_AVAILABLE) + if (dest->flags & IP_VS_DEST_F_AVAILABLE && + !rt_dev_is_down(rt->dst.dev)) __ip_vs_dst_set(dest, dest_dst, &rt->dst, 0); else noref = 0; > > @@ -3915,9 +3911,12 @@ static int ip_vs_genl_set_cmd(struct sk_buff *skb, > > struct genl_info *info) > > if (cmd == IPVS_CMD_NEW_SERVICE || cmd == IPVS_CMD_SET_SERVICE) > > need_full_svc = true; > > > > + /* We use function that requires RCU lock */ > > + rcu_read_lock(); > > This is ip_vs_genl_set_cmd path and the new per-netns mutex is held. > > I think __ip_vs_service_find() can now just access this mutex to check > if it is held, using fourth parameter: > > list_for_each_entry_rcu(..., lockdep_is_held(&ipvs->service_mutex)) > > Then this rcu_read_lock() after mutex_lock(&ipvs->service_mutex) can > be removed. I suspect you added it to quiet a rcu debugging splat. Yep, that is why the above comment. I'll check this... Regards -- Julian Anastasov <[email protected]>
