Thanks for the patch! I've applied it on top of my 5.17.1 and I'm now getting a very very large number of:
[  188.616220] openvswitch: netlink: ovs_ct_free_action at depth 3

After just 3 minutes of uptime I'm seeing 230 such lines, but also:

[  108.726680] net_ratelimit: 6283 callbacks suppressed
[  114.515126] net_ratelimit: 3959 callbacks suppressed
[  120.523252] net_ratelimit: 4388 callbacks suppressed
[  127.458617] net_ratelimit: 2992 callbacks suppressed
[  133.543592] net_ratelimit: 2348 callbacks suppressed
[  140.099915] net_ratelimit: 5089 callbacks suppressed
[  145.116922] net_ratelimit: 6136 callbacks suppressed
[  150.551786] net_ratelimit: 5953 callbacks suppressed
[  155.555861] net_ratelimit: 5108 callbacks suppressed
[  162.740087] net_ratelimit: 6507 callbacks suppressed
[  168.059741] net_ratelimit: 3161 callbacks suppressed
[  173.092062] net_ratelimit: 4377 callbacks suppressed
[  178.104068] net_ratelimit: 3090 callbacks suppressed
[  183.135704] net_ratelimit: 2735 callbacks suppressed
[  188.616160] net_ratelimit: 2893 callbacks suppressed
[  193.628158] net_ratelimit: 4149 callbacks suppressed
[  199.128232] net_ratelimit: 3813 callbacks suppressed
[  204.136278] net_ratelimit: 2884 callbacks suppressed
[  211.443642] net_ratelimit: 5322 callbacks suppressed
[  216.643209] net_ratelimit: 5820 callbacks suppressed
[  221.648342] net_ratelimit: 4908 callbacks suppressed
[  226.656369] net_ratelimit: 4533 callbacks suppressed
[  231.680416] net_ratelimit: 4247 callbacks suppressed
[  237.172416] net_ratelimit: 2481 callbacks suppressed

I did a 30min test run like earlier and I'm now seeing a tiny increase in
kmalloc-256 of just 448KB :)
So that fix definitely did the trick to plug that leak!

Tested-by: Stéphane Graber <[email protected]>

Stéphane

On Wed, Mar 30, 2022 at 9:24 PM Ilya Maximets <[email protected]> wrote:
>
> On 3/31/22 03:00, Stéphane Graber wrote:
> > So it looks like my main issue and the one mentioned in the PS section
> > may actually be one and the same.
> >
> > I've run an experiment now where I monitor the normal leakage over
> > 30min, getting me around 1.1GiB of extra kmalloc-256.
> > I then repeated the experiment but with a firewall (good old iptables)
> > dropping all non-ICMP traffic headed to the OVN router addresses; this
> > reduced the leakage to just 89MiB.
> >
> > So it looks like that traffic is getting lost in a redirection loop
> > for a while in OVN/OVS, triggers the "openvswitch: ovs-system:
> > deferred action limit reached, drop recirc action" entry and also
> > leaks a bunch of memory in the process. We did attempt to pin the
> > issue to this cause earlier by generating some traffic which would hit
> > the recirc loop, but that didn't cause a very visible increase at the
> > time. Given the new data, however, our attempt at manually triggering
> > it on an otherwise unused network must have been flawed somehow.
> >
> > The amount of traffic dropped by the firewall is quite small too, which
> > suggests that things could be significantly worse.
> > In this case I saw a total of 3478 packets dropped for a total of
> > 183852 bytes.
> >
> > We'll be doing a bit more digging on the OVN side to see what may be
> > causing this and will report back.
> >
> > Stéphane
> >
> > On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
> >> 5.17.1 mainline kernel.
> >>
> >> I'm trying to debug a very very problematic kernel memory leak which
> >> is happening in this environment.
> >> Sadly I don't have a clear reproducer or much idea of when it first
> >> appeared. I went through about a year of metrics and some amount of
> >> leakage may always have been present; the magnitude of it just changed
> >> recently, making it such that my normal weekly server maintenance is
> >> now not frequent enough to take care of it.
> >>
> >> Basically what I'm doing is running a LXD + OVN setup on 3 servers;
> >> they all act as OVN chassis with various priorities to spread the
> >> load.
> >> All 3 servers also normally run a combination of containers and
> >> virtual machines attached to about a dozen different OVN networks.
> >>
> >> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub)
> >> which, after enabling slub debugging, can be tracked down to
> >> nf_ct_tmpl_alloc kernel calls such as those made by the openvswitch
> >> kernel module as part of its ovs_ct_copy_action function, which is
> >> exposed as OVS_ACTION_ATTR_CT to userspace through the openvswitch
> >> netlink API.
> >>
> >> This means that in just a couple of days I'm dealing with just shy of
> >> 40GiB of those kmalloc-256 entries.
> >> Here is one of the servers which has been running for just 6 hours:
> >>
> >> ```
> >> root@abydos:~# uptime
> >>  20:51:44 up  6:32,  1 user,  load average: 6.63, 5.95, 5.26
> >>
> >> root@abydos:~# slabtop -o -s c | head -n10
> >>  Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
> >>  Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
> >>  Active / Total Caches (% used)     : 150 / 197 (76.1%)
> >>  Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
> >>  Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
> >>
> >>      OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>  13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
> >>   2959671  2677549  90%    0.10K  75889       39    303556K buffer_head
> >>    460411   460411 100%    0.57K  16444       28    263104K radix_tree_node
> >>
> >> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces |
> >> sort -rn | head -n5
> >> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack]
> >> age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
> >>   803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408
> >> pid=0-683331 cpus=0-31 nodes=0-1
> >>    32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515
> >> pid=1-731302 cpus=0-31 nodes=0-1
> >>     3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386
> >> pid=752-648826 cpus=0-31 nodes=0-1
> >>     2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698
> >> cpus=0-31 nodes=0-1
> >>
> >> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces |
> >> sort -rn | head -n5
> >>  8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
> >>  2770915 reserve_sfa_size+0xdf/0x110 [openvswitch]
> >> age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
> >>  1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033
> >> cpus=0-31 nodes=0-1
> >>   288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack]
> >> age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
> >>   134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch]
> >> age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
> >> ```
> >>
> >> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k to
> >> nf_ct_tmpl_free.
> >>
> >> Things I've done so far to try to isolate the issue:
> >>
> >> 1) I've evacuated all workloads from the server, so the only thing
> >> running on it is OVS vswitchd. This did not change anything.
> >> 2) I've added iptables/ip6tables raw table rules marking all traffic
> >> as NOTRACK. This did not change anything.
> >> 3) I've played with the chassis assignment. This does change things in
> >> that a server with no active chassis will not show any leakage
> >> (thankfully). The busier the network I move back to the host, the
> >> faster the leakage.
> >>
> >> I've had both Frode and Tom (Cced) assist with a variety of ideas and
> >> questions, but while we have found some unrelated OVS and kernel
> >> issues, we've yet to figure this one out. So I wanted to reach out to
> >> the wider community to see if anyone has either seen something like
> >> this before or has suggestions as to where to look next.
> >>
> >> I can pretty easily rebuild the kernel, OVS or OVN, and while this
> >> cluster is a production environment, the fact that I can evacuate one
> >> of the three servers with no user-visible impact makes it not too bad
> >> to debug. Having to constantly reboot the entire setup to clear the
> >> memory leak is the bigger annoyance right now :)
> >>
> >> Stéphane
> >>
> >> PS: The side OVS/kernel issue I'm referring to is
> >> https://lore.kernel.org/netdev/[email protected]/
> >> which allowed us to track down an issue with OVN logical routers
> >> properly responding to ICMP on their external address but getting into
> >> a recirculation loop when any other kind of traffic is thrown at them
> >> (instead of being immediately dropped or rejected). The kernel change
> >> made it possible to track this down.
>
> Hi, Stéphane.
>
> Thanks for the report.
>
> I'm not sure if that's what you see, but I think that I found one
> pretty serious memory leak in the openvswitch module. In short,
> it leaks all the dynamically allocated memory for nested actions.
> E.g. if you have a datapath flow with actions:clone(ct(commit)), the
> structure allocated by nf_ct_tmpl_alloc for a nested ct() action will
> not be freed.
>
> Could you try the change below?
>
> It will additionally print a rate-limited error if a nested CT action
> is encountered:
>
>   openvswitch: netlink: ovs_ct_free_action at depth 2
>
> This message can be used to confirm the issue.
>
> The very last bit of the fix is not necessary, but I just spotted an
> additional formatting issue and fixed it.
>
> ---
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index 5176f6ccac8e..248c39da465e 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -2317,6 +2317,65 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size)
>  	return sfa;
>  }
>
> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> +					  int len, int depth);
> +
> +static void ovs_nla_free_check_pkt_len_action(const struct nlattr *action,
> +					      int depth)
> +{
> +	const struct nlattr *a;
> +	int rem;
> +
> +	nla_for_each_nested(a, action, rem) {
> +		switch (nla_type(a)) {
> +		case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL:
> +		case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER:
> +			ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a),
> +						      depth);
> +			break;
> +		}
> +	}
> +}
> +
> +static void ovs_nla_free_clone_action(const struct nlattr *action, int depth)
> +{
> +	const struct nlattr *a;
> +	int rem;
> +
> +	nla_for_each_nested(a, action, rem) {
> +		switch (nla_type(a)) {
> +		case OVS_CLONE_ATTR_EXEC:
> +			/* Real list of actions follows this attribute.
> +			 * Free them and return.
> +			 */
> +			a = nla_next(a, &rem);
> +			ovs_nla_free_flow_actions_nla(a, rem, depth);
> +			return;
> +		}
> +	}
> +}
> +
> +static void ovs_nla_free_dec_ttl_action(const struct nlattr *action, int depth)
> +{
> +	const struct nlattr *a = nla_data(action);
> +
> +	switch (nla_type(a)) {
> +	case OVS_DEC_TTL_ATTR_ACTION:
> +		ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> +		break;
> +	}
> +}
> +
> +static void ovs_nla_free_sample_action(const struct nlattr *action, int depth)
> +{
> +	const struct nlattr *a = nla_data(action);
> +
> +	switch (nla_type(a)) {
> +	case OVS_SAMPLE_ATTR_ARG:
> +		ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> +		break;
> +	}
> +}
> +
>  static void ovs_nla_free_set_action(const struct nlattr *a)
>  {
>  	const struct nlattr *ovs_key = nla_data(a);
> @@ -2330,25 +2389,55 @@ static void ovs_nla_free_set_action(const struct nlattr *a)
>  	}
>  }
>
> -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> +					  int len, int depth)
>  {
>  	const struct nlattr *a;
>  	int rem;
>
> -	if (!sf_acts)
> +	if (!actions)
>  		return;
>
> -	nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
> +	depth++;
> +
> +	nla_for_each_attr(a, actions, len, rem) {
>  		switch (nla_type(a)) {
> -		case OVS_ACTION_ATTR_SET:
> -			ovs_nla_free_set_action(a);
> +		case OVS_ACTION_ATTR_CHECK_PKT_LEN:
> +			ovs_nla_free_check_pkt_len_action(a, depth);
> +			break;
> +
> +		case OVS_ACTION_ATTR_CLONE:
> +			ovs_nla_free_clone_action(a, depth);
>  			break;
> +
>  		case OVS_ACTION_ATTR_CT:
> +			if (depth != 1)
> +				OVS_NLERR(true,
> +					  "ovs_ct_free_action at depth %d",
> +					  depth);
>  			ovs_ct_free_action(a);
>  			break;
> +
> +		case OVS_ACTION_ATTR_DEC_TTL:
> +			ovs_nla_free_dec_ttl_action(a, depth);
> +			break;
> +
> +		case OVS_ACTION_ATTR_SAMPLE:
> +			ovs_nla_free_sample_action(a, depth);
> +			break;
> +
> +		case OVS_ACTION_ATTR_SET:
> +			ovs_nla_free_set_action(a);
> +			break;
>  		}
>  	}
> +}
> +
> +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> +{
> +	if (!sf_acts)
> +		return;
>
> +	ovs_nla_free_flow_actions_nla(sf_acts->actions, sf_acts->actions_len, 0);
>  	kfree(sf_acts);
>  }
>
> @@ -3458,7 +3547,9 @@ static int clone_action_to_attr(const struct nlattr *attr,
>  	if (!start)
>  		return -EMSGSIZE;
>
> -	err = ovs_nla_put_actions(nla_data(attr), rem, skb);
> +	/* Skipping the OVS_CLONE_ATTR_EXEC that is always present. */
> +	attr = nla_next(nla_data(attr), &rem);
> +	err = ovs_nla_put_actions(attr, rem, skb);
>
>  	if (err)
>  		nla_nest_cancel(skb, start);
> ---
>
> Best regards, Ilya Maximets.
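For anyone trying to check whether a given host actually exercises the leaking
path, note that the leak described above only affects ct() actions nested
inside another action (clone(), sample(), check_pkt_len() or dec_ttl()). A
minimal sketch of how one might look for such flows, assuming the standard
ovs-dpctl/ovs-appctl tools are installed and that the datapath flow dump
prints nested actions in the textual clone(ct(...)) form:

```
# Dump the kernel datapath flows and keep only those whose action list has
# a ct() action nested inside another action; these are the flows whose
# nf_ct_tmpl_alloc allocations were never freed before this fix.
ovs-dpctl dump-flows | \
    grep -E 'actions:.*(clone|sample|check_pkt_len|dec_ttl)\(.*ct\('

# Same check through the running ovs-vswitchd, just counting matches:
ovs-appctl dpctl/dump-flows | \
    grep -cE '(clone|sample|check_pkt_len|dec_ttl)\(.*ct\('
```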

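And a sketch of the kind of before/after check used earlier in the thread to
verify that the leak is actually plugged, assuming the kernel has slab user
tracking enabled (e.g. booted with slub_debug=U, which is what populates the
alloc_traces/free_traces files quoted above):

```
# Overall size of the affected cache; with the leak present this grew by
# roughly 1.1GiB per 30 minutes in the report above, ~448KB with the fix.
grep '^kmalloc-256 ' /proc/slabinfo

# Allocation vs. free counts for the conntrack template objects.  On a
# healthy system these two numbers should stay close to each other rather
# than the alloc count running away as in the original report.
grep nf_ct_tmpl_alloc /sys/kernel/debug/slab/kmalloc-256/alloc_traces
grep nf_ct_tmpl_free  /sys/kernel/debug/slab/kmalloc-256/free_traces
```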