Thanks Frode, can those drop rules be added in such a way that they don't prevent an OVN load balancer or DNAT entry on the logical router's IP from working?
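For concreteness, the kind of configuration that would need to keep working is roughly the following (a sketch only: the router, load balancer, and backend names/addresses are made up, and by default the ovn-nbctl commands are printed rather than executed):

```shell
# Sketch of an OVN load balancer and DNAT entry on a logical router's
# external IP. Names and addresses are illustrative. With DRY_RUN=1
# (the default) the ovn-nbctl commands are only printed for review.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "ovn-nbctl $*"
    else
        ovn-nbctl "$@"
    fi
}

# dnat_and_snat entry on the router's external address:
run lr-nat-add lr0 dnat_and_snat 45.45.148.136 10.0.0.10

# Load balancer VIP on the same external address:
run lb-add lb0 45.45.148.136:443 10.0.0.11:443,10.0.0.12:443
run lr-lb-add lr0 lb0
```

Any blanket drop of TCP/UDP traffic aimed at the LRP address would presumably need to exempt VIP and DNAT ports configured this way.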
Thanks,
Tom

On Fri, 1 Apr 2022 at 06:35, Frode Nordahl <[email protected]> wrote:
> On Thu, Mar 31, 2022 at 5:06 PM Stéphane Graber <[email protected]> wrote:
> >
> > On Thu, Mar 31, 2022 at 6:55 AM Ilya Maximets <[email protected]> wrote:
> > >
> > > On 3/31/22 06:24, Stéphane Graber wrote:
> > > > Thanks for the patch!
> > > >
> > > > I've applied it on top of my 5.17.1 and I'm now getting a very very
> > > > large number of:
> > > >
> > > > [ 188.616220] openvswitch: netlink: ovs_ct_free_action at depth 3
> > > >
> > > > After just 3 minutes of uptime I'm seeing 230 such lines but also:
> > > >
> > > > [ 108.726680] net_ratelimit: 6283 callbacks suppressed
> > > > [ 114.515126] net_ratelimit: 3959 callbacks suppressed
> > > > [ 120.523252] net_ratelimit: 4388 callbacks suppressed
> > > > [ 127.458617] net_ratelimit: 2992 callbacks suppressed
> > > > [ 133.543592] net_ratelimit: 2348 callbacks suppressed
> > > > [ 140.099915] net_ratelimit: 5089 callbacks suppressed
> > > > [ 145.116922] net_ratelimit: 6136 callbacks suppressed
> > > > [ 150.551786] net_ratelimit: 5953 callbacks suppressed
> > > > [ 155.555861] net_ratelimit: 5108 callbacks suppressed
> > > > [ 162.740087] net_ratelimit: 6507 callbacks suppressed
> > > > [ 168.059741] net_ratelimit: 3161 callbacks suppressed
> > > > [ 173.092062] net_ratelimit: 4377 callbacks suppressed
> > > > [ 178.104068] net_ratelimit: 3090 callbacks suppressed
> > > > [ 183.135704] net_ratelimit: 2735 callbacks suppressed
> > > > [ 188.616160] net_ratelimit: 2893 callbacks suppressed
> > > > [ 193.628158] net_ratelimit: 4149 callbacks suppressed
> > > > [ 199.128232] net_ratelimit: 3813 callbacks suppressed
> > > > [ 204.136278] net_ratelimit: 2884 callbacks suppressed
> > > > [ 211.443642] net_ratelimit: 5322 callbacks suppressed
> > > > [ 216.643209] net_ratelimit: 5820 callbacks suppressed
> > > > [ 221.648342] net_ratelimit: 4908 callbacks suppressed
> > > > [ 226.656369] net_ratelimit: 4533 callbacks suppressed
> > > > [ 231.680416] net_ratelimit: 4247 callbacks suppressed
> > > > [ 237.172416] net_ratelimit: 2481 callbacks suppressed
> > > >
> > > > I did a 30min test run like earlier and I'm now seeing a tiny increase
> > > > in kmalloc-256 of just 448KB :)
> > > >
> > > > So that fix definitely did the trick to plug that leak!
> > > >
> > > > Tested-by: Stéphane Graber <[email protected]>
> > >
> > > Thanks!
> > >
> > > That fast flow rotation on a system, though, is not a healthy sign.
> > > But, at least, it doesn't leak the memory anymore.
> > >
> > > May I ask you to find the datapath flow that triggers the issue?
> > > I wasn't able to reproduce it with OVN system tests, so I'm curious
> > > what it looks like. The flow should have a nested clone() or
> > > check_pkt_len() action with the ct() action inside. 'depth 3'
> > > suggests that it is double nested, i.e. clone(...clone(...ct())) or
> > > clone(...check_pkt_len(...ct())) or something similar. ovs-dpctl
> > > can be used to dump flows.
> >
> > So I've run (a rather crazy) "while :; do ovs-dpctl dump-flows | grep
> > clone; done" on all 3 servers and didn't see a single hit.
> >
> > Instead I went to grep for one of the addresses directly and I'm seeing:
> > https://gist.github.com/stgraber/4c7fe7b4e45cd3fa36d54759091bb2c0
> >
> > I don't know if that's telling you anything useful, but it appears to
> > be a couple of machines trying to access a TCP port on the OVN router.
> > My OVN router addresses are publicly reachable (45.45.148.136 in this
> > case), so some amount of port scanning and the like is somewhat common
> > and the likely cause of this traffic in the first place.
>
> FWIW, I've scheduled time to look into making OVN add drop rules so
> that the recirculation until TTL is 0 does not happen when TCP/UDP
> traffic is directed at the LRP address. Hopefully it will be simple,
> but the LRP address is also used for SNAT and needs to let conntrack
> accept return traffic, so there may be more to it.
> I'll find out when I start unraveling.
>
> --
> Frode Nordahl
>
> > For now, I'll work on a proper patch for the kernel.
> >
> > Thanks!
> >
> > Stéphane
> >
> > > Best regards, Ilya Maximets.
> > > >
> > > > Stéphane
> > > >
> > > > On Wed, Mar 30, 2022 at 9:24 PM Ilya Maximets <[email protected]> wrote:
> > > >>
> > > >> On 3/31/22 03:00, Stéphane Graber wrote:
> > > >>> So it looks like my main issue and the one mentioned in the PS section
> > > >>> may actually be one and the same.
> > > >>>
> > > >>> I've run an experiment now where I monitor the normal leakage over
> > > >>> 30min, getting me around 1.1GiB of extra kmalloc-256.
> > > >>> I then repeated the experiment but with a firewall (good old iptables)
> > > >>> dropping all non-ICMP traffic headed to the OVN router addresses; this
> > > >>> reduced the leakage to just 89MiB.
> > > >>>
> > > >>> So it looks like that traffic is getting lost in a redirection loop
> > > >>> for a while in OVN/OVS, triggers the "openvswitch: ovs-system:
> > > >>> deferred action limit reached, drop recirc action" entry and also
> > > >>> leaks a bunch of memory in the process. We did attempt to pin the
> > > >>> issue to this cause earlier by generating some traffic which would hit
> > > >>> the recirc loop, but that didn't cause a very visible increase at the
> > > >>> time. Given the new data, however, our attempt at manually triggering
> > > >>> it on an otherwise unused network must have been flawed somehow.
> > > >>>
> > > >>> The amount of traffic dropped by the firewall is quite small too, which
> > > >>> suggests that things could be significantly worse.
> > > >>> In this case I saw a total of 3478 packets dropped for a total of 183852 bytes.
> > > >>>
> > > >>> We'll be doing a bit more digging on the OVN side to see what may be
> > > >>> causing this and will report back.
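The interim firewall workaround described above (dropping all non-ICMP traffic headed to the OVN router addresses) would amount to something like the following sketch; the address is illustrative, and by default the rules are only printed rather than installed:

```shell
# Sketch of the interim workaround: drop everything except ICMP destined
# to the OVN logical router's external address, so the traffic never
# enters the recirculation loop. The address is illustrative.
# Set APPLY=1 to actually install the rules; by default they are printed.
ROUTER_IP=${ROUTER_IP:-45.45.148.136}

ipt() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"         # actually install the rule
    else
        echo "$@"    # default: just print it
    fi
}

ipt iptables -I FORWARD -d "$ROUTER_IP" ! -p icmp -j DROP
ipt iptables -I INPUT   -d "$ROUTER_IP" ! -p icmp -j DROP
```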
> > > >>>
> > > >>> Stéphane
> > > >>>
> > > >>> On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <[email protected]> wrote:
> > > >>>>
> > > >>>> Hello,
> > > >>>>
> > > >>>> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
> > > >>>> 5.17.1 mainline kernel.
> > > >>>>
> > > >>>> I'm trying to debug a very, very problematic kernel memory leak which
> > > >>>> is happening in this environment.
> > > >>>> Sadly I don't have a clear reproducer or much idea of when it first
> > > >>>> appeared. I went through about a year of metrics and some amount of
> > > >>>> leakage may always have been present; the magnitude of it just
> > > >>>> changed recently, making it such that my normal weekly server
> > > >>>> maintenance is now not frequent enough to take care of it.
> > > >>>>
> > > >>>> Basically what I'm doing is running an LXD + OVN setup on 3 servers;
> > > >>>> they all act as OVN chassis with various priorities to spread the
> > > >>>> load.
> > > >>>> All 3 servers also normally run a combination of containers and
> > > >>>> virtual machines attached to about a dozen different OVN networks.
> > > >>>>
> > > >>>> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub)
> > > >>>> which, after enabling slub debugging, can be tracked down to
> > > >>>> nf_ct_tmpl_alloc kernel calls such as those made by the openvswitch
> > > >>>> kernel module as part of its ovs_ct_copy_action function, which is
> > > >>>> exposed as OVS_ACTION_ATTR_CT to userspace through the openvswitch
> > > >>>> netlink API.
> > > >>>>
> > > >>>> This means that in just a couple of days I'm dealing with just shy of
> > > >>>> 40GiB of those kmalloc-256 entries.
> > > >>>> Here is one of the servers which has been running for just 6 hours:
> > > >>>>
> > > >>>> ```
> > > >>>> root@abydos:~# uptime
> > > >>>>  20:51:44 up 6:32, 1 user, load average: 6.63, 5.95, 5.26
> > > >>>>
> > > >>>> root@abydos:~# slabtop -o -s c | head -n10
> > > >>>> Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
> > > >>>> Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
> > > >>>> Active / Total Caches (% used)     : 150 / 197 (76.1%)
> > > >>>> Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
> > > >>>> Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
> > > >>>>
> > > >>>>     OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> > > >>>> 13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
> > > >>>>  2959671  2677549  90%    0.10K  75889       39    303556K buffer_head
> > > >>>>   460411   460411 100%    0.57K  16444       28    263104K radix_tree_node
> > > >>>>
> > > >>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces | sort -rn | head -n5
> > > >>>> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack] age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
> > > >>>>   803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408 pid=0-683331 cpus=0-31 nodes=0-1
> > > >>>>    32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515 pid=1-731302 cpus=0-31 nodes=0-1
> > > >>>>     3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386 pid=752-648826 cpus=0-31 nodes=0-1
> > > >>>>     2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698 cpus=0-31 nodes=0-1
> > > >>>>
> > > >>>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces | sort -rn | head -n5
> > > >>>>  8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
> > > >>>>  2770915 reserve_sfa_size+0xdf/0x110 [openvswitch] age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
> > > >>>>  1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033 cpus=0-31 nodes=0-1
> > > >>>>   288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack] age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
> > > >>>>   134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch] age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
> > > >>>> ```
> > > >>>>
> > > >>>> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k to nf_ct_tmpl_free.
> > > >>>>
> > > >>>> Things I've done so far to try to isolate things:
> > > >>>>
> > > >>>> 1) I've evacuated all workloads from the server, so the only thing
> > > >>>>    running on it is OVS vswitchd. This did not change anything.
> > > >>>> 2) I've added iptables/ip6tables raw table rules marking all traffic
> > > >>>>    as NOTRACK. This did not change anything.
> > > >>>> 3) I've played with the chassis assignment. This does change things, in
> > > >>>>    that a server with no active chassis will not show any leakage
> > > >>>>    (thankfully). The busier the network I move back to the host, the
> > > >>>>    faster the leakage.
> > > >>>>
> > > >>>> I've had both Frode and Tom (CCed) assist with a variety of ideas and
> > > >>>> questions, but while we have found some unrelated OVS and kernel
> > > >>>> issues, we're yet to figure this one out. So I wanted to reach out to
> > > >>>> the wider community to see if anyone has either seen something like
> > > >>>> this before or has suggestions as to where to look next.
> > > >>>>
> > > >>>> I can pretty easily rebuild the kernel, OVS or OVN, and while this
> > > >>>> cluster is a production environment, the fact that I can evacuate one
> > > >>>> of the three servers with no user-visible impact makes it not too bad
> > > >>>> to debug.
> > > >>>> Having to constantly reboot the entire setup to clear the
> > > >>>> memory leak is the bigger annoyance right now :)
> > > >>>>
> > > >>>> Stéphane
> > > >>>>
> > > >>>> PS: The side OVS/kernel issue I'm referring to is
> > > >>>> https://lore.kernel.org/netdev/[email protected]/
> > > >>>> which allowed us to track down an issue with OVN logical routers
> > > >>>> properly responding to ICMP on their external address but getting into
> > > >>>> a recirculation loop when any other kind of traffic is thrown at them
> > > >>>> (instead of being immediately dropped or rejected). The kernel change
> > > >>>> made it possible to track this down.
> > > >>
> > > >> Hi, Stéphane.
> > > >>
> > > >> Thanks for the report.
> > > >>
> > > >> I'm not sure if that's what you see, but I think that I found one
> > > >> pretty serious memory leak in the openvswitch module. In short,
> > > >> it leaks all the dynamically allocated memory for nested actions.
> > > >> E.g. if you have a datapath flow with actions:clone(ct(commit)),
> > > >> the structure allocated by nf_ct_tmpl_alloc for a nested ct() action
> > > >> will not be freed.
> > > >>
> > > >> Could you try the change below?
> > > >>
> > > >> It will additionally print a rate-limited error if a nested CT action is
> > > >> encountered:
> > > >>
> > > >>   openvswitch: netlink: ovs_ct_free_action at depth 2
> > > >>
> > > >> This message can be used to confirm the issue.
> > > >>
> > > >> The very last bit of the fix is not necessary, but I just spotted
> > > >> the additional formatting issue and fixed it.
> > > >>
> > > >> ---
> > > >> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> > > >> index 5176f6ccac8e..248c39da465e 100644
> > > >> --- a/net/openvswitch/flow_netlink.c
> > > >> +++ b/net/openvswitch/flow_netlink.c
> > > >> @@ -2317,6 +2317,65 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size)
> > > >>         return sfa;
> > > >>  }
> > > >>
> > > >> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> > > >> +                                          int len, int depth);
> > > >> +
> > > >> +static void ovs_nla_free_check_pkt_len_action(const struct nlattr *action,
> > > >> +                                              int depth)
> > > >> +{
> > > >> +       const struct nlattr *a;
> > > >> +       int rem;
> > > >> +
> > > >> +       nla_for_each_nested(a, action, rem) {
> > > >> +               switch (nla_type(a)) {
> > > >> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL:
> > > >> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER:
> > > >> +                       ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a),
> > > >> +                                                     depth);
> > > >> +                       break;
> > > >> +               }
> > > >> +       }
> > > >> +}
> > > >> +
> > > >> +static void ovs_nla_free_clone_action(const struct nlattr *action, int depth)
> > > >> +{
> > > >> +       const struct nlattr *a;
> > > >> +       int rem;
> > > >> +
> > > >> +       nla_for_each_nested(a, action, rem) {
> > > >> +               switch (nla_type(a)) {
> > > >> +               case OVS_CLONE_ATTR_EXEC:
> > > >> +                       /* Real list of actions follows this attribute.
> > > >> +                        * Free them and return.
> > > >> +                        */
> > > >> +                       a = nla_next(a, &rem);
> > > >> +                       ovs_nla_free_flow_actions_nla(a, rem, depth);
> > > >> +                       return;
> > > >> +               }
> > > >> +       }
> > > >> +}
> > > >> +
> > > >> +static void ovs_nla_free_dec_ttl_action(const struct nlattr *action, int depth)
> > > >> +{
> > > >> +       const struct nlattr *a = nla_data(action);
> > > >> +
> > > >> +       switch (nla_type(a)) {
> > > >> +       case OVS_DEC_TTL_ATTR_ACTION:
> > > >> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> > > >> +               break;
> > > >> +       }
> > > >> +}
> > > >> +
> > > >> +static void ovs_nla_free_sample_action(const struct nlattr *action, int depth)
> > > >> +{
> > > >> +       const struct nlattr *a = nla_data(action);
> > > >> +
> > > >> +       switch (nla_type(a)) {
> > > >> +       case OVS_SAMPLE_ATTR_ARG:
> > > >> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> > > >> +               break;
> > > >> +       }
> > > >> +}
> > > >> +
> > > >>  static void ovs_nla_free_set_action(const struct nlattr *a)
> > > >>  {
> > > >>         const struct nlattr *ovs_key = nla_data(a);
> > > >> @@ -2330,25 +2389,55 @@ static void ovs_nla_free_set_action(const struct nlattr *a)
> > > >>         }
> > > >>  }
> > > >>
> > > >> -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> > > >> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> > > >> +                                          int len, int depth)
> > > >>  {
> > > >>         const struct nlattr *a;
> > > >>         int rem;
> > > >>
> > > >> -       if (!sf_acts)
> > > >> +       if (!actions)
> > > >>                 return;
> > > >>
> > > >> -       nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
> > > >> +       depth++;
> > > >> +
> > > >> +       nla_for_each_attr(a, actions, len, rem) {
> > > >>                 switch (nla_type(a)) {
> > > >> -               case OVS_ACTION_ATTR_SET:
> > > >> -                       ovs_nla_free_set_action(a);
> > > >> +               case OVS_ACTION_ATTR_CHECK_PKT_LEN:
> > > >> +                       ovs_nla_free_check_pkt_len_action(a, depth);
> > > >> +                       break;
> > > >> +
> > > >> +               case OVS_ACTION_ATTR_CLONE:
> > > >> +                       ovs_nla_free_clone_action(a, depth);
> > > >> +                       break;
> > > >> +
> > > >>                 case OVS_ACTION_ATTR_CT:
> > > >> +                       if (depth != 1)
> > > >> +                               OVS_NLERR(true,
> > > >> +                                         "ovs_ct_free_action at depth %d", depth);
> > > >>                         ovs_ct_free_action(a);
> > > >>                         break;
> > > >> +
> > > >> +               case OVS_ACTION_ATTR_DEC_TTL:
> > > >> +                       ovs_nla_free_dec_ttl_action(a, depth);
> > > >> +                       break;
> > > >> +
> > > >> +               case OVS_ACTION_ATTR_SAMPLE:
> > > >> +                       ovs_nla_free_sample_action(a, depth);
> > > >> +                       break;
> > > >> +
> > > >> +               case OVS_ACTION_ATTR_SET:
> > > >> +                       ovs_nla_free_set_action(a);
> > > >> +                       break;
> > > >>                 }
> > > >>         }
> > > >> +}
> > > >> +
> > > >> +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> > > >> +{
> > > >> +       if (!sf_acts)
> > > >> +               return;
> > > >>
> > > >> +       ovs_nla_free_flow_actions_nla(sf_acts->actions, sf_acts->actions_len, 0);
> > > >>         kfree(sf_acts);
> > > >>  }
> > > >>
> > > >> @@ -3458,7 +3547,9 @@ static int clone_action_to_attr(const struct nlattr *attr,
> > > >>         if (!start)
> > > >>                 return -EMSGSIZE;
> > > >>
> > > >> -       err = ovs_nla_put_actions(nla_data(attr), rem, skb);
> > > >> +       /* Skipping the OVS_CLONE_ATTR_EXEC that is always present. */
> > > >> +       attr = nla_next(nla_data(attr), &rem);
> > > >> +       err = ovs_nla_put_actions(attr, rem, skb);
> > > >>
> > > >>         if (err)
> > > >>                 nla_nest_cancel(skb, start);
> > > >> ---
> > > >>
> > > >> Best regards, Ilya Maximets.
> > > > _______________________________________________
> > > > dev mailing list
> > > > [email protected]
> > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
