On 3/31/22 03:00, Stéphane Graber wrote:
> So it looks like my main issue and the one mentioned in the PS section
> may actually be one and the same.
> 
> I've now run an experiment where I monitored the normal leakage over
> 30min, which got me around 1.1GiB of extra kmalloc-256.
> I then repeated the experiment with a firewall (good old iptables)
> dropping all non-ICMP traffic headed to the OVN router addresses;
> this reduced the leakage to just 89MiB.
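> 
> For reference, the drop rule was along these lines (a sketch; the
> router address below is a placeholder):
> 
> ```
> # drop all non-ICMP traffic headed to an OVN router address
> iptables -I FORWARD -d 203.0.113.1 ! -p icmp -j DROP
> ```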
> 
> So it looks like that traffic gets stuck in a redirection loop in
> OVN/OVS for a while, triggering the "openvswitch: ovs-system:
> deferred action limit reached, drop recirc action" log entry and
> leaking a bunch of memory in the process. We did attempt to pin the
> issue on this cause earlier by generating some traffic that would hit
> the recirc loop, but that didn't cause a very visible increase at the
> time. Given the new data, however, our attempt at manually triggering
> it on an otherwise unused network must have been flawed somehow.
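> 
> For context on that log entry: if I read net/openvswitch/actions.c
> correctly, recirculation is bounded by a small fixed FIFO of deferred
> actions, roughly (a paraphrased sketch, not verbatim kernel code):
> 
> ```
> /* Each recirc/sample execution defers an action into a small
>  * per-CPU FIFO; once the FIFO fills up, the recirc action is
>  * dropped and the message above is printed. */
> #define DEFERRED_ACTION_FIFO_SIZE 10
> ```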
> 
> The amount of traffic dropped by the firewall is quite small too,
> which suggests that things could be significantly worse under heavier
> load. In this case I saw a total of 3478 packets dropped, amounting
> to 183852 bytes.
> 
> We'll be doing a bit more digging on the OVN side to see what may be
> causing this and will report back.
> 
> Stéphane
> 
> On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <[email protected]> wrote:
>>
>> Hello,
>>
>> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
>> 5.17.1 mainline kernel.
>>
>> I'm trying to debug a very problematic kernel memory leak which is
>> happening in this environment.
>> Sadly I don't have a clear reproducer or much idea of when it first
>> appeared. Going through about a year of metrics, some amount of
>> leakage may always have been present; its magnitude has just grown
>> recently, to the point that my normal weekly server maintenance is
>> no longer frequent enough to take care of it.
>>
>> Basically what I'm doing is running a LXD + OVN setup on 3 servers;
>> they all act as OVN chassis with various priorities to spread the
>> load.
>> All 3 servers also normally run a combination of containers and
>> virtual machines attached to about a dozen different OVN networks.
>>
>> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub
>> cache) which, after enabling slub debugging, can be tracked down to
>> nf_ct_tmpl_alloc kernel calls, such as those made by the openvswitch
>> kernel module as part of its ovs_ct_copy_action function, which is
>> exposed as OVS_ACTION_ATTR_CT to userspace through the openvswitch
>> netlink API.
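>>
>> For anyone wanting to reproduce the tracing below: the alloc/free
>> traces come from slub user tracking, i.e. roughly the following on
>> the kernel command line (a sketch from memory; the cache name may
>> need adjusting):
>>
>> ```
>> slub_debug=U,kmalloc-256
>> ```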
>>
>> This means that in just a couple of days I'm dealing with just shy of
>> 40GiB of those kmalloc-256 entries.
>> Here is one of the servers, which has been running for just 6 hours:
>>
>> ```
>> root@abydos:~# uptime
>>  20:51:44 up  6:32,  1 user,  load average: 6.63, 5.95, 5.26
>>
>> root@abydos:~# slabtop -o -s c | head -n10
>>  Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
>>  Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
>>  Active / Total Caches (% used)     : 150 / 197 (76.1%)
>>  Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
>>  Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
>>
>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>> 13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
>> 2959671 2677549  90%    0.10K  75889       39    303556K buffer_head
>> 460411 460411 100%    0.57K  16444       28    263104K radix_tree_node
>>
>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces |
>> sort -rn | head -n5
>> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack]
>> age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
>>  803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408
>> pid=0-683331 cpus=0-31 nodes=0-1
>>   32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515
>> pid=1-731302 cpus=0-31 nodes=0-1
>>    3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386
>> pid=752-648826 cpus=0-31 nodes=0-1
>>    2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698
>> cpus=0-31 nodes=0-1
>>
>> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces |
>> sort -rn | head -n5
>> 8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
>> 2770915 reserve_sfa_size+0xdf/0x110 [openvswitch]
>> age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
>> 1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033
>> cpus=0-31 nodes=0-1
>>  288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack]
>> age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
>>  134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch]
>> age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
>> ```
>>
>> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k to 
>> nf_ct_tmpl_free.
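>>
>> As a back-of-the-envelope check: 12,203,895 allocations minus 288,686
>> frees leaves roughly 11.9M live objects; at the 0.75K per-object size
>> reported by slabtop (inflated by the slub debugging metadata), that
>> accounts for about 8.5GiB of the ~9.5GiB kmalloc-256 cache above.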
>>
>> Things I've done so far to try to isolate the issue:
>>
>> 1) I've evacuated all workloads from the server, so the only thing
>> running on it is OVS vswitchd. This did not change anything.
>> 2) I've added iptables/ip6tables raw table rules marking all traffic
>> as NOTRACK (sketched after this list). This did not change anything.
>> 3) I've played with the chassis assignment. This does change things in
>> that a server with no active chassis will not show any leakage
>> (thankfully). The busier the network I move back to the host, the
>> faster the leakage.
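>>
>> For completeness, the NOTRACK rules from point 2 were roughly the
>> following (a sketch; mirrored for ip6tables):
>>
>> ```
>> iptables -t raw -A PREROUTING -j NOTRACK
>> iptables -t raw -A OUTPUT -j NOTRACK
>> ```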
>>
>>
>> I've had both Frode and Tom (CCed) assist with a variety of ideas
>> and questions, but while we have found some unrelated OVS and kernel
>> issues, we have yet to figure this one out. So I wanted to reach out
>> to the wider community to see if anyone has either seen something
>> like this before or has suggestions as to where to look next.
>>
>> I can pretty easily rebuild the kernel, OVS or OVN, and while this
>> cluster is a production environment, the fact that I can evacuate one
>> of the three servers with no user-visible impact makes it not too bad
>> to debug. Having to constantly reboot the entire setup to clear the
>> memory leak is the bigger annoyance right now :)
>>
>> Stéphane
>>
>> PS: The side OVS/kernel issue I'm referring to is
>> https://lore.kernel.org/netdev/[email protected]/
>> which allowed us to track down an issue where OVN logical routers
>> properly respond to ICMP on their external address but get into a
>> recirculation loop when any other kind of traffic is thrown at them
>> (instead of it being immediately dropped or rejected).

Hi, Stéphane.

Thanks for the report.

I'm not sure if that's what you're seeing, but I think I found one
pretty serious memory leak in the openvswitch module.  In short,
it leaks all the dynamically allocated memory for nested actions.
E.g. if you have a datapath flow with actions:clone(ct(commit)),
the structure allocated by nf_ct_tmpl_alloc for a nested ct() action
will not be freed.
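
For illustration, a datapath flow along these lines (the match fields
are made up; format as printed by "ovs-appctl dpctl/dump-flows") would
hit the leak, since the ct() template is allocated for an action
nested inside clone():

  recirc_id(0),in_port(2),eth_type(0x0800),ipv4(frag=no),
      actions:clone(ct(commit),recirc(0x1))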

Could you try the change below?

It will additionally print a rate-limited error if a nested CT action
is encountered:
  openvswitch: netlink: ovs_ct_free_action at depth 2

This message can be used to confirm the issue.

The very last hunk of the fix is not strictly necessary, but I just
spotted an additional formatting issue and fixed it along the way.

---
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 5176f6ccac8e..248c39da465e 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2317,6 +2317,65 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size)
        return sfa;
 }
 
+static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
+                                         int len, int depth);
+
+static void ovs_nla_free_check_pkt_len_action(const struct nlattr *action,
+                                             int depth)
+{
+       const struct nlattr *a;
+       int rem;
+
+       nla_for_each_nested(a, action, rem) {
+               switch (nla_type(a)) {
+               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL:
+               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER:
+                       ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a),
+                                                     depth);
+                       break;
+               }
+       }
+}
+
+static void ovs_nla_free_clone_action(const struct nlattr *action, int depth)
+{
+       const struct nlattr *a;
+       int rem;
+
+       nla_for_each_nested(a, action, rem) {
+               switch (nla_type(a)) {
+               case OVS_CLONE_ATTR_EXEC:
+                       /* Real list of actions follows this attribute.
+                        * Free them and return. */
+                       a = nla_next(a, &rem);
+                       ovs_nla_free_flow_actions_nla(a, rem, depth);
+                       return;
+               }
+       }
+}
+
+static void ovs_nla_free_dec_ttl_action(const struct nlattr *action, int depth)
+{
+       const struct nlattr *a = nla_data(action);
+
+       switch (nla_type(a)) {
+       case OVS_DEC_TTL_ATTR_ACTION:
+               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
+               break;
+       }
+}
+
+static void ovs_nla_free_sample_action(const struct nlattr *action, int depth)
+{
+       const struct nlattr *a = nla_data(action);
+
+       switch (nla_type(a)) {
+       case OVS_SAMPLE_ATTR_ARG:
+               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
+               break;
+       }
+}
+
 static void ovs_nla_free_set_action(const struct nlattr *a)
 {
        const struct nlattr *ovs_key = nla_data(a);
@@ -2330,25 +2389,55 @@ static void ovs_nla_free_set_action(const struct nlattr *a)
        }
 }
 
-void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
+static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
+                                         int len, int depth)
 {
        const struct nlattr *a;
        int rem;
 
-       if (!sf_acts)
+       if (!actions)
                return;
 
-       nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
+       depth++;
+
+       nla_for_each_attr(a, actions, len, rem) {
                switch (nla_type(a)) {
-               case OVS_ACTION_ATTR_SET:
-                       ovs_nla_free_set_action(a);
+               case OVS_ACTION_ATTR_CHECK_PKT_LEN:
+                       ovs_nla_free_check_pkt_len_action(a, depth);
+                       break;
+
+               case OVS_ACTION_ATTR_CLONE:
+                       ovs_nla_free_clone_action(a, depth);
                        break;
+
                case OVS_ACTION_ATTR_CT:
+                       if (depth != 1)
+                               OVS_NLERR(true,
+                                       "ovs_ct_free_action at depth %d", depth);
                        ovs_ct_free_action(a);
                        break;
+
+               case OVS_ACTION_ATTR_DEC_TTL:
+                       ovs_nla_free_dec_ttl_action(a, depth);
+                       break;
+
+               case OVS_ACTION_ATTR_SAMPLE:
+                       ovs_nla_free_sample_action(a, depth);
+                       break;
+
+               case OVS_ACTION_ATTR_SET:
+                       ovs_nla_free_set_action(a);
+                       break;
                }
        }
+}
+
+void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
+{
+       if (!sf_acts)
+               return;
 
+       ovs_nla_free_flow_actions_nla(sf_acts->actions, sf_acts->actions_len, 0);
        kfree(sf_acts);
 }
 
@@ -3458,7 +3547,9 @@ static int clone_action_to_attr(const struct nlattr *attr,
        if (!start)
                return -EMSGSIZE;
 
-       err = ovs_nla_put_actions(nla_data(attr), rem, skb);
+       /* Skipping the OVS_CLONE_ATTR_EXEC that is always present. */
+       attr = nla_next(nla_data(attr), &rem);
+       err = ovs_nla_put_actions(attr, rem, skb);
 
        if (err)
                nla_nest_cancel(skb, start);
---
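
(If this patch addresses your leak, the kmalloc-256 growth in slabtop
and the nf_ct_tmpl_alloc/nf_ct_tmpl_free imbalance in the slub traces
should both level off.)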

Best regards, Ilya Maximets.