Thanks for the patch!

I've applied it on top of my 5.17.1 kernel and I'm now getting a very
large number of:

[  188.616220] openvswitch: netlink: ovs_ct_free_action at depth 3

After just 3 minutes of uptime I'm seeing 230 such lines but also:

[  108.726680] net_ratelimit: 6283 callbacks suppressed
[  114.515126] net_ratelimit: 3959 callbacks suppressed
[  120.523252] net_ratelimit: 4388 callbacks suppressed
[  127.458617] net_ratelimit: 2992 callbacks suppressed
[  133.543592] net_ratelimit: 2348 callbacks suppressed
[  140.099915] net_ratelimit: 5089 callbacks suppressed
[  145.116922] net_ratelimit: 6136 callbacks suppressed
[  150.551786] net_ratelimit: 5953 callbacks suppressed
[  155.555861] net_ratelimit: 5108 callbacks suppressed
[  162.740087] net_ratelimit: 6507 callbacks suppressed
[  168.059741] net_ratelimit: 3161 callbacks suppressed
[  173.092062] net_ratelimit: 4377 callbacks suppressed
[  178.104068] net_ratelimit: 3090 callbacks suppressed
[  183.135704] net_ratelimit: 2735 callbacks suppressed
[  188.616160] net_ratelimit: 2893 callbacks suppressed
[  193.628158] net_ratelimit: 4149 callbacks suppressed
[  199.128232] net_ratelimit: 3813 callbacks suppressed
[  204.136278] net_ratelimit: 2884 callbacks suppressed
[  211.443642] net_ratelimit: 5322 callbacks suppressed
[  216.643209] net_ratelimit: 5820 callbacks suppressed
[  221.648342] net_ratelimit: 4908 callbacks suppressed
[  226.656369] net_ratelimit: 4533 callbacks suppressed
[  231.680416] net_ratelimit: 4247 callbacks suppressed
[  237.172416] net_ratelimit: 2481 callbacks suppressed

I did a 30min test run like earlier and I'm now seeing a tiny increase
in kmalloc-256 of just 448KB :)

So that fix definitely did the trick to plug that leak!

Tested-by: Stéphane Graber <[email protected]>

Stéphane


On Wed, Mar 30, 2022 at 9:24 PM Ilya Maximets <[email protected]> wrote:
>
> On 3/31/22 03:00, Stéphane Graber wrote:
> > So it looks like my main issue and the one mentioned in the PS section
> > may actually be one and the same.
> >
> > I've run an experiment now where I monitor the normal leakage over
> > 30min, getting me around 1.1GiB of extra kmalloc-256.
> > I then repeated the experiment but with firewall (good old iptables)
> > dropping all non-ICMP traffic headed to the OVN router addresses, this
> > reduced the leakage to just 89MiB.
> >
> > So it looks like that traffic gets stuck for a while in a
> > recirculation loop in OVN/OVS, which triggers the "openvswitch:
> > ovs-system: deferred action limit reached, drop recirc action" log
> > entry and also leaks a bunch of memory in the process. We did attempt
> > to pin the issue on this cause earlier by generating some traffic
> > which would hit the recirc loop, but that didn't cause a very visible
> > increase at the time. Given the new data, however, our attempt at
> > manually triggering it on an otherwise unused network must have been
> > flawed somehow.
> >
> > The amount of traffic dropped by the firewall is quite small too,
> > which suggests that things could get significantly worse.
> > In this case, I saw a total of 3478 packets dropped, amounting to
> > 183852 bytes.
> >
> > We'll be doing a bit more digging on the OVN side to see what may be
> > causing this and will report back.
> >
> > Stéphane
> >
> > On Wed, Mar 30, 2022 at 5:08 PM Stéphane Graber <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
> >> 5.17.1 mainline kernel.
> >>
> >> I'm trying to debug a very problematic kernel memory leak which is
> >> happening in this environment.
> >> Sadly, I don't have a clear reproducer or much idea of when it first
> >> appeared. I went through about a year of metrics and some amount of
> >> leakage may always have been present; its magnitude only changed
> >> recently, to the point that my normal weekly server maintenance is no
> >> longer frequent enough to take care of it.
> >>
> >> Basically, I'm running an LXD + OVN setup on 3 servers; they all act
> >> as OVN chassis with various priorities to spread the load.
> >> All 3 servers also normally run a combination of containers and
> >> virtual machines attached to about a dozen different OVN networks.
> >>
> >> What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub)
> >> which, after enabling slub debugging, can be tracked down to
> >> nf_ct_tmpl_alloc kernel calls such as those made by the openvswitch
> >> kernel module as part of its ovs_ct_copy_action function, which is
> >> exposed as OVS_ACTION_ATTR_CT to userspace through the openvswitch
> >> netlink API.
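> >>
> >> To spell out the pairing (a rough stub model built from the symbol
> >> names above, not the actual kernel source):
> >>
> >> ```
> >> #include <stdlib.h>
> >>
> >> /* Stand-in for the conntrack template; the real object is what
> >>  * shows up in the kmalloc-256 slab below. */
> >> struct ct_tmpl { char data[256]; };
> >>
> >> /* One such allocation happens per OVS_ACTION_ATTR_CT parsed
> >>  * (ovs_ct_copy_action() -> nf_ct_tmpl_alloc())... */
> >> static struct ct_tmpl *tmpl_alloc(void)
> >> {
> >>         return calloc(1, sizeof(struct ct_tmpl));
> >> }
> >>
> >> /* ...and must be matched by a free (nf_ct_tmpl_free()) when the
> >>  * flow's actions are destroyed. */
> >> static void tmpl_free(struct ct_tmpl *t)
> >> {
> >>         free(t);
> >> }
> >>
> >> int main(void)
> >> {
> >>         tmpl_free(tmpl_alloc()); /* the free side is what's missing */
> >>         return 0;
> >> }
> >> ```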
> >>
> >> This means that within a couple of days I'm dealing with just shy of
> >> 40GiB of those kmalloc-256 entries.
> >> Here is one of the servers, which has been running for about 6.5 hours:
> >>
> >> ```
> >> root@abydos:~# uptime
> >>  20:51:44 up  6:32,  1 user,  load average: 6.63, 5.95, 5.26
> >>
> >> root@abydos:~# slabtop -o -s c | head -n10
> >>  Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
> >>  Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
> >>  Active / Total Caches (% used)     : 150 / 197 (76.1%)
> >>  Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
> >>  Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K
> >>
> >>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >> 13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
> >> 2959671 2677549  90%    0.10K  75889       39    303556K buffer_head
> >> 460411 460411 100%    0.57K  16444       28    263104K radix_tree_node
> >>
> >> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces | sort -rn | head -n5
> >> 12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack] age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
> >>  803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408 pid=0-683331 cpus=0-31 nodes=0-1
> >>   32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515 pid=1-731302 cpus=0-31 nodes=0-1
> >>    3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386 pid=752-648826 cpus=0-31 nodes=0-1
> >>    2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698 cpus=0-31 nodes=0-1
> >>
> >> root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces | sort -rn | head -n5
> >> 8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
> >> 2770915 reserve_sfa_size+0xdf/0x110 [openvswitch] age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
> >> 1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033 cpus=0-31 nodes=0-1
> >>  288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack] age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
> >>  134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch] age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
> >> ```
> >>
> >> Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k calls to
> >> nf_ct_tmpl_free. At the debug-inflated 0.75K per object, those ~11.9M
> >> unfreed templates account for the bulk of the kmalloc-256 usage above.
> >>
> >> Things I've done so far to try to isolate things:
> >>
> >> 1) I've evacuated all workloads from the server, so the only thing
> >> running on it is OVS vswitchd. This did not change anything.
> >> 2) I've added iptables/ip6tables raw table rules marking all traffic
> >> as NOTRACK. This did not change anything.
> >> 3) I've played with the chassis assignment. This does change things:
> >> a server with no active chassis will not show any leakage
> >> (thankfully), and the busier the network I move back to the host, the
> >> faster the leakage grows.
> >>
> >>
> >> I've had both Frode and Tom (CCed) assist with a variety of ideas and
> >> questions, but while we have found some unrelated OVS and kernel
> >> issues, we have yet to figure this one out. So I wanted to reach out
> >> to the wider community to see if anyone has either seen something
> >> like this before or has suggestions as to where to look next.
> >>
> >> I can pretty easily rebuild the kernel, OVS or OVN and while this
> >> cluster is a production environment, the fact that I can evacuate one
> >> of the three servers with no user-visible impact makes it not too bad
> >> to debug. Having to constantly reboot the entire setup to clear the
> >> memory leak is the bigger annoyance right now :)
> >>
> >> Stéphane
> >>
> >> PS: The side OVS/kernel issue I'm referring to is
> >> https://lore.kernel.org/netdev/[email protected]/
> >> That kernel change made it possible to track down an issue where OVN
> >> logical routers properly respond to ICMP on their external address
> >> but get into a recirculation loop when any other kind of traffic is
> >> thrown at them (instead of immediately dropping or rejecting it).
>
> Hi, Stéphane.
>
> Thanks for the report.
>
> I'm not sure if that's what you see, but I think that I found one
> pretty serious memory leak in the openvswitch module.  In short,
> it leaks all the dynamically allocated memory for nested actions.
> E.g. if you have a datapath flow with actions:clone(ct(commit)),
> the structure allocated by nf_ct_tmpl_alloc for a nested ct() action
> will not be freed.
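>
> To picture the problem, here is a minimal userspace sketch (hypothetical
> types, not the kernel code) of the pattern: the old free path only
> walked the top-level action list, so resources owned by actions nested
> inside e.g. clone() were never released. The fix is the recursion:
>
> ```
> #include <stdio.h>
> #include <stdlib.h>
>
> enum act_type { ACT_CT, ACT_CLONE };
>
> struct action {
>         enum act_type type;
>         void *ct_tmpl;          /* owned by ACT_CT, like the ct template */
>         struct action *nested;  /* sub-actions of ACT_CLONE */
>         struct action *next;
> };
>
> static void free_actions(struct action *a, int depth)
> {
>         while (a) {
>                 struct action *next = a->next;
>
>                 switch (a->type) {
>                 case ACT_CT:
>                         if (depth > 1) /* mirrors the debug message below */
>                                 printf("ct freed at depth %d\n", depth);
>                         free(a->ct_tmpl);
>                         break;
>                 case ACT_CLONE:
>                         /* The recursion the old code was missing. */
>                         free_actions(a->nested, depth + 1);
>                         break;
>                 }
>                 free(a);
>                 a = next;
>         }
> }
>
> int main(void)
> {
>         /* Models a flow with actions:clone(ct(commit)). */
>         struct action *ct = calloc(1, sizeof(*ct));
>         struct action *clone = calloc(1, sizeof(*clone));
>
>         ct->type = ACT_CT;
>         ct->ct_tmpl = malloc(256);   /* the kmalloc-256 object */
>         clone->type = ACT_CLONE;
>         clone->nested = ct;
>
>         free_actions(clone, 1);      /* prints: ct freed at depth 2 */
>         return 0;
> }
> ```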
>
> Could you try the change below?
>
> It will additionally print a rate-limited error if a nested CT action
> is encountered:
>   openvswitch: netlink: ovs_ct_free_action at depth 2
>
> This message can be used to confirm the issue.
>
> The very last bit of the fix (the clone_action_to_attr change) is not
> necessary for the leak itself; I just spotted an additional formatting
> issue there and fixed it along the way.
>
> ---
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index 5176f6ccac8e..248c39da465e 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -2317,6 +2317,65 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size)
>         return sfa;
>  }
>
> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> +                                         int len, int depth);
> +
> +static void ovs_nla_free_check_pkt_len_action(const struct nlattr *action,
> +                                             int depth)
> +{
> +       const struct nlattr *a;
> +       int rem;
> +
> +       nla_for_each_nested(a, action, rem) {
> +               switch (nla_type(a)) {
> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_LESS_EQUAL:
> +               case OVS_CHECK_PKT_LEN_ATTR_ACTIONS_IF_GREATER:
> +                       ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a),
> +                                                     depth);
> +                       break;
> +               }
> +       }
> +}
> +
> +static void ovs_nla_free_clone_action(const struct nlattr *action, int depth)
> +{
> +       const struct nlattr *a;
> +       int rem;
> +
> +       nla_for_each_nested(a, action, rem) {
> +               switch (nla_type(a)) {
> +               case OVS_CLONE_ATTR_EXEC:
> +                       /* Real list of actions follows this attribute.
> +                        * Free them and return. */
> +                       a = nla_next(a, &rem);
> +                       ovs_nla_free_flow_actions_nla(a, rem, depth);
> +                       return;
> +               }
> +       }
> +}
> +
> +static void ovs_nla_free_dec_ttl_action(const struct nlattr *action, int depth)
> +{
> +       const struct nlattr *a = nla_data(action);
> +
> +       switch (nla_type(a)) {
> +       case OVS_DEC_TTL_ATTR_ACTION:
> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> +               break;
> +       }
> +}
> +
> +static void ovs_nla_free_sample_action(const struct nlattr *action, int depth)
> +{
> +       const struct nlattr *a = nla_data(action);
> +
> +       switch (nla_type(a)) {
> +       case OVS_SAMPLE_ATTR_ARG:
> +               ovs_nla_free_flow_actions_nla(nla_data(a), nla_len(a), depth);
> +               break;
> +       }
> +}
> +
>  static void ovs_nla_free_set_action(const struct nlattr *a)
>  {
>         const struct nlattr *ovs_key = nla_data(a);
> @@ -2330,25 +2389,55 @@ static void ovs_nla_free_set_action(const struct nlattr *a)
>         }
>  }
>
> -void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> +static void ovs_nla_free_flow_actions_nla(const struct nlattr *actions,
> +                                         int len, int depth)
>  {
>         const struct nlattr *a;
>         int rem;
>
> -       if (!sf_acts)
> +       if (!actions)
>                 return;
>
> -       nla_for_each_attr(a, sf_acts->actions, sf_acts->actions_len, rem) {
> +       depth++;
> +
> +       nla_for_each_attr(a, actions, len, rem) {
>                 switch (nla_type(a)) {
> -               case OVS_ACTION_ATTR_SET:
> -                       ovs_nla_free_set_action(a);
> +               case OVS_ACTION_ATTR_CHECK_PKT_LEN:
> +                       ovs_nla_free_check_pkt_len_action(a, depth);
> +                       break;
> +
> +               case OVS_ACTION_ATTR_CLONE:
> +                       ovs_nla_free_clone_action(a, depth);
>                         break;
> +
>                 case OVS_ACTION_ATTR_CT:
> +                       if (depth != 1)
> +                               OVS_NLERR(true,
> +                                       "ovs_ct_free_action at depth %d",
> +                                       depth);
>                         ovs_ct_free_action(a);
>                         break;
> +
> +               case OVS_ACTION_ATTR_DEC_TTL:
> +                       ovs_nla_free_dec_ttl_action(a, depth);
> +                       break;
> +
> +               case OVS_ACTION_ATTR_SAMPLE:
> +                       ovs_nla_free_sample_action(a, depth);
> +                       break;
> +
> +               case OVS_ACTION_ATTR_SET:
> +                       ovs_nla_free_set_action(a);
> +                       break;
>                 }
>         }
> +}
> +
> +void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
> +{
> +       if (!sf_acts)
> +               return;
>
> +       ovs_nla_free_flow_actions_nla(sf_acts->actions, sf_acts->actions_len, 0);
>         kfree(sf_acts);
>  }
>
> @@ -3458,7 +3547,9 @@ static int clone_action_to_attr(const struct nlattr *attr,
>         if (!start)
>                 return -EMSGSIZE;
>
> -       err = ovs_nla_put_actions(nla_data(attr), rem, skb);
> +       /* Skipping the OVS_CLONE_ATTR_EXEC that is always present. */
> +       attr = nla_next(nla_data(attr), &rem);
> +       err = ovs_nla_put_actions(attr, rem, skb);
>
>         if (err)
>                 nla_nest_cancel(skb, start);
> ---
>
> Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
