On Mon, Oct 31, 2022 at 5:54 PM Ilya Maximets <i.maxim...@ovn.org> wrote:

> On 10/31/22 17:25, Donald Sharp via discuss wrote:
> > Hi!
> >
> > I work on the FRRouting project (https://frrouting.org) and have noticed
> > that when I have a full BGP feed on a system that is also running
> > ovs-vswitchd, ovs-vswitchd sits at 100% CPU:
> >
> > top - 09:43:12 up 4 days, 22:53,  3 users,  load average: 1.06, 1.08, 1.08
> > Tasks: 188 total,   3 running, 185 sleeping,   0 stopped,   0 zombie
> > %Cpu(s): 12.3 us, 14.7 sy,  0.0 ni, 72.8 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
> > MiB Mem :   7859.3 total,   2756.5 free,   2467.2 used,   2635.6 buff/cache
> > MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.   5101.9 avail Mem
> >
> >     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >     730 root      10 -10  146204 146048  11636 R  98.3   1.8   6998:13 ovs-vswitchd
> >  169620 root      20   0       0      0      0 I   3.3   0.0   1:34.83 kworker/0:3-events
> >      21 root      20   0       0      0      0 S   1.3   0.0  14:09.59 ksoftirqd/1
> >  131734 frr       15  -5 2384292 609556   6612 S   1.0   7.6  21:57.51 zebra
> >  131739 frr       15  -5 1301168   1.0g   7420 S   1.0  13.3  18:16.17 bgpd
> >
> > When I turn off FRR (or turn off the BGP feed), ovs-vswitchd stops
> > running at 100%:
> >
> > top - 09:48:12 up 4 days, 22:58,  3 users,  load average: 0.08, 0.60, 0.89
> > Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.2 us,  0.4 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
> > MiB Mem :   7859.3 total,   4560.6 free,    663.1 used,   2635.6 buff/cache
> > MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.   6906.1 avail Mem
> >
> >     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >  179064 sharpd    20   0   11852   3816   3172 R   1.0   0.0   0:00.09 top
> >    1037 zerotie+  20   0  291852 113180   7408 S   0.7   1.4  19:09.17 zerotier-one
> >    1043 Debian-+  20   0   34356  21988   7588 S   0.3   0.3  22:04.42 snmpd
> >  178480 root      20   0       0      0      0 I   0.3   0.0   0:01.21 kworker/1:2-events
> >  178622 sharpd    20   0   14020   6364   4872 S   0.3   0.1   0:00.10 sshd
> >       1 root      20   0  169872  13140   8272 S   0.0   0.2   2:33.26 systemd
> >       2 root      20   0       0      0      0 S   0.0   0.0   0:00.60 kthreadd
> >
> > I do not have any particular ovs configuration on this box:
> > sharpd@janelle:~$ sudo ovs-vsctl show
> > c72d327c-61eb-4877-b4e7-dcf7e07e24fc
> >     ovs_version: "2.13.8"
> >
> >
> > sharpd@janelle:~$ sudo ovs-vsctl list o .
> > _uuid               : c72d327c-61eb-4877-b4e7-dcf7e07e24fc
> > bridges             : []
> > cur_cfg             : 0
> > datapath_types      : [netdev, system]
> > datapaths           : {}
> > db_version          : "8.2.0"
> > dpdk_initialized    : false
> > dpdk_version        : none
> > external_ids        : {hostname=janelle, rundir="/var/run/openvswitch", system-id="a1031fcf-8acc-40a9-9fd6-521716b0faaa"}
> > iface_types         : [erspan, geneve, gre, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
> > manager_options     : []
> > next_cfg            : 0
> > other_config        : {}
> > ovs_version         : "2.13.8"
> > ssl                 : []
> > statistics          : {}
> > system_type         : ubuntu
> > system_version      : "20.04"
> >
> > sharpd@janelle:~$ sudo ovs-appctl dpctl/dump-flows -m
> > ovs-vswitchd: no datapaths exist
> > ovs-vswitchd: datapath not found (Invalid argument)
> > ovs-appctl: ovs-vswitchd: server returned an error
> >
> > Eli Britstein suggested I update Open vSwitch to the latest version; I did
> > and saw the same behavior.  When I pulled up the running code in a debugger
> > I see that ovs-vswitchd is running in this loop below pretty much 100% of
> > the time:
> >
> > (gdb) f 4
> > #4  0x0000559498b4e476 in route_table_run () at lib/route-table.c:133
> > 133                 nln_run(nln);
> > (gdb) l
> > 128             OVS_EXCLUDED(route_table_mutex)
> > 129         {
> > 130             ovs_mutex_lock(&route_table_mutex);
> > 131             if (nln) {
> > 132                 rtnetlink_run();
> > 133                 nln_run(nln);
> > 134
> > 135                 if (!route_table_valid) {
> > 136                     route_table_reset();
> > 137                 }
> > (gdb) l
> > 138             }
> > 139             ovs_mutex_unlock(&route_table_mutex);
> > 140         }
> >
> > I pulled up where route_table_valid is set:
> >
> > 298         static void
> > 299         route_table_change(const struct route_table_msg *change OVS_UNUSED,
> > 300                            void *aux OVS_UNUSED)
> > 301         {
> > 302             route_table_valid = false;
> > 303         }
> >
> >
> > If I am reading the code correctly, every RTM_NEWROUTE netlink message that
> > ovs-vswitchd receives sets the route_table_valid global variable to false,
> > causing route_table_reset() to be run.  This makes sense in the context of
> > what FRR is doing: a full BGP feed *always* has churn.  So ovs-vswitchd
> > receives an RTM_NEWROUTE message, parses it, decides in route_table_change()
> > that the route table is no longer valid, and calls route_table_reset(),
> > which re-dumps the entire routing table to ovs-vswitchd.  In this case there
> > are ~115k IPv6 routes in the Linux FIB.
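> > As a rough back-of-envelope illustration (my numbers, not measured): if the
> > feed churns at even ten route changes per second and each change triggers a
> > fresh dump, that is on the order of a million netlink route records for
> > ovs-vswitchd to parse every second, which lines up with the 100% CPU above.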
> >
> > I hesitate to make any changes here since I really don't understand what
> > the end goal is.  ovs-vswitchd receives a route change from the kernel but
> > in turn re-dumps the entire routing table.  What should the correct
> > behavior be from ovs-vswitchd's perspective here?
>
> Hi, Donald.
>
> Your analysis is correct.  OVS invalidates the cached routing
> table on each netlink notification about route changes and
> re-dumps it in full on the next access.
>
> Looking back through the commit history, OVS used to maintain the
> cache incrementally, adding/removing only what was in each netlink
> message.  But that changed in 2011 with the following commit:
>
> commit f0e167f0dbadbe2a8d684f63ad9faf68d8cb9884
> Author: Ethan J. Jackson <e...@eecs.berkeley.edu>
> Date:   Thu Jan 13 16:29:31 2011 -0800
>
>     route-table: Handle route updates more robustly.
>
>     The kernel does not broadcast rtnetlink route messages in all cases
>     one would expect.  This can cause stale entires to end up in the
>     route table which may cause incorrect results for
>     route_table_get_ifindex() queries.  This commit causes rtnetlink
>     route messages to dump the entire route table on the next
>     route_table_get_ifindex() query.
>
> And indeed, looking at the history of different projects' attempts
> to use route notifications, they all face issues, and it seems that
> none of them is actually able to handle all the notifications fully
> correctly, simply because these notifications are notoriously bad.
> In certain cases it seems impossible to tell what exactly changed
> and how.  There can be duplicate or missing notifications.  And the
> code of projects that try to maintain a route cache in userspace is
> insanely complex and doesn't handle 100% of cases anyway.
>
> There were attempts to convince kernel developers to add unique
> identifiers to routes so that userspace can tell them apart, but
> all of them seem to have died, leaving the problem unresolved.
>
> These are some discussions/bugs that I found:
>   https://bugzilla.redhat.com/show_bug.cgi?id=1337855
>   https://bugzilla.redhat.com/show_bug.cgi?id=1722728
>   https://github.com/thom311/libnl/issues/226
>   https://github.com/thom311/libnl/issues/224
>
>
Hi Ilya!


I would argue that the only events that really cause desynchronization are
route messages received with NLM_F_APPEND set.  Why not just look for that
flag and reset that one route, or at worst resync the whole table at that
point?  The current approach punishes the 99% use case, where no one uses
the APPEND operation, for exactly the reasons outlined in the issues above
(and frankly these issues are the bread and butter of what I've spent a
non-trivial amount of time fixing in the FRR project over the last few
years).

Interface up/down events also cause the kernel to not send route deletion
events, but these can be inferred, which is exactly what FRR does in this
case; a sketch of both ideas follows below.
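
To make that concrete, here is a minimal, hypothetical sketch of the idea
(not actual OVS code; route_cache_upsert(), route_cache_remove(), and
route_cache_drop_ifindex() are assumed helpers named only for illustration):
invalidate the cache only for the ambiguous NLM_F_APPEND case and handle
everything else incrementally.

    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>
    #include <stdbool.h>

    static bool route_table_valid = true;   /* mirrors lib/route-table.c */

    static void
    route_table_handle_msg(const struct nlmsghdr *nlh)
    {
        switch (nlh->nlmsg_type) {
        case RTM_NEWROUTE:
            if (nlh->nlmsg_flags & NLM_F_APPEND) {
                /* Multipath append: ambiguous without a full view of
                 * the route, so invalidate and re-dump later. */
                route_table_valid = false;
            } else {
                /* Plain add/replace: apply incrementally. */
                /* route_cache_upsert(nlh); */
            }
            break;
        case RTM_DELROUTE:
            /* route_cache_remove(nlh); */
            break;
        case RTM_DELLINK:
            /* The kernel may not send RTM_DELROUTE for routes that
             * died with the link; infer those deletions from the
             * ifindex, as FRR does. */
            /* route_cache_drop_ifindex(nlh); */
            break;
        }
    }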

> None of the bugs seems to be resolved.  Most are closed for
> non-technical reasons.
>
> I suppose Ethan just decided not to deal with that horribly
> unreliable kernel interface and simply re-dump the route table on
> changes.
>
>
> For your actual problem here, I'm not sure if we can fix it
> that easily.
>
> Is it necessary for OVS to know about these routes?
> If not, it might be possible to isolate them in a separate network
> namespace so that OVS would not receive all the route updates.
>
>
I cannot answer that question :(.  I do not know what OVS uses the routes
for!  Hence this discussion with the OVS community :)

From my perspective, though, I would prefer a solution where ovs-vswitchd
does not initiate very expensive operations every time a route is changed
in the system.  These networking tools often run on embedded systems with
limited CPU and memory, so people will notice and come complaining to us
developers.  I can't tell you the number of times I get pinged to look at
side effects in other software caused by something FRR is inserting into
the kernel.  One example:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/565

> Do you know how long it takes to dump a route table once?

A full BGP feed of v4 and v6 routes is 1 million+ routes at this point in
time, and it will continue to grow.  I have not timed how long this takes
ovs-vswitchd to read in.  When I have ~100k routes installed, it takes 5-6
seconds for ovs-vswitchd to settle down.  I cannot comment further on OVS
behavior at this point in time.

thanks!

donald

> Maybe it's worth limiting that process to dump only once a second
> or once every few seconds.  That should alleviate the load if the
> actual dump is relatively fast.
>
> Best regards, Ilya Maximets.
>
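
For illustration, a minimal sketch of that rate-limiting idea against
lib/route-table.c, assuming OVS's time_msec() from lib/timeval.h and the
route_table_valid/route_table_reset() shown earlier in this thread; this
is a sketch, not a tested patch:

    #include "timeval.h"                    /* time_msec() */

    #define ROUTE_TABLE_DUMP_INTERVAL_MS 1000

    static long long int last_reset_ms;     /* time of last full re-dump */

    /* Could be called from route_table_run() instead of resetting
     * unconditionally: re-dump at most once per interval even while
     * notifications keep invalidating the cache. */
    static void
    route_table_maybe_reset(void)
    {
        long long int now = time_msec();

        if (!route_table_valid
            && now - last_reset_ms >= ROUTE_TABLE_DUMP_INTERVAL_MS) {
            route_table_reset();            /* existing full re-dump */
            last_reset_ms = now;
        }
    }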