Ilya Maximets <i.maxim...@ovn.org> writes: > Currently, ovs-vswitchd is subscribed to all the routing changes in the > kernel. On each change, it marks the internal routing table cache as > invalid, then resets it and dumps all the routes from the kernel from > scratch. The reason for that is kernel routing updates not being > reliable in a sense that it's hard to tell which route is getting > removed or modified. Userspace application has to track the order in > which route entries are dumped from the kernel. Updates can get lost > or even duplicated and the kernel doesn't provide a good mechanism to > distinguish one route from another. To my knowledge, dumping all the > routes from a kernel after each change is the only way to keep the > cache consistent. Some more info can be found in the following never > addressed issues: > https://bugzilla.redhat.com/1337860 > https://bugzilla.redhat.com/1337855 > > It seems to be believed that NetworkManager "mostly" does incremental > updates right. But it is still not completely correct, will re-dump > the whole table in certain cases, and it takes a huge amount of very > complicated code to do the accounting and route comparisons. > > Going back to ovs-vswitchd, it currently dumps routes from all the > routing tables. If it will get conflicting routes from multiple > tables, the cache will not be useful. The routing cache in userspace > is primarily used for checking the egress port for tunneled traffic > and this way also detecting link state changes for a tunnel port. > For userspace datapath it is used for actual routing of the packet > after sending to a native tunnel. > With kernel datapath we don't really have a mechanism to know which > routing table will actually be used by the kernel after encapsulation, > so our lookups on a cache may be incorrect because of this as well. > > So, unless all the relevant routes are in the standard tables, the > lookup in userspace route cache is unreliable. > > Luckily, most setups are not using any complicated routing in > non-standard tables that OVS has to be aware of. > > It is possible, but unlikely, that standard routing tables are > completely empty while some other custom table is not, and all the OVS > tunnel traffic is directed to that table. That would be the only > scenario where dumping non-standard tables would make sense. But it > seems like this kind of setup will likely need a way to tell OVS from > which table the routes should be taken, or we'll need to dump routing > rules and keep a separate cache for each table, so we can first match > on rules and then lookup correct routes in a specific table. I'm not > sure if trying to implement all that is justified. > > For now, stop considering routes from non-standard tables to avoid > mixing different tables together and also wasting CPU resources. > > This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is > running on a same host and in a same network namespace with OVS using > its own custom routing table. > > Unfortunately, there seems to be no way to tell the kernel to send > updates only for particular tables. So, we'll still receive and parse > all of them. But they will not result in a full cache invalidation in > most cases. > > Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps. > So, we can make use of it and dump only standard tables when we get a > relevant route update. NETLINK_GET_STRICT_CHK has to be enabled on > the socket for filtering to work. There is no reason to not enable it > by default, if supported. It is not used outside of NETLINK_ROUTE. > > Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.") > Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb") > Reported-at: https://github.com/openvswitch/ovs-issues/issues/185 > Reported-at: > https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html > Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> > --- > lib/netlink-protocol.h | 10 ++++++ > lib/netlink-socket.c | 8 +++++ > lib/route-table.c | 80 +++++++++++++++++++++++++++++++++--------- > tests/system-route.at | 64 +++++++++++++++++++++++++++++++++ > 4 files changed, 146 insertions(+), 16 deletions(-) > > diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h > index 6eaa7035a..e4bb28ac9 100644 > --- a/lib/netlink-protocol.h > +++ b/lib/netlink-protocol.h > @@ -155,6 +155,11 @@ enum { > #define NLA_TYPE_MASK ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER) > #endif > > +/* Introduced in v4.4. */ > +#ifndef NLM_F_DUMP_FILTERED > +#define NLM_F_DUMP_FILTERED 0x20 > +#endif > + > /* These were introduced all together in 2.6.14. (We want our programs to > * support the newer kernel features even if compiled with older headers.) */ > #ifndef NETLINK_ADD_MEMBERSHIP > @@ -168,6 +173,11 @@ enum { > #define NETLINK_LISTEN_ALL_NSID 8 > #endif > > +/* Strict checking of netlink arguments introduced in Linux kernel v4.20. */ > +#ifndef NETLINK_GET_STRICT_CHK > +#define NETLINK_GET_STRICT_CHK 12 > +#endif > + > /* These were introduced all together in 2.6.23. (We want our programs to > * support the newer kernel features even if compiled with older headers.) */ > #ifndef CTRL_ATTR_MCAST_GRP_MAX > diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c > index 80da20d9f..eea880b2c 100644 > --- a/lib/netlink-socket.c > +++ b/lib/netlink-socket.c > @@ -205,6 +205,14 @@ nl_sock_create(int protocol, struct nl_sock **sockp) > } > } > > + /* Strict checking only supported for NETLINK_ROUTE. */ > + if (protocol == NETLINK_ROUTE > + && setsockopt(sock->fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK, > + &one, sizeof one) < 0) { > + VLOG_INFO("netlink: could not enable strict checking (%s)", > + ovs_strerror(errno));
Maybe this message is a bit scary for users - but I'm not sure what message might be less alarming. Something like: netlink: strict checking off (%s) Not a big deal, just a kindof nit. I hope it doesn't become VLOG_WARN later or something, and want to avoid someone reading it and thinking that it is a misclassified warning. I don't think it could happen often, but if we have opened a large number of nl sockets (maybe some large nl_pool_alloc) then we end up with quite a few of these messages as well. Maybe we can just vlog this once? > + } > + > retval = get_socket_rcvbuf(sock->fd); > if (retval < 0) { > retval = -retval; > diff --git a/lib/route-table.c b/lib/route-table.c > index 9927dcc18..f1fe32714 100644 > --- a/lib/route-table.c > +++ b/lib/route-table.c > @@ -26,6 +26,7 @@ > #include <linux/rtnetlink.h> > #include <net/if.h> > > +#include "coverage.h" > #include "hash.h" > #include "netdev.h" > #include "netlink.h" > @@ -44,6 +45,8 @@ > > VLOG_DEFINE_THIS_MODULE(route_table); > > +COVERAGE_DEFINE(route_table_dump); > + > struct route_data { > /* Copied from struct rtmsg. */ > unsigned char rtm_dst_len; > @@ -80,7 +83,7 @@ static struct nln_notifier *name_notifier = NULL; > > static bool route_table_valid = false; > > -static int route_table_reset(void); > +static void route_table_reset(void); > static void route_table_handle_msg(const struct route_table_msg *); > static int route_table_parse(struct ofpbuf *, struct route_table_msg *); > static void route_table_change(const struct route_table_msg *, void *); > @@ -153,26 +156,22 @@ route_table_wait(void) > ovs_mutex_unlock(&route_table_mutex); > } > > -static int > -route_table_reset(void) > +static bool > +route_table_dump_one_table(unsigned char id) > { > - struct nl_dump dump; > - struct rtgenmsg *rtgenmsg; > uint64_t reply_stub[NL_DUMP_BUFSIZE / 8]; > struct ofpbuf request, reply, buf; > - > - route_map_clear(); > - netdev_get_addrs_list_flush(); > - route_table_valid = true; > - rt_change_seq++; > + struct rtmsg *rq_msg; > + bool filtered = true; > + struct nl_dump dump; > > ofpbuf_init(&request, 0); > > - nl_msg_put_nlmsghdr(&request, sizeof *rtgenmsg, RTM_GETROUTE, > - NLM_F_REQUEST); > + nl_msg_put_nlmsghdr(&request, sizeof *rq_msg, RTM_GETROUTE, > NLM_F_REQUEST); > > - rtgenmsg = ofpbuf_put_zeros(&request, sizeof *rtgenmsg); > - rtgenmsg->rtgen_family = AF_UNSPEC; > + rq_msg = ofpbuf_put_zeros(&request, sizeof *rq_msg); > + rq_msg->rtm_family = AF_UNSPEC; > + rq_msg->rtm_table = id; Any reason to not consider setting RTA_TABLE attribute rather than use rtm_table. RTA_TABLE supports the extended range, rather than just the 8bits range that rtm_table can fit. > > nl_dump_start(&dump, NETLINK_ROUTE, &request); > ofpbuf_uninit(&request); > @@ -182,12 +181,43 @@ route_table_reset(void) > struct route_table_msg msg; > > if (route_table_parse(&reply, &msg)) { > + struct nlmsghdr *nlmsghdr = nl_msg_nlmsghdr(&reply); > + > + /* Older kernels do not support filtering. */ > + if (!(nlmsghdr->nlmsg_flags & NLM_F_DUMP_FILTERED)) { > + filtered = false; > + } > route_table_handle_msg(&msg); > } > } > ofpbuf_uninit(&buf); > + nl_dump_done(&dump); > + > + return filtered; > +} > + > +static void > +route_table_reset(void) > +{ > + unsigned char tables[] = { > + RT_TABLE_DEFAULT, > + RT_TABLE_MAIN, > + RT_TABLE_LOCAL, > + }; > > - return nl_dump_done(&dump); > + route_map_clear(); > + netdev_get_addrs_list_flush(); > + route_table_valid = true; > + rt_change_seq++; > + > + COVERAGE_INC(route_table_dump); > + > + for (size_t i = 0; i < ARRAY_SIZE(tables); i++) { > + if (!route_table_dump_one_table(tables[i])) { > + /* Got unfiltered reply, no need to dump further. */ > + break; > + } > + } > } > > /* Return RTNLGRP_IPV4_ROUTE or RTNLGRP_IPV6_ROUTE on success, 0 on parse > @@ -203,6 +233,7 @@ route_table_parse(struct ofpbuf *buf, struct > route_table_msg *change) > [RTA_GATEWAY] = { .type = NL_A_U32, .optional = true }, > [RTA_MARK] = { .type = NL_A_U32, .optional = true }, > [RTA_PREFSRC] = { .type = NL_A_U32, .optional = true }, > + [RTA_TABLE] = { .type = NL_A_U32, .optional = true }, > }; > > static const struct nl_policy policy6[] = { > @@ -211,6 +242,7 @@ route_table_parse(struct ofpbuf *buf, struct > route_table_msg *change) > [RTA_MARK] = { .type = NL_A_U32, .optional = true }, > [RTA_GATEWAY] = { .type = NL_A_IPV6, .optional = true }, > [RTA_PREFSRC] = { .type = NL_A_IPV6, .optional = true }, > + [RTA_TABLE] = { .type = NL_A_U32, .optional = true }, > }; > > struct nlattr *attrs[ARRAY_SIZE(policy)]; > @@ -232,6 +264,7 @@ route_table_parse(struct ofpbuf *buf, struct > route_table_msg *change) > > if (parsed) { > const struct nlmsghdr *nlmsg; > + uint32_t table_id; > int rta_oif; /* Output interface index. */ > > nlmsg = buf->data; > @@ -247,6 +280,19 @@ route_table_parse(struct ofpbuf *buf, struct > route_table_msg *change) > rtm->rtm_type != RTN_LOCAL) { > change->relevant = false; > } > + > + table_id = rtm->rtm_table; > + if (attrs[RTA_TABLE]) { > + table_id = nl_attr_get_u32(attrs[RTA_TABLE]); > + } > + /* Do not consider changes in non-standard routing tables. */ > + if (table_id > + && table_id != RT_TABLE_DEFAULT > + && table_id != RT_TABLE_MAIN > + && table_id != RT_TABLE_LOCAL) { > + change->relevant = false; > + } > + > change->nlmsg_type = nlmsg->nlmsg_type; > change->rd.rtm_dst_len = rtm->rtm_dst_len + (ipv4 ? 96 : 0); > change->rd.local = rtm->rtm_type == RTN_LOCAL; > @@ -312,7 +358,9 @@ static void > route_table_change(const struct route_table_msg *change OVS_UNUSED, > void *aux OVS_UNUSED) > { > - route_table_valid = false; > + if (!change || change->relevant) { > + route_table_valid = false; > + } > } > > static void > diff --git a/tests/system-route.at b/tests/system-route.at > index 114aaebc7..c0ecad6cf 100644 > --- a/tests/system-route.at > +++ b/tests/system-route.at > @@ -64,3 +64,67 @@ Cached: fc00:db8:beef::13/128 dev br0 GW fc00:db8:cafe::1 > SRC fc00:db8:cafe::2]) > > OVS_TRAFFIC_VSWITCHD_STOP > AT_CLEANUP > + > +dnl Checks that OVS doesn't use routes from non-standard tables. > +AT_SETUP([ovs-route - route tables]) > +AT_KEYWORDS([route]) > +OVS_TRAFFIC_VSWITCHD_START() > + > +dnl Create tap port. > +on_exit 'ip link del p1-route' > +AT_CHECK([ip tuntap add name p1-route mode tap]) > +AT_CHECK([ip link set p1-route up]) > + > +dnl Add ip address. > +AT_CHECK([ip addr add 10.0.0.17/24 dev p1-route], [0], [stdout]) > + > +dnl Check that OVS catches route updates. > +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], > [dnl > +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17 > +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local]) > + > +dnl Add a route to the main routing table and check that OVS caches > +dnl this new route. > +AT_CHECK([ip route add 10.0.0.18/32 dev p1-route]) > +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], > [dnl > +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17 > +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local > +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17]) > + > +dnl Add a route to a custom routing table and check that OVS doesn't cache > it. > +AT_CHECK([ip route add 10.0.0.19/32 dev p1-route table 42]) > +AT_CHECK([ip route show table 42 | grep 'p1-route' | grep -q '10.0.0.19']) > +dnl Give the main thread a chance to act. > +AT_CHECK([ovs-appctl revalidator/wait]) > +dnl Check that OVS didn't learn this route. > +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl > +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17 > +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local > +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17 > +]) > + > +dnl Delete a route from the main table and check that OVS removes the route > +dnl from the cache. > +AT_CHECK([ip route del 10.0.0.18/32 dev p1-route]) > +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], > [dnl > +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17 > +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local]) > + > +dnl Delete a route from a custom routing table and check that the cache > +dnl dosn't change. > +AT_CHECK([ip route del 10.0.0.19/32 dev p1-route table 42]) > +dnl Give the main thread a chance to act. > +AT_CHECK([ovs-appctl revalidator/wait]) > +dnl Check that the cache is still the same. > +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl > +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17 > +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local > +]) > + > +dnl Delete ip address. > +AT_CHECK([ip addr del 10.0.0.17/24 dev p1-route], [0], [stdout]) > +dnl Check that routes were removed from OVS. > +OVS_WAIT_UNTIL([test $(ovs-appctl ovs/route/show | grep -c 'p1-route') -eq 0 > ]) > + > +OVS_TRAFFIC_VSWITCHD_STOP > +AT_CLEANUP _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev