Ilya Maximets <i.maxim...@ovn.org> writes:

> Currently, ovs-vswitchd is subscribed to all the routing changes in the
> kernel.  On each change, it marks the internal routing table cache as
> invalid, then resets it and dumps all the routes from the kernel from
> scratch.  The reason for that is kernel routing updates not being
> reliable in a sense that it's hard to tell which route is getting
> removed or modified.  Userspace application has to track the order in
> which route entries are dumped from the kernel.  Updates can get lost
> or even duplicated and the kernel doesn't provide a good mechanism to
> distinguish one route from another.  To my knowledge, dumping all the
> routes from a kernel after each change is the only way to keep the
> cache consistent.  Some more info can be found in the following never
> addressed issues:
>   https://bugzilla.redhat.com/1337860
>   https://bugzilla.redhat.com/1337855
>
> It seems to be believed that NetworkManager "mostly" does incremental
> updates right.  But it is still not completely correct, will re-dump
> the whole table in certain cases, and it takes a huge amount of very
> complicated code to do the accounting and route comparisons.
>
> Going back to ovs-vswitchd, it currently dumps routes from all the
> routing tables.  If it will get conflicting routes from multiple
> tables, the cache will not be useful.  The routing cache in userspace
> is primarily used for checking the egress port for tunneled traffic
> and this way also detecting link state changes for a tunnel port.
> For userspace datapath it is used for actual routing of the packet
> after sending to a native tunnel.
> With kernel datapath we don't really have a mechanism to know which
> routing table will actually be used by the kernel after encapsulation,
> so our lookups on a cache may be incorrect because of this as well.
>
> So, unless all the relevant routes are in the standard tables, the
> lookup in userspace route cache is unreliable.
>
> Luckily, most setups are not using any complicated routing in
> non-standard tables that OVS has to be aware of.
>
> It is possible, but unlikely, that standard routing tables are
> completely empty while some other custom table is not, and all the OVS
> tunnel traffic is directed to that table.  That would be the only
> scenario where dumping non-standard tables would make sense.  But it
> seems like this kind of setup will likely need a way to tell OVS from
> which table the routes should be taken, or we'll need to dump routing
> rules and keep a separate cache for each table, so we can first match
> on rules and then lookup correct routes in a specific table.  I'm not
> sure if trying to implement all that is justified.
>
> For now, stop considering routes from non-standard tables to avoid
> mixing different tables together and also wasting CPU resources.
>
> This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is
> running on a same host and in a same network namespace with OVS using
> its own custom routing table.
>
> Unfortunately, there seems to be no way to tell the kernel to send
> updates only for particular tables.  So, we'll still receive and parse
> all of them.  But they will not result in a full cache invalidation in
> most cases.
>
> Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps.
> So, we can make use of it and dump only standard tables when we get a
> relevant route update.  NETLINK_GET_STRICT_CHK has to be enabled on
> the socket for filtering to work.  There is no reason to not enable it
> by default, if supported.  It is not used outside of NETLINK_ROUTE.
>
> Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.")
> Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb")
> Reported-at: https://github.com/openvswitch/ovs-issues/issues/185
> Reported-at: 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>  lib/netlink-protocol.h | 10 ++++++
>  lib/netlink-socket.c   |  8 +++++
>  lib/route-table.c      | 80 +++++++++++++++++++++++++++++++++---------
>  tests/system-route.at  | 64 +++++++++++++++++++++++++++++++++
>  4 files changed, 146 insertions(+), 16 deletions(-)
>
> diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h
> index 6eaa7035a..e4bb28ac9 100644
> --- a/lib/netlink-protocol.h
> +++ b/lib/netlink-protocol.h
> @@ -155,6 +155,11 @@ enum {
>  #define NLA_TYPE_MASK       ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER)
>  #endif
>  
> +/* Introduced in v4.4. */
> +#ifndef NLM_F_DUMP_FILTERED
> +#define NLM_F_DUMP_FILTERED 0x20
> +#endif
> +
>  /* These were introduced all together in 2.6.14.  (We want our programs to
>   * support the newer kernel features even if compiled with older headers.) */
>  #ifndef NETLINK_ADD_MEMBERSHIP
> @@ -168,6 +173,11 @@ enum {
>  #define NETLINK_LISTEN_ALL_NSID 8
>  #endif
>  
> +/* Strict checking of netlink arguments introduced in Linux kernel v4.20. */
> +#ifndef NETLINK_GET_STRICT_CHK
> +#define NETLINK_GET_STRICT_CHK 12
> +#endif
> +
>  /* These were introduced all together in 2.6.23.  (We want our programs to
>   * support the newer kernel features even if compiled with older headers.) */
>  #ifndef CTRL_ATTR_MCAST_GRP_MAX
> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> index 80da20d9f..eea880b2c 100644
> --- a/lib/netlink-socket.c
> +++ b/lib/netlink-socket.c
> @@ -205,6 +205,14 @@ nl_sock_create(int protocol, struct nl_sock **sockp)
>          }
>      }
>  
> +    /* Strict checking only supported for NETLINK_ROUTE. */
> +    if (protocol == NETLINK_ROUTE
> +        && setsockopt(sock->fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
> +                      &one, sizeof one) < 0) {
> +        VLOG_INFO("netlink: could not enable strict checking (%s)",
> +                  ovs_strerror(errno));

Maybe this message is a bit scary for users - but I'm not sure what
message might be less alarming.  Something like:

netlink: strict checking off (%s)

Not a big deal, just a kindof nit.  I hope it doesn't become VLOG_WARN
later or something, and want to avoid someone reading it and thinking
that it is a misclassified warning.

I don't think it could happen often, but if we have opened a large
number of nl sockets (maybe some large nl_pool_alloc) then we end up
with quite a few of these messages as well.  Maybe we can just vlog this
once?

> +    }
> +
>      retval = get_socket_rcvbuf(sock->fd);
>      if (retval < 0) {
>          retval = -retval;
> diff --git a/lib/route-table.c b/lib/route-table.c
> index 9927dcc18..f1fe32714 100644
> --- a/lib/route-table.c
> +++ b/lib/route-table.c
> @@ -26,6 +26,7 @@
>  #include <linux/rtnetlink.h>
>  #include <net/if.h>
>  
> +#include "coverage.h"
>  #include "hash.h"
>  #include "netdev.h"
>  #include "netlink.h"
> @@ -44,6 +45,8 @@
>  
>  VLOG_DEFINE_THIS_MODULE(route_table);
>  
> +COVERAGE_DEFINE(route_table_dump);
> +
>  struct route_data {
>      /* Copied from struct rtmsg. */
>      unsigned char rtm_dst_len;
> @@ -80,7 +83,7 @@ static struct nln_notifier *name_notifier = NULL;
>  
>  static bool route_table_valid = false;
>  
> -static int route_table_reset(void);
> +static void route_table_reset(void);
>  static void route_table_handle_msg(const struct route_table_msg *);
>  static int route_table_parse(struct ofpbuf *, struct route_table_msg *);
>  static void route_table_change(const struct route_table_msg *, void *);
> @@ -153,26 +156,22 @@ route_table_wait(void)
>      ovs_mutex_unlock(&route_table_mutex);
>  }
>  
> -static int
> -route_table_reset(void)
> +static bool
> +route_table_dump_one_table(unsigned char id)
>  {
> -    struct nl_dump dump;
> -    struct rtgenmsg *rtgenmsg;
>      uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
>      struct ofpbuf request, reply, buf;
> -
> -    route_map_clear();
> -    netdev_get_addrs_list_flush();
> -    route_table_valid = true;
> -    rt_change_seq++;
> +    struct rtmsg *rq_msg;
> +    bool filtered = true;
> +    struct nl_dump dump;
>  
>      ofpbuf_init(&request, 0);
>  
> -    nl_msg_put_nlmsghdr(&request, sizeof *rtgenmsg, RTM_GETROUTE,
> -                        NLM_F_REQUEST);
> +    nl_msg_put_nlmsghdr(&request, sizeof *rq_msg, RTM_GETROUTE, 
> NLM_F_REQUEST);
>  
> -    rtgenmsg = ofpbuf_put_zeros(&request, sizeof *rtgenmsg);
> -    rtgenmsg->rtgen_family = AF_UNSPEC;
> +    rq_msg = ofpbuf_put_zeros(&request, sizeof *rq_msg);
> +    rq_msg->rtm_family = AF_UNSPEC;
> +    rq_msg->rtm_table = id;

Any reason to not consider setting RTA_TABLE attribute rather than use
rtm_table.  RTA_TABLE supports the extended range, rather than just the
8bits range that rtm_table can fit.

>  
>      nl_dump_start(&dump, NETLINK_ROUTE, &request);
>      ofpbuf_uninit(&request);
> @@ -182,12 +181,43 @@ route_table_reset(void)
>          struct route_table_msg msg;
>  
>          if (route_table_parse(&reply, &msg)) {
> +            struct nlmsghdr *nlmsghdr = nl_msg_nlmsghdr(&reply);
> +
> +            /* Older kernels do not support filtering. */
> +            if (!(nlmsghdr->nlmsg_flags & NLM_F_DUMP_FILTERED)) {
> +                filtered = false;
> +            }
>              route_table_handle_msg(&msg);
>          }
>      }
>      ofpbuf_uninit(&buf);
> +    nl_dump_done(&dump);
> +
> +    return filtered;
> +}
> +
> +static void
> +route_table_reset(void)
> +{
> +    unsigned char tables[] = {
> +        RT_TABLE_DEFAULT,
> +        RT_TABLE_MAIN,
> +        RT_TABLE_LOCAL,
> +    };
>  
> -    return nl_dump_done(&dump);
> +    route_map_clear();
> +    netdev_get_addrs_list_flush();
> +    route_table_valid = true;
> +    rt_change_seq++;
> +
> +    COVERAGE_INC(route_table_dump);
> +
> +    for (size_t i = 0; i < ARRAY_SIZE(tables); i++) {
> +        if (!route_table_dump_one_table(tables[i])) {
> +            /* Got unfiltered reply, no need to dump further. */
> +            break;
> +        }
> +    }
>  }
>  
>  /* Return RTNLGRP_IPV4_ROUTE or RTNLGRP_IPV6_ROUTE on success, 0 on parse
> @@ -203,6 +233,7 @@ route_table_parse(struct ofpbuf *buf, struct 
> route_table_msg *change)
>          [RTA_GATEWAY] = { .type = NL_A_U32, .optional = true },
>          [RTA_MARK] = { .type = NL_A_U32, .optional = true },
>          [RTA_PREFSRC] = { .type = NL_A_U32, .optional = true },
> +        [RTA_TABLE] = { .type = NL_A_U32, .optional = true },
>      };
>  
>      static const struct nl_policy policy6[] = {
> @@ -211,6 +242,7 @@ route_table_parse(struct ofpbuf *buf, struct 
> route_table_msg *change)
>          [RTA_MARK] = { .type = NL_A_U32, .optional = true },
>          [RTA_GATEWAY] = { .type = NL_A_IPV6, .optional = true },
>          [RTA_PREFSRC] = { .type = NL_A_IPV6, .optional = true },
> +        [RTA_TABLE] = { .type = NL_A_U32, .optional = true },
>      };
>  
>      struct nlattr *attrs[ARRAY_SIZE(policy)];
> @@ -232,6 +264,7 @@ route_table_parse(struct ofpbuf *buf, struct 
> route_table_msg *change)
>  
>      if (parsed) {
>          const struct nlmsghdr *nlmsg;
> +        uint32_t table_id;
>          int rta_oif;      /* Output interface index. */
>  
>          nlmsg = buf->data;
> @@ -247,6 +280,19 @@ route_table_parse(struct ofpbuf *buf, struct 
> route_table_msg *change)
>              rtm->rtm_type != RTN_LOCAL) {
>              change->relevant = false;
>          }
> +
> +        table_id = rtm->rtm_table;
> +        if (attrs[RTA_TABLE]) {
> +            table_id = nl_attr_get_u32(attrs[RTA_TABLE]);
> +        }
> +        /* Do not consider changes in non-standard routing tables. */
> +        if (table_id
> +            && table_id != RT_TABLE_DEFAULT
> +            && table_id != RT_TABLE_MAIN
> +            && table_id != RT_TABLE_LOCAL) {
> +            change->relevant = false;
> +        }
> +
>          change->nlmsg_type     = nlmsg->nlmsg_type;
>          change->rd.rtm_dst_len = rtm->rtm_dst_len + (ipv4 ? 96 : 0);
>          change->rd.local = rtm->rtm_type == RTN_LOCAL;
> @@ -312,7 +358,9 @@ static void
>  route_table_change(const struct route_table_msg *change OVS_UNUSED,
>                     void *aux OVS_UNUSED)
>  {
> -    route_table_valid = false;
> +    if (!change || change->relevant) {
> +        route_table_valid = false;
> +    }
>  }
>  
>  static void
> diff --git a/tests/system-route.at b/tests/system-route.at
> index 114aaebc7..c0ecad6cf 100644
> --- a/tests/system-route.at
> +++ b/tests/system-route.at
> @@ -64,3 +64,67 @@ Cached: fc00:db8:beef::13/128 dev br0 GW fc00:db8:cafe::1 
> SRC fc00:db8:cafe::2])
>  
>  OVS_TRAFFIC_VSWITCHD_STOP
>  AT_CLEANUP
> +
> +dnl Checks that OVS doesn't use routes from non-standard tables.
> +AT_SETUP([ovs-route - route tables])
> +AT_KEYWORDS([route])
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +dnl Create tap port.
> +on_exit 'ip link del p1-route'
> +AT_CHECK([ip tuntap add name p1-route mode tap])
> +AT_CHECK([ip link set p1-route up])
> +
> +dnl Add ip address.
> +AT_CHECK([ip addr add 10.0.0.17/24 dev p1-route], [0], [stdout])
> +
> +dnl Check that OVS catches route updates.
> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
> [dnl
> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local])
> +
> +dnl Add a route to the main routing table and check that OVS caches
> +dnl this new route.
> +AT_CHECK([ip route add 10.0.0.18/32 dev p1-route])
> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
> [dnl
> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
> +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17])
> +
> +dnl Add a route to a custom routing table and check that OVS doesn't cache 
> it.
> +AT_CHECK([ip route add 10.0.0.19/32 dev p1-route table 42])
> +AT_CHECK([ip route show table 42 | grep 'p1-route' | grep -q '10.0.0.19'])
> +dnl Give the main thread a chance to act.
> +AT_CHECK([ovs-appctl revalidator/wait])
> +dnl Check that OVS didn't learn this route.
> +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl
> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
> +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17
> +])
> +
> +dnl Delete a route from the main table and check that OVS removes the route
> +dnl from the cache.
> +AT_CHECK([ip route del 10.0.0.18/32 dev p1-route])
> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
> [dnl
> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local])
> +
> +dnl Delete a route from a custom routing table and check that the cache
> +dnl dosn't change.
> +AT_CHECK([ip route del 10.0.0.19/32 dev p1-route table 42])
> +dnl Give the main thread a chance to act.
> +AT_CHECK([ovs-appctl revalidator/wait])
> +dnl Check that the cache is still the same.
> +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl
> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
> +])
> +
> +dnl Delete ip address.
> +AT_CHECK([ip addr del 10.0.0.17/24 dev p1-route], [0], [stdout])
> +dnl Check that routes were removed from OVS.
> +OVS_WAIT_UNTIL([test $(ovs-appctl ovs/route/show | grep -c 'p1-route') -eq 0 
> ])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to