On 3/19/24 20:56, Aaron Conole wrote:
> Ilya Maximets <i.maxim...@ovn.org> writes:
> 
>> Currently, ovs-vswitchd is subscribed to all the routing changes in the
>> kernel.  On each change, it marks the internal routing table cache as
>> invalid, then resets it and dumps all the routes from the kernel from
>> scratch.  The reason for that is kernel routing updates not being
>> reliable in a sense that it's hard to tell which route is getting
>> removed or modified.  Userspace application has to track the order in
>> which route entries are dumped from the kernel.  Updates can get lost
>> or even duplicated and the kernel doesn't provide a good mechanism to
>> distinguish one route from another.  To my knowledge, dumping all the
>> routes from a kernel after each change is the only way to keep the
>> cache consistent.  Some more info can be found in the following never
>> addressed issues:
>>   https://bugzilla.redhat.com/1337860
>>   https://bugzilla.redhat.com/1337855
>>
>> It seems to be believed that NetworkManager "mostly" does incremental
>> updates right.  But it is still not completely correct, will re-dump
>> the whole table in certain cases, and it takes a huge amount of very
>> complicated code to do the accounting and route comparisons.
>>
>> Going back to ovs-vswitchd, it currently dumps routes from all the
>> routing tables.  If it will get conflicting routes from multiple
>> tables, the cache will not be useful.  The routing cache in userspace
>> is primarily used for checking the egress port for tunneled traffic
>> and this way also detecting link state changes for a tunnel port.
>> For userspace datapath it is used for actual routing of the packet
>> after sending to a native tunnel.
>> With kernel datapath we don't really have a mechanism to know which
>> routing table will actually be used by the kernel after encapsulation,
>> so our lookups on a cache may be incorrect because of this as well.
>>
>> So, unless all the relevant routes are in the standard tables, the
>> lookup in userspace route cache is unreliable.
>>
>> Luckily, most setups are not using any complicated routing in
>> non-standard tables that OVS has to be aware of.
>>
>> It is possible, but unlikely, that standard routing tables are
>> completely empty while some other custom table is not, and all the OVS
>> tunnel traffic is directed to that table.  That would be the only
>> scenario where dumping non-standard tables would make sense.  But it
>> seems like this kind of setup will likely need a way to tell OVS from
>> which table the routes should be taken, or we'll need to dump routing
>> rules and keep a separate cache for each table, so we can first match
>> on rules and then lookup correct routes in a specific table.  I'm not
>> sure if trying to implement all that is justified.
>>
>> For now, stop considering routes from non-standard tables to avoid
>> mixing different tables together and also wasting CPU resources.
>>
>> This fixes a high CPU usage in ovs-vswitchd in case a BGP daemon is
>> running on a same host and in a same network namespace with OVS using
>> its own custom routing table.
>>
>> Unfortunately, there seems to be no way to tell the kernel to send
>> updates only for particular tables.  So, we'll still receive and parse
>> all of them.  But they will not result in a full cache invalidation in
>> most cases.
>>
>> Linux kernel v4.20 introduced filtering support for RTM_GETROUTE dumps.
>> So, we can make use of it and dump only standard tables when we get a
>> relevant route update.  NETLINK_GET_STRICT_CHK has to be enabled on
>> the socket for filtering to work.  There is no reason to not enable it
>> by default, if supported.  It is not used outside of NETLINK_ROUTE.
>>
>> Fixes: f0e167f0dbad ("route-table: Handle route updates more robustly.")
>> Fixes: ea83a2fcd0d3 ("lib: Show tunnel egress interface in ovsdb")
>> Reported-at: https://github.com/openvswitch/ovs-issues/issues/185
>> Reported-at: 
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052091.html
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
>>  lib/netlink-protocol.h | 10 ++++++
>>  lib/netlink-socket.c   |  8 +++++
>>  lib/route-table.c      | 80 +++++++++++++++++++++++++++++++++---------
>>  tests/system-route.at  | 64 +++++++++++++++++++++++++++++++++
>>  4 files changed, 146 insertions(+), 16 deletions(-)
>>
>> diff --git a/lib/netlink-protocol.h b/lib/netlink-protocol.h
>> index 6eaa7035a..e4bb28ac9 100644
>> --- a/lib/netlink-protocol.h
>> +++ b/lib/netlink-protocol.h
>> @@ -155,6 +155,11 @@ enum {
>>  #define NLA_TYPE_MASK       ~(NLA_F_NESTED | NLA_F_NET_BYTEORDER)
>>  #endif
>>  
>> +/* Introduced in v4.4. */
>> +#ifndef NLM_F_DUMP_FILTERED
>> +#define NLM_F_DUMP_FILTERED 0x20
>> +#endif
>> +
>>  /* These were introduced all together in 2.6.14.  (We want our programs to
>>   * support the newer kernel features even if compiled with older headers.) 
>> */
>>  #ifndef NETLINK_ADD_MEMBERSHIP
>> @@ -168,6 +173,11 @@ enum {
>>  #define NETLINK_LISTEN_ALL_NSID 8
>>  #endif
>>  
>> +/* Strict checking of netlink arguments introduced in Linux kernel v4.20. */
>> +#ifndef NETLINK_GET_STRICT_CHK
>> +#define NETLINK_GET_STRICT_CHK 12
>> +#endif
>> +
>>  /* These were introduced all together in 2.6.23.  (We want our programs to
>>   * support the newer kernel features even if compiled with older headers.) 
>> */
>>  #ifndef CTRL_ATTR_MCAST_GRP_MAX
>> diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
>> index 80da20d9f..eea880b2c 100644
>> --- a/lib/netlink-socket.c
>> +++ b/lib/netlink-socket.c
>> @@ -205,6 +205,14 @@ nl_sock_create(int protocol, struct nl_sock **sockp)
>>          }
>>      }
>>  
>> +    /* Strict checking only supported for NETLINK_ROUTE. */
>> +    if (protocol == NETLINK_ROUTE
>> +        && setsockopt(sock->fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
>> +                      &one, sizeof one) < 0) {
>> +        VLOG_INFO("netlink: could not enable strict checking (%s)",
>> +                  ovs_strerror(errno));
> 
> Maybe this message is a bit scary for users - but I'm not sure what
> message might be less alarming.  Something like:
> 
> netlink: strict checking off (%s)
> 
> Not a big deal, just a kindof nit.  I hope it doesn't become VLOG_WARN
> later or something, and want to avoid someone reading it and thinking
> that it is a misclassified warning.

How about we do:

  VLOG(errno == ENOPROTOOPT ? VLL_DBG : VLL_WARN, ... );

i.e. keep it as a debug message for kernels that do not support it
and WARN if something actually went wrong.  What do you think?

> 
> I don't think it could happen often, but if we have opened a large
> number of nl sockets (maybe some large nl_pool_alloc) then we end up
> with quite a few of these messages as well.  Maybe we can just vlog this
> once?

We currently have a single place that opens NETLINK_ROUTE sockets, and
pools are per-protocol, so we should not have more than one of these.
Also should not be a problem with DBG.

> 
>> +    }
>> +
>>      retval = get_socket_rcvbuf(sock->fd);
>>      if (retval < 0) {
>>          retval = -retval;
>> diff --git a/lib/route-table.c b/lib/route-table.c
>> index 9927dcc18..f1fe32714 100644
>> --- a/lib/route-table.c
>> +++ b/lib/route-table.c
>> @@ -26,6 +26,7 @@
>>  #include <linux/rtnetlink.h>
>>  #include <net/if.h>
>>  
>> +#include "coverage.h"
>>  #include "hash.h"
>>  #include "netdev.h"
>>  #include "netlink.h"
>> @@ -44,6 +45,8 @@
>>  
>>  VLOG_DEFINE_THIS_MODULE(route_table);
>>  
>> +COVERAGE_DEFINE(route_table_dump);
>> +
>>  struct route_data {
>>      /* Copied from struct rtmsg. */
>>      unsigned char rtm_dst_len;
>> @@ -80,7 +83,7 @@ static struct nln_notifier *name_notifier = NULL;
>>  
>>  static bool route_table_valid = false;
>>  
>> -static int route_table_reset(void);
>> +static void route_table_reset(void);
>>  static void route_table_handle_msg(const struct route_table_msg *);
>>  static int route_table_parse(struct ofpbuf *, struct route_table_msg *);
>>  static void route_table_change(const struct route_table_msg *, void *);
>> @@ -153,26 +156,22 @@ route_table_wait(void)
>>      ovs_mutex_unlock(&route_table_mutex);
>>  }
>>  
>> -static int
>> -route_table_reset(void)
>> +static bool
>> +route_table_dump_one_table(unsigned char id)
>>  {
>> -    struct nl_dump dump;
>> -    struct rtgenmsg *rtgenmsg;
>>      uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
>>      struct ofpbuf request, reply, buf;
>> -
>> -    route_map_clear();
>> -    netdev_get_addrs_list_flush();
>> -    route_table_valid = true;
>> -    rt_change_seq++;
>> +    struct rtmsg *rq_msg;
>> +    bool filtered = true;
>> +    struct nl_dump dump;
>>  
>>      ofpbuf_init(&request, 0);
>>  
>> -    nl_msg_put_nlmsghdr(&request, sizeof *rtgenmsg, RTM_GETROUTE,
>> -                        NLM_F_REQUEST);
>> +    nl_msg_put_nlmsghdr(&request, sizeof *rq_msg, RTM_GETROUTE, 
>> NLM_F_REQUEST);
>>  
>> -    rtgenmsg = ofpbuf_put_zeros(&request, sizeof *rtgenmsg);
>> -    rtgenmsg->rtgen_family = AF_UNSPEC;
>> +    rq_msg = ofpbuf_put_zeros(&request, sizeof *rq_msg);
>> +    rq_msg->rtm_family = AF_UNSPEC;
>> +    rq_msg->rtm_table = id;
> 
> Any reason to not consider setting RTA_TABLE attribute rather than use
> rtm_table.  RTA_TABLE supports the extended range, rather than just the
> 8bits range that rtm_table can fit.

I just wanted to avoid unnecessary memory allocation for an attribute.
All the tables we care about are in 8 bit range.  And compiler protects
us from an integer overflow, since all types are set to unsigned char.

We can add RTA_TABLE attribute, but we'll basically add the same field
twice as we're still putting the rtm_table in the message.

We could put RTA_TABLE just to be extra safe, but I'm not sure.

Does that make sense?

> 
>>  
>>      nl_dump_start(&dump, NETLINK_ROUTE, &request);
>>      ofpbuf_uninit(&request);
>> @@ -182,12 +181,43 @@ route_table_reset(void)
>>          struct route_table_msg msg;
>>  
>>          if (route_table_parse(&reply, &msg)) {
>> +            struct nlmsghdr *nlmsghdr = nl_msg_nlmsghdr(&reply);
>> +
>> +            /* Older kernels do not support filtering. */
>> +            if (!(nlmsghdr->nlmsg_flags & NLM_F_DUMP_FILTERED)) {
>> +                filtered = false;
>> +            }
>>              route_table_handle_msg(&msg);
>>          }
>>      }
>>      ofpbuf_uninit(&buf);
>> +    nl_dump_done(&dump);
>> +
>> +    return filtered;
>> +}
>> +
>> +static void
>> +route_table_reset(void)
>> +{
>> +    unsigned char tables[] = {
>> +        RT_TABLE_DEFAULT,
>> +        RT_TABLE_MAIN,
>> +        RT_TABLE_LOCAL,
>> +    };
>>  
>> -    return nl_dump_done(&dump);
>> +    route_map_clear();
>> +    netdev_get_addrs_list_flush();
>> +    route_table_valid = true;
>> +    rt_change_seq++;
>> +
>> +    COVERAGE_INC(route_table_dump);
>> +
>> +    for (size_t i = 0; i < ARRAY_SIZE(tables); i++) {
>> +        if (!route_table_dump_one_table(tables[i])) {
>> +            /* Got unfiltered reply, no need to dump further. */
>> +            break;
>> +        }
>> +    }
>>  }
>>  
>>  /* Return RTNLGRP_IPV4_ROUTE or RTNLGRP_IPV6_ROUTE on success, 0 on parse
>> @@ -203,6 +233,7 @@ route_table_parse(struct ofpbuf *buf, struct 
>> route_table_msg *change)
>>          [RTA_GATEWAY] = { .type = NL_A_U32, .optional = true },
>>          [RTA_MARK] = { .type = NL_A_U32, .optional = true },
>>          [RTA_PREFSRC] = { .type = NL_A_U32, .optional = true },
>> +        [RTA_TABLE] = { .type = NL_A_U32, .optional = true },
>>      };
>>  
>>      static const struct nl_policy policy6[] = {
>> @@ -211,6 +242,7 @@ route_table_parse(struct ofpbuf *buf, struct 
>> route_table_msg *change)
>>          [RTA_MARK] = { .type = NL_A_U32, .optional = true },
>>          [RTA_GATEWAY] = { .type = NL_A_IPV6, .optional = true },
>>          [RTA_PREFSRC] = { .type = NL_A_IPV6, .optional = true },
>> +        [RTA_TABLE] = { .type = NL_A_U32, .optional = true },
>>      };
>>  
>>      struct nlattr *attrs[ARRAY_SIZE(policy)];
>> @@ -232,6 +264,7 @@ route_table_parse(struct ofpbuf *buf, struct 
>> route_table_msg *change)
>>  
>>      if (parsed) {
>>          const struct nlmsghdr *nlmsg;
>> +        uint32_t table_id;
>>          int rta_oif;      /* Output interface index. */
>>  
>>          nlmsg = buf->data;
>> @@ -247,6 +280,19 @@ route_table_parse(struct ofpbuf *buf, struct 
>> route_table_msg *change)
>>              rtm->rtm_type != RTN_LOCAL) {
>>              change->relevant = false;
>>          }
>> +
>> +        table_id = rtm->rtm_table;
>> +        if (attrs[RTA_TABLE]) {
>> +            table_id = nl_attr_get_u32(attrs[RTA_TABLE]);
>> +        }
>> +        /* Do not consider changes in non-standard routing tables. */
>> +        if (table_id
>> +            && table_id != RT_TABLE_DEFAULT
>> +            && table_id != RT_TABLE_MAIN
>> +            && table_id != RT_TABLE_LOCAL) {
>> +            change->relevant = false;
>> +        }
>> +
>>          change->nlmsg_type     = nlmsg->nlmsg_type;
>>          change->rd.rtm_dst_len = rtm->rtm_dst_len + (ipv4 ? 96 : 0);
>>          change->rd.local = rtm->rtm_type == RTN_LOCAL;
>> @@ -312,7 +358,9 @@ static void
>>  route_table_change(const struct route_table_msg *change OVS_UNUSED,
>>                     void *aux OVS_UNUSED)
>>  {
>> -    route_table_valid = false;
>> +    if (!change || change->relevant) {
>> +        route_table_valid = false;
>> +    }
>>  }
>>  
>>  static void
>> diff --git a/tests/system-route.at b/tests/system-route.at
>> index 114aaebc7..c0ecad6cf 100644
>> --- a/tests/system-route.at
>> +++ b/tests/system-route.at
>> @@ -64,3 +64,67 @@ Cached: fc00:db8:beef::13/128 dev br0 GW fc00:db8:cafe::1 
>> SRC fc00:db8:cafe::2])
>>  
>>  OVS_TRAFFIC_VSWITCHD_STOP
>>  AT_CLEANUP
>> +
>> +dnl Checks that OVS doesn't use routes from non-standard tables.
>> +AT_SETUP([ovs-route - route tables])
>> +AT_KEYWORDS([route])
>> +OVS_TRAFFIC_VSWITCHD_START()
>> +
>> +dnl Create tap port.
>> +on_exit 'ip link del p1-route'
>> +AT_CHECK([ip tuntap add name p1-route mode tap])
>> +AT_CHECK([ip link set p1-route up])
>> +
>> +dnl Add ip address.
>> +AT_CHECK([ip addr add 10.0.0.17/24 dev p1-route], [0], [stdout])
>> +
>> +dnl Check that OVS catches route updates.
>> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
>> [dnl
>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local])
>> +
>> +dnl Add a route to the main routing table and check that OVS caches
>> +dnl this new route.
>> +AT_CHECK([ip route add 10.0.0.18/32 dev p1-route])
>> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
>> [dnl
>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
>> +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17])
>> +
>> +dnl Add a route to a custom routing table and check that OVS doesn't cache 
>> it.
>> +AT_CHECK([ip route add 10.0.0.19/32 dev p1-route table 42])
>> +AT_CHECK([ip route show table 42 | grep 'p1-route' | grep -q '10.0.0.19'])
>> +dnl Give the main thread a chance to act.
>> +AT_CHECK([ovs-appctl revalidator/wait])
>> +dnl Check that OVS didn't learn this route.
>> +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl
>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
>> +Cached: 10.0.0.18/32 dev p1-route SRC 10.0.0.17
>> +])
>> +
>> +dnl Delete a route from the main table and check that OVS removes the route
>> +dnl from the cache.
>> +AT_CHECK([ip route del 10.0.0.18/32 dev p1-route])
>> +OVS_WAIT_UNTIL_EQUAL([ovs-appctl ovs/route/show | grep 'p1-route' | sort], 
>> [dnl
>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local])
>> +
>> +dnl Delete a route from a custom routing table and check that the cache
>> +dnl dosn't change.
>> +AT_CHECK([ip route del 10.0.0.19/32 dev p1-route table 42])
>> +dnl Give the main thread a chance to act.
>> +AT_CHECK([ovs-appctl revalidator/wait])
>> +dnl Check that the cache is still the same.
>> +AT_CHECK([ovs-appctl ovs/route/show | grep 'p1-route' | sort], [0], [dnl
>> +Cached: 10.0.0.0/24 dev p1-route SRC 10.0.0.17
>> +Cached: 10.0.0.17/32 dev p1-route SRC 10.0.0.17 local
>> +])
>> +
>> +dnl Delete ip address.
>> +AT_CHECK([ip addr del 10.0.0.17/24 dev p1-route], [0], [stdout])
>> +dnl Check that routes were removed from OVS.
>> +OVS_WAIT_UNTIL([test $(ovs-appctl ovs/route/show | grep -c 'p1-route') -eq 
>> 0 ])
>> +
>> +OVS_TRAFFIC_VSWITCHD_STOP
>> +AT_CLEANUP
> 

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to