On 11/7/24 17:34, Lorenzo Bianconi wrote:
> On Nov 06, Dumitru Ceara wrote:
>> Hi Lorenzo,
>>
>> On 9/19/24 19:08, Lorenzo Bianconi wrote:
>>> Introduce the capbility to periodically send ARP/ND packets for
>>> ECMP nexthops in order to resolve their L2 address. This is a preliminary
>>> patch to introduce the capability to flush stale ECMP CT entries.
>>>
>>> Signed-off-by: Lorenzo Bianconi <[email protected]>
>>> ---
>>>  controller/ovn-controller.8.xml |  11 ++
>>>  controller/ovn-controller.c     |   2 +
>>>  controller/pinctrl.c            | 269 +++++++++++++++++++++++++++++---
>>>  controller/pinctrl.h            |   2 +
>>>  4 files changed, 261 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/controller/ovn-controller.8.xml 
>>> b/controller/ovn-controller.8.xml
>>> index aeaa374c1..c7fcad1fa 100644
>>> --- a/controller/ovn-controller.8.xml
>>> +++ b/controller/ovn-controller.8.xml
>>> @@ -385,6 +385,17 @@
>>>          cap for the exponential backoff used by <code>ovn-controller</code>
>>>          to send GARPs packets.
>>>        </dd>
>>> +      <dt><code>external_ids:arp-nd-max-timeout-sec</code></dt>
>>> +      <dd>
>>> +        When used, this configuration value specifies the maximum timeout
>>> +        (in seconds) between two consecutive ARP/ND packets sent by
>>> +        <code>ovn-controller</code> to resolve ECMP nexthop mac address.
>>> +        <code>ovn-controller</code> by default sends just 4 ARP/ND packets
>>
>> So we stop ARPing the next hop after we did it 4 times?  This sounds
>> wrong.  The point of the feature is to detect when next hops change mac
>> address, right?
>>
>> Ah, after reading the rest of the patch you do:
>>
>> continuous_arp_nd = !!max_arp_nd_timeout;
>>
>> So if the option is not set (default) ovn-controller continously ARPs.
>> So the documentation here is misleading in my opinion.  I think it
>> should be "by default continuously sends ARPs.." or something along
>> those lines.
>>
>> Also, we miss a NEWS entry for this new option.
> 
> ack, I will fix it.
> 
>>
>>> +        with an exponential backoff timeout.
>>> +        Setting <code>external_ids:arp-nd-max-timeout-sec</code> allows to
>>> +        cap for the exponential backoff used by <code>ovn-controller</code>
>>> +        to send ARPs/NDs packets.
>>> +      </dd>
>>>        <dt><code>external_ids:ovn-bridge-remote</code></dt>
>>>        <dd>
>>>          <p>
>>> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
>>> index 64be08ce3..e72d28731 100644
>>> --- a/controller/ovn-controller.c
>>> +++ b/controller/ovn-controller.c
>>> @@ -5653,6 +5653,8 @@ main(int argc, char *argv[])
>>>                                      sbrec_mac_binding_table_get(
>>>                                          ovnsb_idl_loop.idl),
>>>                                      
>>> sbrec_bfd_table_get(ovnsb_idl_loop.idl),
>>> +                                    sbrec_ecmp_nexthop_table_get(
>>> +                                        ovnsb_idl_loop.idl),
>>>                                      br_int, chassis,
>>>                                      &runtime_data->local_datapaths,
>>>                                      &runtime_data->active_tunnels,
>>> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
>>> index c86b4f940..daf0bfa41 100644
>>> --- a/controller/pinctrl.c
>>> +++ b/controller/pinctrl.c
>>> @@ -170,6 +170,9 @@ static struct seq *pinctrl_main_seq;
>>>  static long long int garp_rarp_max_timeout = GARP_RARP_DEF_MAX_TIMEOUT;
>>>  static bool garp_rarp_continuous;
>>>  
>>> +static long long int arp_nd_max_timeout = GARP_RARP_DEF_MAX_TIMEOUT;
>>> +static bool arp_nd_continuous;
>>
>> I know we use individual global vars for garp_rarp_* variables too but
>> maybe it's time to refactor this into a structure instead of duplicating
>> this for every user of the "garp_rarp" mechanism.
> 
> like a global struct with garp_rarp/arp_nd info?
> 

Maybe at least one structure to store max_timeout and _continuous.  And
two variables of the type of the structure?  We can also add 'struct
send_garp_rarp_data` to it (with a better name).

>>
>>> +
>>>  static void *pinctrl_handler(void *arg);
>>>  
>>>  struct pinctrl {
>>> @@ -229,13 +232,17 @@ static void run_activated_ports(
>>>      const struct sbrec_chassis *chassis);
>>>  
>>>  static void init_send_garps_rarps(void);
>>> +static void init_send_arps_nds(void);
>>>  static void destroy_send_garps_rarps(void);
>>> +static void destroy_send_arps_nds(void);
>>>  static void send_garp_rarp_wait(long long int send_garp_rarp_time);
>>> +static void send_arp_nd_wait(long long int send_arp_nd_time);
>>
>> Same comment about duplication here.
> 
> do you prefer to have a single function to manage both of them?
> 

The init/destroy functions are identical, we can have one that receives
the structure to initialize as argument.  I suspect we can do something
similar for all garp/rarp/arp_nd processing.

>>
>>>  static void send_garp_rarp_prepare(
>>>      struct ovsdb_idl_txn *ovnsb_idl_txn,
>>>      struct ovsdb_idl_index *sbrec_port_binding_by_datapath,
>>>      struct ovsdb_idl_index *sbrec_port_binding_by_name,
>>>      struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip,
>>> +    const struct sbrec_ecmp_nexthop_table *ecmp_nh_table,
>>>      const struct ovsrec_bridge *,
>>>      const struct sbrec_chassis *,
>>>      const struct hmap *local_datapaths,
>>> @@ -245,6 +252,9 @@ static void send_garp_rarp_prepare(
>>>  static void send_garp_rarp_run(struct rconn *swconn,
>>>                                 long long int *send_garp_rarp_time)
>>>      OVS_REQUIRES(pinctrl_mutex);
>>> +static void send_arp_nd_run(struct rconn *swconn,
>>> +                            long long int *send_arp_nd_time)
>>> +    OVS_REQUIRES(pinctrl_mutex);
>>>  static void pinctrl_handle_nd_na(struct rconn *swconn,
>>>                                   const struct flow *ip_flow,
>>>                                   const struct match *md,
>>> @@ -554,6 +564,7 @@ pinctrl_init(void)
>>>  {
>>>      init_put_mac_bindings();
>>>      init_send_garps_rarps();
>>> +    init_send_arps_nds();
>>>      init_ipv6_ras();
>>>      init_ipv6_prefixd();
>>>      init_buffered_packets_ctx();
>>> @@ -4000,6 +4011,7 @@ pinctrl_handler(void *arg_)
>>>      static long long int send_ipv6_ra_time = LLONG_MAX;
>>>      /* Next GARP/RARP announcement in ms. */
>>>      static long long int send_garp_rarp_time = LLONG_MAX;
>>> +    static long long int send_arp_nd_time = LLONG_MAX;
>>>      /* Next multicast query (IGMP) in ms. */
>>>      static long long int send_mcast_query_time = LLONG_MAX;
>>>      static long long int svc_monitors_next_run_time = LLONG_MAX;
>>> @@ -4037,6 +4049,7 @@ pinctrl_handler(void *arg_)
>>>              if (may_inject_pkts()) {
>>>                  ovs_mutex_lock(&pinctrl_mutex);
>>>                  send_garp_rarp_run(swconn, &send_garp_rarp_time);
>>> +                send_arp_nd_run(swconn, &send_arp_nd_time);
>>>                  send_ipv6_ras(swconn, &send_ipv6_ra_time);
>>>                  send_ipv6_prefixd(swconn, &send_prefixd_time);
>>>                  send_mac_binding_buffered_pkts(swconn);
>>> @@ -4055,6 +4068,7 @@ pinctrl_handler(void *arg_)
>>>          rconn_recv_wait(swconn);
>>>          if (rconn_is_connected(swconn)) {
>>>              send_garp_rarp_wait(send_garp_rarp_time);
>>> +            send_arp_nd_wait(send_arp_nd_time);
>>>              ipv6_ra_wait(send_ipv6_ra_time);
>>>              ip_mcast_querier_wait(send_mcast_query_time);
>>>              svc_monitors_wait(svc_monitors_next_run_time);
>>> @@ -4151,6 +4165,7 @@ pinctrl_run(struct ovsdb_idl_txn *ovnsb_idl_txn,
>>>              const struct sbrec_service_monitor_table *svc_mon_table,
>>>              const struct sbrec_mac_binding_table *mac_binding_table,
>>>              const struct sbrec_bfd_table *bfd_table,
>>> +            const struct sbrec_ecmp_nexthop_table *ecmp_nh_table,
>>>              const struct ovsrec_bridge *br_int,
>>>              const struct sbrec_chassis *chassis,
>>>              const struct hmap *local_datapaths,
>>> @@ -4167,8 +4182,9 @@ pinctrl_run(struct ovsdb_idl_txn *ovnsb_idl_txn,
>>>                             sbrec_port_binding_by_key, chassis);
>>>      send_garp_rarp_prepare(ovnsb_idl_txn, sbrec_port_binding_by_datapath,
>>>                             sbrec_port_binding_by_name,
>>> -                           sbrec_mac_binding_by_lport_ip, br_int, chassis,
>>> -                           local_datapaths, active_tunnels, ovs_table);
>>> +                           sbrec_mac_binding_by_lport_ip, ecmp_nh_table,
>>> +                           br_int, chassis, local_datapaths, 
>>> active_tunnels,
>>> +                           ovs_table);
>>>      prepare_ipv6_ras(local_active_ports_ras, sbrec_port_binding_by_name);
>>>      prepare_ipv6_prefixd(ovnsb_idl_txn, sbrec_port_binding_by_name,
>>>                           local_active_ports_ipv6_pd, chassis,
>>> @@ -4702,6 +4718,7 @@ pinctrl_destroy(void)
>>>      latch_destroy(&pinctrl.pinctrl_thread_exit);
>>>      rconn_destroy(pinctrl.swconn);
>>>      destroy_send_garps_rarps();
>>> +    destroy_send_arps_nds();
>>>      destroy_ipv6_ras();
>>>      destroy_ipv6_prefixd();
>>>      destroy_buffered_packets_ctx();
>>> @@ -5029,7 +5046,8 @@ wait_put_mac_bindings(struct ovsdb_idl_txn 
>>> *ovnsb_idl_txn)
>>>   */
>>>  struct garp_rarp_data {
>>>      struct eth_addr ea;          /* Ethernet address of port. */
>>> -    ovs_be32 ipv4;               /* Ipv4 address of port. */
>>> +    struct in6_addr src_ip;      /* IP address of port. */
>>> +    struct in6_addr dst_ip;      /* Destination IP address */
>>>      long long int announce_time; /* Next announcement in ms. */
>>>      int backoff;                 /* Backoff timeout for the next
>>>                                    * announcement (in msecs). */
>>> @@ -5052,19 +5070,37 @@ destroy_send_garps_rarps(void)
>>>      shash_destroy_free_data(&send_garp_rarp_data);
>>>  }
>>>  
>>> +/* Contains ARPs data to be sent to track ECMP next-hop mac address. 
>>> Protected
>>> + * by pinctrl_mutex. */
>>> +static struct shash send_arp_nd_data;
>>> +
>>> +static void
>>> +init_send_arps_nds(void)
>>> +{
>>> +    shash_init(&send_arp_nd_data);
>>> +}
>>> +
>>> +static void
>>> +destroy_send_arps_nds(void)
>>> +{
>>> +    shash_destroy_free_data(&send_arp_nd_data);
>>> +}
>>> +
>>>  /* Runs with in the main ovn-controller thread context. */
>>>  static void
>>> -add_garp_rarp(const char *name, const struct eth_addr ea, ovs_be32 ip,
>>> -              uint32_t dp_key, uint32_t port_key)
>>> +add_garp_rarp(const char *name, const struct eth_addr ea,
>>> +              struct in6_addr src_ip, struct in6_addr dst_ip,
>>> +              uint32_t dp_key, uint32_t port_key, struct shash *shash)
>>>  {
>>>      struct garp_rarp_data *garp_rarp = xmalloc(sizeof *garp_rarp);
>>>      garp_rarp->ea = ea;
>>> -    garp_rarp->ipv4 = ip;
>>> +    garp_rarp->src_ip = src_ip;
>>> +    garp_rarp->dst_ip = dst_ip;
>>>      garp_rarp->announce_time = time_msec() + 1000;
>>>      garp_rarp->backoff = 1000; /* msec. */
>>>      garp_rarp->dp_key = dp_key;
>>>      garp_rarp->port_key = port_key;
>>> -    shash_add(&send_garp_rarp_data, name, garp_rarp);
>>> +    shash_add(shash, name, garp_rarp);
>>>  
>>>      /* Notify pinctrl_handler so that it can wakeup and process
>>>       * these GARP/RARP requests. */
>>> @@ -5113,9 +5149,10 @@ send_garp_rarp_update(struct ovsdb_idl_txn 
>>> *ovnsb_idl_txn,
>>>                      }
>>>                  } else if (ovnsb_idl_txn) {
>>>                      add_garp_rarp(name, laddrs->ea,
>>> -                                  laddrs->ipv4_addrs[i].addr,
>>> -                                  binding_rec->datapath->tunnel_key,
>>> -                                  binding_rec->tunnel_key);
>>> +                            
>>> in6_addr_mapped_ipv4(laddrs->ipv4_addrs[i].addr),
>>> +                            
>>> in6_addr_mapped_ipv4(laddrs->ipv4_addrs[i].addr),
>>> +                            binding_rec->datapath->tunnel_key,
>>> +                            binding_rec->tunnel_key, &send_garp_rarp_data);
>>>                      send_garp_locally(ovnsb_idl_txn,
>>>                                        sbrec_mac_binding_by_lport_ip,
>>>                                        local_datapaths, binding_rec, 
>>> laddrs->ea,
>>> @@ -5142,9 +5179,11 @@ send_garp_rarp_update(struct ovsdb_idl_txn 
>>> *ovnsb_idl_txn,
>>>                              garp_rarp->backoff = 1000; /* msec. */
>>>                          }
>>>                      } else {
>>> -                        add_garp_rarp(name, laddrs->ea,
>>> -                                      0, binding_rec->datapath->tunnel_key,
>>> -                                      binding_rec->tunnel_key);
>>> +                        struct in6_addr ip = {};
>>
>> Please avoid empty initializers.  See:
>> ef51c943a305 ("treewide: Avoid empty initializer")
> 
> ack, I will fix it
> 
>>
>> https://github.com/ovn-org/ovn/commit/ef51c943a305
>>
>>> +                        add_garp_rarp(name, laddrs->ea, ip, ip,
>>> +                                      binding_rec->datapath->tunnel_key,
>>> +                                      binding_rec->tunnel_key,
>>> +                                      &send_garp_rarp_data);
>>>                      }
>>>                      free(name);
>>>              }
>>> @@ -5182,10 +5221,10 @@ send_garp_rarp_update(struct ovsdb_idl_txn 
>>> *ovnsb_idl_txn,
>>>              ip = laddrs.ipv4_addrs[0].addr;
>>>          }
>>>  
>>> -        add_garp_rarp(binding_rec->logical_port,
>>> -                      laddrs.ea, ip,
>>> +        add_garp_rarp(binding_rec->logical_port, laddrs.ea,
>>> +                      in6_addr_mapped_ipv4(ip), in6_addr_mapped_ipv4(ip),
>>>                        binding_rec->datapath->tunnel_key,
>>> -                      binding_rec->tunnel_key);
>>> +                      binding_rec->tunnel_key, &send_garp_rarp_data);
>>>          if (ip) {
>>>              send_garp_locally(ovnsb_idl_txn, sbrec_mac_binding_by_lport_ip,
>>>                                local_datapaths, binding_rec, laddrs.ea, ip);
>>> @@ -5196,12 +5235,52 @@ send_garp_rarp_update(struct ovsdb_idl_txn 
>>> *ovnsb_idl_txn,
>>>      }
>>>  }
>>>  
>>> +/* Add or update a vif for which ARPs need to be announced. */
>>> +static void
>>> +send_arp_nd_update(const struct sbrec_port_binding *pb, const char 
>>> *nexthop,
>>> +                   long long int max_arp_timeout, bool continuous_arp_nd)
>>> +{
>>> +    volatile struct garp_rarp_data *e = shash_find_data(&send_arp_nd_data,
>>> +                                                        nexthop);
>>
>> volatile?  Why?
> 
> I just reused the approch in send_garp_rarp_update(). I will remove it.
> 
>>
>>> +    if (e) {
>>> +        e->dp_key = pb->datapath->tunnel_key;
>>> +        e->port_key = pb->tunnel_key;
>>> +        if (max_arp_timeout != arp_nd_max_timeout ||
>>> +            continuous_arp_nd != arp_nd_continuous) {
>>> +            /* reset backoff */
>>> +            e->announce_time = time_msec() + 1000;
>>> +            e->backoff = 1000; /* msec. */
>>
>> It's already the third place where we have this code that manually sets
>> announce_time and backoff, we should refactor this into a single
>> function that we reuse for all garp_rarp users.
> 
> ack, I will do.
> 
>>
>>> +        }
>>> +    } else {
>>> +        struct lport_addresses laddrs;
>>> +        if (!extract_lsp_addresses(pb->mac[0], &laddrs)) {
>>> +            return;
>>> +        }
>>> +
>>> +        if (laddrs.n_ipv4_addrs) {
>>> +            ovs_be32 dst_ip;
>>> +            inet_pton(AF_INET, nexthop, &dst_ip);
>>
>> ip_parse() and we should check the return value.
> 
> ack, I will fix it.
> 
>>
>>> +            add_garp_rarp(nexthop, laddrs.ea,
>>> +                          in6_addr_mapped_ipv4(laddrs.ipv4_addrs[0].addr),
>>> +                          in6_addr_mapped_ipv4(dst_ip),
>>> +                          pb->datapath->tunnel_key, pb->tunnel_key,
>>> +                          &send_arp_nd_data);
>>> +        } else if (laddrs.n_ipv6_addrs) {
>>> +            struct in6_addr dst_ip;
>>> +            inet_pton(AF_INET6, nexthop, &dst_ip);
>>
>> ipv6_parse() and we should check the return value.
> 
> ack, I will fix it.
> 
>>
>>> +            add_garp_rarp(nexthop, laddrs.ea, laddrs.ipv6_addrs[0].addr,
>>> +                          dst_ip, pb->datapath->tunnel_key, pb->tunnel_key,
>>> +                          &send_arp_nd_data);
>>> +        }
>>> +        destroy_lport_addresses(&laddrs);
>>> +    }
>>> +}
>>> +
>>>  /* Remove a vif from GARP announcements. */
>>>  static void
>>> -send_garp_rarp_delete(const char *lport)
>>> +send_garp_rarp_delete(struct shash *shash, const char *lport)
>>>  {
>>> -    struct garp_rarp_data *garp_rarp = shash_find_and_delete
>>> -                                       (&send_garp_rarp_data, lport);
>>> +    struct garp_rarp_data *garp_rarp = shash_find_and_delete(shash, lport);
>>>      free(garp_rarp);
>>>      notify_pinctrl_handler();
>>>  }
>>> @@ -5220,9 +5299,11 @@ send_garp_rarp(struct rconn *swconn, struct 
>>> garp_rarp_data *garp_rarp,
>>>      uint64_t packet_stub[128 / 8];
>>>      struct dp_packet packet;
>>>      dp_packet_use_stub(&packet, packet_stub, sizeof packet_stub);
>>> -    if (garp_rarp->ipv4) {
>>> +    ovs_be32 src_ipv4 = in6_addr_get_mapped_ipv4(&garp_rarp->src_ip);
>>> +    if (src_ipv4) {
>>>          compose_arp(&packet, ARP_OP_REQUEST, garp_rarp->ea, eth_addr_zero,
>>> -                    true, garp_rarp->ipv4, garp_rarp->ipv4);
>>> +                    true, src_ipv4,
>>> +                    in6_addr_get_mapped_ipv4(&garp_rarp->dst_ip));
>>>      } else {
>>>          compose_rarp(&packet, garp_rarp->ea);
>>>      }
>>> @@ -6524,6 +6605,25 @@ get_nat_addresses_and_keys(struct ovsdb_idl_index 
>>> *sbrec_port_binding_by_name,
>>>      }
>>>  }
>>>  
>>> +static void
>>> +get_local_ecmp_nexthop_map(
>>> +        const struct sbrec_ecmp_nexthop_table *ecmp_nh_table,
>>> +        struct ovsdb_idl_index *sbrec_port_binding_by_name,
>>> +        const struct sbrec_chassis *chassis,
>>> +        struct smap *local_ecmp_nexthop_map)
>>> +{
>>> +    const struct sbrec_ecmp_nexthop *sb_ecmp_nexthop;
>>> +    SBREC_ECMP_NEXTHOP_TABLE_FOR_EACH (sb_ecmp_nexthop, ecmp_nh_table) {
>>> +       const struct sbrec_port_binding *pb =
>>> +           lport_lookup_by_name(sbrec_port_binding_by_name,
>>> +                                sb_ecmp_nexthop->port);
>>> +       if (pb && !strcmp(pb->type, "l3gateway") && pb->chassis == chassis) 
>>> {
>>> +           smap_add_once(local_ecmp_nexthop_map, sb_ecmp_nexthop->nexthop,
>>> +                         sb_ecmp_nexthop->port);
>>> +       }
>>> +    }
>>> +}
>>> +
>>>  static void
>>>  send_garp_rarp_wait(long long int send_garp_rarp_time)
>>>  {
>>> @@ -6534,6 +6634,16 @@ send_garp_rarp_wait(long long int 
>>> send_garp_rarp_time)
>>>      }
>>>  }
>>>  
>>> +static void
>>> +send_arp_nd_wait(long long int send_arp_nd_time)
>>> +{
>>> +    /* Set the poll timer for next arp packet only if there is data to
>>> +     * be sent. */
>>> +    if (!shash_is_empty(&send_arp_nd_data)) {
>>> +        poll_timer_wait_until(send_arp_nd_time);
>>> +    }
>>> +}
>>
>> This is essentially identical to send_garp_rarp_wait().  Let's refactor?
> 
> ack, I will do
> 
>>
>>> +
>>>  /* Called with in the pinctrl_handler thread context. */
>>>  static void
>>>  send_garp_rarp_run(struct rconn *swconn, long long int 
>>> *send_garp_rarp_time)
>>> @@ -6556,6 +6666,86 @@ send_garp_rarp_run(struct rconn *swconn, long long 
>>> int *send_garp_rarp_time)
>>>      }
>>>  }
>>>  
>>> +static long long int
>>> +send_arp_nd(struct rconn *swconn, struct garp_rarp_data *garp_rarp,
>>> +            long long int current_time)
>>> +    OVS_REQUIRES(pinctrl_mutex)
>>> +{
>>
>> This whole function is almost identical to send_garp_rarp() - let's
>> avoid code duplication.  I think the only difference is the table we
>> inject the packet in:
>>
>> resubmit->table_id = OFTABLE_LOCAL_OUTPUT;
>>
>> vs
>>
>> resubmit->table_id = OFTABLE_LOG_INGRESS_PIPELINE;
>>
>> So, additional question:  why treat these ARP requests different than GARPs?
> 
> I guess we can probably resubmit even to OFTABLE_LOG_INGRESS_PIPELINE but it
> does not seem strictly necessary, isn't it? We can just save some cpu cycles,
> no?
> 

I don't think the CPU usage is a concern here.  What I'm actually asking
is: do we care about the fact that these self originated ARPs skip all
forwarding decisions (like policies in the ingress pipeline) while GARPs
don't.

Thanks,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to