On Thu, Jul 14, 2022 at 11:02 AM Dumitru Ceara <[email protected]> wrote:

> On 7/14/22 10:15, Ales Musil wrote:
> > Hi,
> >
>
> Hi Ales,
>

Hi Dumitru,

posted v5.


>
> > thank you for the review, I have posted v4. See also my replies inline.
> >
> > On Wed, Jul 13, 2022 at 10:02 PM Dumitru Ceara <[email protected]>
> wrote:
> >
> >> On 7/13/22 14:36, Ales Musil wrote:
> >>> The ovn-controller had a race condition over MAC
> >>> binding table with other controllers. When multiple
> >>> controllers received GARP from single source usually
> >>> the one who was able to win the race put it into SB.
> >>> The others got transaction error which triggered
> >>> full recompute even if it's not needed.
> >>>
> >>> In order to reduce the chance of multiple controllers
> >>> trying to insert same row at the same time add slight
> >>> delay to the MAC binding processing. This delay
> >>> is random interval between 1-50 in ms. This greatly
> >>> reduces the chance that multiple controllers will try
> >>> to add MAC binding at exactly the same time. This applies
> >>> only to multicast ARP request which applies only to GARP
> >>> that is sent to broadcast address.
> >>>
> >>> Local testing with this delay vs without show significantly
> >>> reduced chance of hitting the transaction error.
> >>>
> >>> During the testing 10 GARPs was sent to two controllers
> >>> at the same time. Without proposed fix at least one controller
> >>> had multiple transaction errors present every test iteration.
> >>>
> >>> With the proposed fix the transaction error was reduced to a
> >>> single one when it happened which was usually in less than
> >>> 10% of the iterations.
> >>>
> >>> As was mentioned before the race condition can still happen,
> >>> but the chance is greatly reduced.
> >>>
> >>> Suggested-by: Daniel Alvarez Sanchez <[email protected]>
> >>> Signed-off-by: Ales Musil <[email protected]>
> >>> ---
> >>
> >> Hi Ales,
> >>
> >> Thanks for this revision!  In general it makes sense but I have some
> >> comments here and there.
> >>
> >>> v3: Rebase on top of current main.
> >>>     Change the delay to be per ARP request instead of
> >>>     single delay for everything. This allows the possibility
> >>>     to delay only multicast/broadcast AP as suggested by Han.
> >>> ---
> >>>  controller/mac-learn.c | 23 ++++++++++++++++++++++-
> >>>  controller/mac-learn.h |  7 ++++++-
> >>>  controller/pinctrl.c   | 33 ++++++++++++++++++++++++---------
> >>>  tests/ovn.at           |  7 +++++++
> >>>  4 files changed, 59 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/controller/mac-learn.c b/controller/mac-learn.c
> >>> index 27634dca8..b7b392bb1 100644
> >>> --- a/controller/mac-learn.c
> >>> +++ b/controller/mac-learn.c
> >>> @@ -20,12 +20,15 @@
> >>>  /* OpenvSwitch lib includes. */
> >>>  #include "openvswitch/vlog.h"
> >>>  #include "lib/packets.h"
> >>> +#include "lib/random.h"
> >>>  #include "lib/smap.h"
> >>> +#include "lib/timeval.h"
> >>>
> >>>  VLOG_DEFINE_THIS_MODULE(mac_learn);
> >>>
> >>>  #define MAX_MAC_BINDINGS 1000
> >>>  #define MAX_FDB_ENTRIES  1000
> >>> +#define MAX_MAC_BINDING_DELAY_MSEC 50
> >>>
> >>>  static size_t mac_binding_hash(uint32_t dp_key, uint32_t port_key,
> >>>                                 struct in6_addr *);
> >>> @@ -64,7 +67,7 @@ ovn_mac_bindings_destroy(struct hmap *mac_bindings)
> >>>  struct mac_binding *
> >>>  ovn_mac_binding_add(struct hmap *mac_bindings, uint32_t dp_key,
> >>>                      uint32_t port_key, struct in6_addr *ip,
> >>> -                    struct eth_addr mac)
> >>> +                    struct eth_addr mac, bool is_delayed)
> >>
> >> What if we pass 'int max_delay_ms' instead of 'bool is_delayed'?
> >>
> >
> > This kinda breaks the encapsulation a bit. Why would the code outside
> > need to know what is the max delay? I am still inclined to have it as a
> > bool.
> >
>
> It's arguable in my opinion.  The code outside (pinctrl) uses the delay
> value already.  Anyhow, it's probably fine to leave it as is.
>

I have encapsulated the check so the delay/commit_at is not exposed
outside.


>
> >
> >>
> >>>  {
> >>>      uint32_t hash = mac_binding_hash(dp_key, port_key, ip);
> >>>
> >>> @@ -79,6 +82,11 @@ ovn_mac_binding_add(struct hmap *mac_bindings,
> >> uint32_t dp_key,
> >>>          mb->dp_key = dp_key;
> >>>          mb->port_key = port_key;
> >>>          mb->ip = *ip;
> >>> +        mb->delay = 0;
> >>> +        mb->created = time_msec();
> >>> +        if (is_delayed) {
> >>> +            mb->delay = random_range(MAX_MAC_BINDING_DELAY_MSEC) + 1;
> >>> +        }
> >>
> >> This could become:
> >>
> >> mb->delay = max_delay_ms ? random_range(max_delay_ms) + 1 : 0;
> >>
> >
> > Done
> >
> >
> >>
> >>>          hmap_insert(mac_bindings, &mb->hmap_node, hash);
> >>>      }
> >>>      mb->mac = mac;
> >>> @@ -86,6 +94,19 @@ ovn_mac_binding_add(struct hmap *mac_bindings,
> >> uint32_t dp_key,
> >>>      return mb;
> >>>  }
> >>>
> >>> +long long
> >>> +ovn_mac_binding_shortest_delay(struct hmap *mac_bindings)
> >>> +{
> >>> +    long long min = LLONG_MAX;
> >>> +
> >>> +    struct mac_binding *mb;
> >>> +    HMAP_FOR_EACH (mb, hmap_node, mac_bindings) {
> >>> +        min = MIN(mb->delay, min);
> >>> +    }
> >>> +
> >>> +    return min;
> >>> +}
> >>
> >> We don't really need to return the shortest delay.  We can rewrite this
> as:
> >>
> >> void
> >> ovn_mac_binding_wait(struct hmap *mac_bindings)
> >> {
> >>     struct mac_binding *mb;
> >>     HMAP_FOR_EACH (mb, hmap_node, mac_bindings) {
> >>         poll_timer_wait(mb->delay);
> >>     }
> >> }
> >>
> >> and call it in wait_put_mac_bindings().
> >>
> >
> > That's actually a good trick, thanks.
> >
> >
> >>
> >>> +
> >>>  /* fdb functions. */
> >>>  void
> >>>  ovn_fdb_init(struct hmap *fdbs)
> >>> diff --git a/controller/mac-learn.h b/controller/mac-learn.h
> >>> index e7e8ba2d3..d306bfa5a 100644
> >>> --- a/controller/mac-learn.h
> >>> +++ b/controller/mac-learn.h
> >>> @@ -31,16 +31,21 @@ struct mac_binding {
> >>>
> >>>      /* Value. */
> >>>      struct eth_addr mac;
> >>> +
> >>> +    /* Delay for multicast ARP. */
> >>
> >> This doesn't seem accurate.  It's actually "delay before adding
> >> mac_binding to the SB".  It will be 0 for mac_bindings learned from
> >> unicast replies.
> >>
> >
> > Right, done.
> >
> >
> >>
> >>> +    long long created;
> >>> +    uint32_t delay;
> >>
> >> Let's add the units please, something like:
> >>
> >> long long created_ms;
> >> uint32_t delay_ms;
> >>
> >> Or maybe 'commit_delay_ms'?
> >>
> >
> > Done
> >
> >
> >>
> >>>  };
> >>>
> >>>  void ovn_mac_bindings_init(struct hmap *mac_bindings);
> >>>  void ovn_mac_bindings_flush(struct hmap *mac_bindings);
> >>>  void ovn_mac_bindings_destroy(struct hmap *mac_bindings);
> >>> +long long ovn_mac_binding_shortest_delay(struct hmap *mac_bindings);
> >>>
> >>>  struct mac_binding *ovn_mac_binding_add(struct hmap *mac_bindings,
> >>>                                          uint32_t dp_key, uint32_t
> >> port_key,
> >>>                                          struct in6_addr *ip,
> >>> -                                        struct eth_addr mac);
> >>> +                                        struct eth_addr mac, bool
> >> is_delayed);
> >>>
> >>>
> >>>
> >>> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
> >>> index f954362b7..9c615ebd5 100644
> >>> --- a/controller/pinctrl.c
> >>> +++ b/controller/pinctrl.c
> >>> @@ -4134,9 +4134,13 @@ pinctrl_handle_put_mac_binding(const struct flow
> >> *md,
> >>>          memcpy(&ip_key, &ip6, sizeof ip_key);
> >>>      }
> >>>
> >>> +    /* If the ARP reply was unicast we should not delay it,
> >>> +     * there won't be any race. */
> >>> +    bool is_multicast = eth_addr_is_multicast(headers->dl_dst);
> >>>      struct mac_binding *mb = ovn_mac_binding_add(&put_mac_bindings,
> >> dp_key,
> >>>                                                   port_key, &ip_key,
> >>> -                                                 headers->dl_src);
> >>> +                                                 headers->dl_src,
> >>> +                                                 is_multicast);
> >>
> >> It's mainly a personal preference but I'd rewrite this as:
> >>
> >> uint32_t max_delay_ms = eth_addr_is_multicast(headers->dl_dst)
> >>                         ? MAX_MAC_BINDING_DELAY_MSEC
> >>                         : 0;
> >> truct mac_binding *mb = ovn_mac_binding_add(&put_mac_bindings, dp_key,
> >>                                             port_key, &ip_key,
> >>                                             headers->dl_src,
> >>                                             max_delay_ms);
> >>
> >>
> > Same as the above, I would not expose the max_delay unless there is
> benefit
> > of doing that.
> >
> >
>
> Ok, I'd at least change 'bool is_multicast' to 'bool is_unicast' to
> match the comment above it.
>
> It could be:
>
> bool is_unicast = !eth_addr_is_multicast(headers->dl_dst);
>

Done.


>
> >
> >>>      if (!mb) {
> >>>          COVERAGE_INC(pinctrl_drop_put_mac_binding);
> >>>          return;
> >>> @@ -4296,14 +4300,23 @@ run_put_mac_bindings(struct ovsdb_idl_txn
> >> *ovnsb_idl_txn,
> >>>          return;
> >>>      }
> >>>
> >>> -    const struct mac_binding *mb;
> >>> -    HMAP_FOR_EACH (mb, hmap_node, &put_mac_bindings) {
> >>> -        run_put_mac_binding(ovnsb_idl_txn,
> >> sbrec_datapath_binding_by_key,
> >>> -                            sbrec_port_binding_by_key,
> >>> -                            sbrec_mac_binding_by_lport_ip,
> >>> -                            mb);
> >>> +    long long now = time_msec();
> >>> +
> >>> +    struct mac_binding *mb;
> >>> +    HMAP_FOR_EACH_SAFE (mb, hmap_node, &put_mac_bindings) {
> >>> +            long long elapsed = now - mb->created;
> >>> +            if (elapsed >= mb->delay) {
> >>> +                run_put_mac_binding(ovnsb_idl_txn,
> >>> +                                    sbrec_datapath_binding_by_key,
> >>> +                                    sbrec_port_binding_by_key,
> >>> +                                    sbrec_mac_binding_by_lport_ip,
> >>> +                                    mb);
> >>> +                hmap_remove(&put_mac_bindings, &mb->hmap_node);
> >>> +                free(mb);
> >>
> >> We used to call ovn_mac_bindings_flush() to remove all elements, now we
> >> remove them in the caller.  I think this breaks a bit the encapsulation
> >> that mac-learn.[hc] tried to add.  Can we add a new api, e.g.,
> >> ovn_mac_binding_remove(), that will remove the entry from the map and
> >> free it?
> >>
> >
> > Makes sense, done.
> >
> >
> >>
> >> Now that we're here, what if the SB commit fails for some other
> >> unrelated reason?  We'll force a new ARP request for future packets.  It
> >> used to be like that before your patch too.  I'm just wondering if we
> >> need to flush 'put_mac_bindings' only after we know for sure that the SB
> >> record was created.  Definitely something that should be addressed (if
> >> we go that way) in a different patch though.
> >>
> >
> > We could investigate that, however I am afraid that it would be hard to
> > propagate
> > as the transaction error happens in ovsdb_idl_loop_commit_and_wait, in
> the
> > main context.
> >
>
> Ok, we can revisit this in the future if needed.
>
> >
> >>
> >>> +            } else {
> >>> +                mb->delay -= elapsed;
> >>
> >> Hmm, isn't this wrong?  We don't need the "else", do we?
> >>
> >
> > We actually do, consider having 2 mb waiting to be added one with 2 ms,
> > second with 5 ms.
> > Without removing what was elapsed the second would wait 7 ms instead,
> some
> > of them could
> > wait way longer than the max delay.
> >
>
> Hmm, ok, but then you need to update created_ms too.  Otherwise, next
> time we run this loop we will be potentially committing mac bindings too
> early.  What if instead of storing the commit_delay_ms we store the
> expected time to commit?  And instead of using poll_timer_wait() we
> could use poll_timer_wait_until().
>

Good suggestion, done.


>
> Thanks,
> Dumitru
>
>
Thanks,
Ales

-- 

Ales Musil

Senior Software Engineer - OVN Core

Red Hat EMEA <https://www.redhat.com>

[email protected]    IM: amusil
<https://red.ht/sig>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to