On 7/14/22 10:15, Ales Musil wrote:
> Hi,
> 

Hi Ales,

> thank you for the review, I have posted v4. See also my replies inline.
> 
> On Wed, Jul 13, 2022 at 10:02 PM Dumitru Ceara <[email protected]> wrote:
> 
>> On 7/13/22 14:36, Ales Musil wrote:
>>> The ovn-controller had a race condition over MAC
>>> binding table with other controllers. When multiple
>>> controllers received GARP from single source usually
>>> the one who was able to win the race put it into SB.
>>> The others got transaction error which triggered
>>> full recompute even if it's not needed.
>>>
>>> In order to reduce the chance of multiple controllers
>>> trying to insert same row at the same time add slight
>>> delay to the MAC binding processing. This delay
>>> is random interval between 1-50 in ms. This greatly
>>> reduces the chance that multiple controllers will try
>>> to add MAC binding at exactly the same time. This applies
>>> only to multicast ARP request which applies only to GARP
>>> that is sent to broadcast address.
>>>
>>> Local testing with this delay vs without show significantly
>>> reduced chance of hitting the transaction error.
>>>
>>> During the testing 10 GARPs was sent to two controllers
>>> at the same time. Without proposed fix at least one controller
>>> had multiple transaction errors present every test iteration.
>>>
>>> With the proposed fix the transaction error was reduced to a
>>> single one when it happened which was usually in less than
>>> 10% of the iterations.
>>>
>>> As was mentioned before the race condition can still happen,
>>> but the chance is greatly reduced.
>>>
>>> Suggested-by: Daniel Alvarez Sanchez <[email protected]>
>>> Signed-off-by: Ales Musil <[email protected]>
>>> ---
>>
>> Hi Ales,
>>
>> Thanks for this revision!  In general it makes sense but I have some
>> comments here and there.
>>
>>> v3: Rebase on top of current main.
>>>     Change the delay to be per ARP request instead of
>>>     single delay for everything. This allows the possibility
>>>     to delay only multicast/broadcast AP as suggested by Han.
>>> ---
>>>  controller/mac-learn.c | 23 ++++++++++++++++++++++-
>>>  controller/mac-learn.h |  7 ++++++-
>>>  controller/pinctrl.c   | 33 ++++++++++++++++++++++++---------
>>>  tests/ovn.at           |  7 +++++++
>>>  4 files changed, 59 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/controller/mac-learn.c b/controller/mac-learn.c
>>> index 27634dca8..b7b392bb1 100644
>>> --- a/controller/mac-learn.c
>>> +++ b/controller/mac-learn.c
>>> @@ -20,12 +20,15 @@
>>>  /* OpenvSwitch lib includes. */
>>>  #include "openvswitch/vlog.h"
>>>  #include "lib/packets.h"
>>> +#include "lib/random.h"
>>>  #include "lib/smap.h"
>>> +#include "lib/timeval.h"
>>>
>>>  VLOG_DEFINE_THIS_MODULE(mac_learn);
>>>
>>>  #define MAX_MAC_BINDINGS 1000
>>>  #define MAX_FDB_ENTRIES  1000
>>> +#define MAX_MAC_BINDING_DELAY_MSEC 50
>>>
>>>  static size_t mac_binding_hash(uint32_t dp_key, uint32_t port_key,
>>>                                 struct in6_addr *);
>>> @@ -64,7 +67,7 @@ ovn_mac_bindings_destroy(struct hmap *mac_bindings)
>>>  struct mac_binding *
>>>  ovn_mac_binding_add(struct hmap *mac_bindings, uint32_t dp_key,
>>>                      uint32_t port_key, struct in6_addr *ip,
>>> -                    struct eth_addr mac)
>>> +                    struct eth_addr mac, bool is_delayed)
>>
>> What if we pass 'int max_delay_ms' instead of 'bool is_delayed'?
>>
> 
> This kinda breaks the encapsulation a bit. Why would the code outside
> need to know what is the max delay? I am still inclined to have it as a
> bool.
> 

It's arguable in my opinion.  The code outside (pinctrl) uses the delay
value already.  Anyhow, it's probably fine to leave it as is.

> 
>>
>>>  {
>>>      uint32_t hash = mac_binding_hash(dp_key, port_key, ip);
>>>
>>> @@ -79,6 +82,11 @@ ovn_mac_binding_add(struct hmap *mac_bindings,
>> uint32_t dp_key,
>>>          mb->dp_key = dp_key;
>>>          mb->port_key = port_key;
>>>          mb->ip = *ip;
>>> +        mb->delay = 0;
>>> +        mb->created = time_msec();
>>> +        if (is_delayed) {
>>> +            mb->delay = random_range(MAX_MAC_BINDING_DELAY_MSEC) + 1;
>>> +        }
>>
>> This could become:
>>
>> mb->delay = max_delay_ms ? random_range(max_delay_ms) + 1 : 0;
>>
> 
> Done
> 
> 
>>
>>>          hmap_insert(mac_bindings, &mb->hmap_node, hash);
>>>      }
>>>      mb->mac = mac;
>>> @@ -86,6 +94,19 @@ ovn_mac_binding_add(struct hmap *mac_bindings,
>> uint32_t dp_key,
>>>      return mb;
>>>  }
>>>
>>> +long long
>>> +ovn_mac_binding_shortest_delay(struct hmap *mac_bindings)
>>> +{
>>> +    long long min = LLONG_MAX;
>>> +
>>> +    struct mac_binding *mb;
>>> +    HMAP_FOR_EACH (mb, hmap_node, mac_bindings) {
>>> +        min = MIN(mb->delay, min);
>>> +    }
>>> +
>>> +    return min;
>>> +}
>>
>> We don't really need to return the shortest delay.  We can rewrite this as:
>>
>> void
>> ovn_mac_binding_wait(struct hmap *mac_bindings)
>> {
>>     struct mac_binding *mb;
>>     HMAP_FOR_EACH (mb, hmap_node, mac_bindings) {
>>         poll_timer_wait(mb->delay);
>>     }
>> }
>>
>> and call it in wait_put_mac_bindings().
>>
> 
> That's actually a good trick, thanks.
> 
> 
>>
>>> +
>>>  /* fdb functions. */
>>>  void
>>>  ovn_fdb_init(struct hmap *fdbs)
>>> diff --git a/controller/mac-learn.h b/controller/mac-learn.h
>>> index e7e8ba2d3..d306bfa5a 100644
>>> --- a/controller/mac-learn.h
>>> +++ b/controller/mac-learn.h
>>> @@ -31,16 +31,21 @@ struct mac_binding {
>>>
>>>      /* Value. */
>>>      struct eth_addr mac;
>>> +
>>> +    /* Delay for multicast ARP. */
>>
>> This doesn't seem accurate.  It's actually "delay before adding
>> mac_binding to the SB".  It will be 0 for mac_bindings learned from
>> unicast replies.
>>
> 
> Right, done.
> 
> 
>>
>>> +    long long created;
>>> +    uint32_t delay;
>>
>> Let's add the units please, something like:
>>
>> long long created_ms;
>> uint32_t delay_ms;
>>
>> Or maybe 'commit_delay_ms'?
>>
> 
> Done
> 
> 
>>
>>>  };
>>>
>>>  void ovn_mac_bindings_init(struct hmap *mac_bindings);
>>>  void ovn_mac_bindings_flush(struct hmap *mac_bindings);
>>>  void ovn_mac_bindings_destroy(struct hmap *mac_bindings);
>>> +long long ovn_mac_binding_shortest_delay(struct hmap *mac_bindings);
>>>
>>>  struct mac_binding *ovn_mac_binding_add(struct hmap *mac_bindings,
>>>                                          uint32_t dp_key, uint32_t
>> port_key,
>>>                                          struct in6_addr *ip,
>>> -                                        struct eth_addr mac);
>>> +                                        struct eth_addr mac, bool
>> is_delayed);
>>>
>>>
>>>
>>> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
>>> index f954362b7..9c615ebd5 100644
>>> --- a/controller/pinctrl.c
>>> +++ b/controller/pinctrl.c
>>> @@ -4134,9 +4134,13 @@ pinctrl_handle_put_mac_binding(const struct flow
>> *md,
>>>          memcpy(&ip_key, &ip6, sizeof ip_key);
>>>      }
>>>
>>> +    /* If the ARP reply was unicast we should not delay it,
>>> +     * there won't be any race. */
>>> +    bool is_multicast = eth_addr_is_multicast(headers->dl_dst);
>>>      struct mac_binding *mb = ovn_mac_binding_add(&put_mac_bindings,
>> dp_key,
>>>                                                   port_key, &ip_key,
>>> -                                                 headers->dl_src);
>>> +                                                 headers->dl_src,
>>> +                                                 is_multicast);
>>
>> It's mainly a personal preference but I'd rewrite this as:
>>
>> uint32_t max_delay_ms = eth_addr_is_multicast(headers->dl_dst)
>>                         ? MAX_MAC_BINDING_DELAY_MSEC
>>                         : 0;
>> truct mac_binding *mb = ovn_mac_binding_add(&put_mac_bindings, dp_key,
>>                                             port_key, &ip_key,
>>                                             headers->dl_src,
>>                                             max_delay_ms);
>>
>>
> Same as the above, I would not expose the max_delay unless there is benefit
> of doing that.
> 
> 

Ok, I'd at least change 'bool is_multicast' to 'bool is_unicast' to
match the comment above it.

It could be:

bool is_unicast = !eth_addr_is_multicast(headers->dl_dst);

> 
>>>      if (!mb) {
>>>          COVERAGE_INC(pinctrl_drop_put_mac_binding);
>>>          return;
>>> @@ -4296,14 +4300,23 @@ run_put_mac_bindings(struct ovsdb_idl_txn
>> *ovnsb_idl_txn,
>>>          return;
>>>      }
>>>
>>> -    const struct mac_binding *mb;
>>> -    HMAP_FOR_EACH (mb, hmap_node, &put_mac_bindings) {
>>> -        run_put_mac_binding(ovnsb_idl_txn,
>> sbrec_datapath_binding_by_key,
>>> -                            sbrec_port_binding_by_key,
>>> -                            sbrec_mac_binding_by_lport_ip,
>>> -                            mb);
>>> +    long long now = time_msec();
>>> +
>>> +    struct mac_binding *mb;
>>> +    HMAP_FOR_EACH_SAFE (mb, hmap_node, &put_mac_bindings) {
>>> +            long long elapsed = now - mb->created;
>>> +            if (elapsed >= mb->delay) {
>>> +                run_put_mac_binding(ovnsb_idl_txn,
>>> +                                    sbrec_datapath_binding_by_key,
>>> +                                    sbrec_port_binding_by_key,
>>> +                                    sbrec_mac_binding_by_lport_ip,
>>> +                                    mb);
>>> +                hmap_remove(&put_mac_bindings, &mb->hmap_node);
>>> +                free(mb);
>>
>> We used to call ovn_mac_bindings_flush() to remove all elements, now we
>> remove them in the caller.  I think this breaks a bit the encapsulation
>> that mac-learn.[hc] tried to add.  Can we add a new api, e.g.,
>> ovn_mac_binding_remove(), that will remove the entry from the map and
>> free it?
>>
> 
> Makes sense, done.
> 
> 
>>
>> Now that we're here, what if the SB commit fails for some other
>> unrelated reason?  We'll force a new ARP request for future packets.  It
>> used to be like that before your patch too.  I'm just wondering if we
>> need to flush 'put_mac_bindings' only after we know for sure that the SB
>> record was created.  Definitely something that should be addressed (if
>> we go that way) in a different patch though.
>>
> 
> We could investigate that, however I am afraid that it would be hard to
> propagate
> as the transaction error happens in ovsdb_idl_loop_commit_and_wait, in the
> main context.
> 

Ok, we can revisit this in the future if needed.

> 
>>
>>> +            } else {
>>> +                mb->delay -= elapsed;
>>
>> Hmm, isn't this wrong?  We don't need the "else", do we?
>>
> 
> We actually do, consider having 2 mb waiting to be added one with 2 ms,
> second with 5 ms.
> Without removing what was elapsed the second would wait 7 ms instead, some
> of them could
> wait way longer than the max delay.
> 

Hmm, ok, but then you need to update created_ms too.  Otherwise, next
time we run this loop we will be potentially committing mac bindings too
early.  What if instead of storing the commit_delay_ms we store the
expected time to commit?  And instead of using poll_timer_wait() we
could use poll_timer_wait_until().

Thanks,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to