On 7/14/22 10:15, Ales Musil wrote: > Hi, > Hi Ales,
> thank you for the review, I have posted v4. See also my replies inline. > > On Wed, Jul 13, 2022 at 10:02 PM Dumitru Ceara <[email protected]> wrote: > >> On 7/13/22 14:36, Ales Musil wrote: >>> The ovn-controller had a race condition over MAC >>> binding table with other controllers. When multiple >>> controllers received GARP from single source usually >>> the one who was able to win the race put it into SB. >>> The others got transaction error which triggered >>> full recompute even if it's not needed. >>> >>> In order to reduce the chance of multiple controllers >>> trying to insert same row at the same time add slight >>> delay to the MAC binding processing. This delay >>> is random interval between 1-50 in ms. This greatly >>> reduces the chance that multiple controllers will try >>> to add MAC binding at exactly the same time. This applies >>> only to multicast ARP request which applies only to GARP >>> that is sent to broadcast address. >>> >>> Local testing with this delay vs without show significantly >>> reduced chance of hitting the transaction error. >>> >>> During the testing 10 GARPs was sent to two controllers >>> at the same time. Without proposed fix at least one controller >>> had multiple transaction errors present every test iteration. >>> >>> With the proposed fix the transaction error was reduced to a >>> single one when it happened which was usually in less than >>> 10% of the iterations. >>> >>> As was mentioned before the race condition can still happen, >>> but the chance is greatly reduced. >>> >>> Suggested-by: Daniel Alvarez Sanchez <[email protected]> >>> Signed-off-by: Ales Musil <[email protected]> >>> --- >> >> Hi Ales, >> >> Thanks for this revision! In general it makes sense but I have some >> comments here and there. >> >>> v3: Rebase on top of current main. >>> Change the delay to be per ARP request instead of >>> single delay for everything. This allows the possibility >>> to delay only multicast/broadcast AP as suggested by Han. >>> --- >>> controller/mac-learn.c | 23 ++++++++++++++++++++++- >>> controller/mac-learn.h | 7 ++++++- >>> controller/pinctrl.c | 33 ++++++++++++++++++++++++--------- >>> tests/ovn.at | 7 +++++++ >>> 4 files changed, 59 insertions(+), 11 deletions(-) >>> >>> diff --git a/controller/mac-learn.c b/controller/mac-learn.c >>> index 27634dca8..b7b392bb1 100644 >>> --- a/controller/mac-learn.c >>> +++ b/controller/mac-learn.c >>> @@ -20,12 +20,15 @@ >>> /* OpenvSwitch lib includes. */ >>> #include "openvswitch/vlog.h" >>> #include "lib/packets.h" >>> +#include "lib/random.h" >>> #include "lib/smap.h" >>> +#include "lib/timeval.h" >>> >>> VLOG_DEFINE_THIS_MODULE(mac_learn); >>> >>> #define MAX_MAC_BINDINGS 1000 >>> #define MAX_FDB_ENTRIES 1000 >>> +#define MAX_MAC_BINDING_DELAY_MSEC 50 >>> >>> static size_t mac_binding_hash(uint32_t dp_key, uint32_t port_key, >>> struct in6_addr *); >>> @@ -64,7 +67,7 @@ ovn_mac_bindings_destroy(struct hmap *mac_bindings) >>> struct mac_binding * >>> ovn_mac_binding_add(struct hmap *mac_bindings, uint32_t dp_key, >>> uint32_t port_key, struct in6_addr *ip, >>> - struct eth_addr mac) >>> + struct eth_addr mac, bool is_delayed) >> >> What if we pass 'int max_delay_ms' instead of 'bool is_delayed'? >> > > This kinda breaks the encapsulation a bit. Why would the code outside > need to know what is the max delay? I am still inclined to have it as a > bool. > It's arguable in my opinion. The code outside (pinctrl) uses the delay value already. Anyhow, it's probably fine to leave it as is. > >> >>> { >>> uint32_t hash = mac_binding_hash(dp_key, port_key, ip); >>> >>> @@ -79,6 +82,11 @@ ovn_mac_binding_add(struct hmap *mac_bindings, >> uint32_t dp_key, >>> mb->dp_key = dp_key; >>> mb->port_key = port_key; >>> mb->ip = *ip; >>> + mb->delay = 0; >>> + mb->created = time_msec(); >>> + if (is_delayed) { >>> + mb->delay = random_range(MAX_MAC_BINDING_DELAY_MSEC) + 1; >>> + } >> >> This could become: >> >> mb->delay = max_delay_ms ? random_range(max_delay_ms) + 1 : 0; >> > > Done > > >> >>> hmap_insert(mac_bindings, &mb->hmap_node, hash); >>> } >>> mb->mac = mac; >>> @@ -86,6 +94,19 @@ ovn_mac_binding_add(struct hmap *mac_bindings, >> uint32_t dp_key, >>> return mb; >>> } >>> >>> +long long >>> +ovn_mac_binding_shortest_delay(struct hmap *mac_bindings) >>> +{ >>> + long long min = LLONG_MAX; >>> + >>> + struct mac_binding *mb; >>> + HMAP_FOR_EACH (mb, hmap_node, mac_bindings) { >>> + min = MIN(mb->delay, min); >>> + } >>> + >>> + return min; >>> +} >> >> We don't really need to return the shortest delay. We can rewrite this as: >> >> void >> ovn_mac_binding_wait(struct hmap *mac_bindings) >> { >> struct mac_binding *mb; >> HMAP_FOR_EACH (mb, hmap_node, mac_bindings) { >> poll_timer_wait(mb->delay); >> } >> } >> >> and call it in wait_put_mac_bindings(). >> > > That's actually a good trick, thanks. > > >> >>> + >>> /* fdb functions. */ >>> void >>> ovn_fdb_init(struct hmap *fdbs) >>> diff --git a/controller/mac-learn.h b/controller/mac-learn.h >>> index e7e8ba2d3..d306bfa5a 100644 >>> --- a/controller/mac-learn.h >>> +++ b/controller/mac-learn.h >>> @@ -31,16 +31,21 @@ struct mac_binding { >>> >>> /* Value. */ >>> struct eth_addr mac; >>> + >>> + /* Delay for multicast ARP. */ >> >> This doesn't seem accurate. It's actually "delay before adding >> mac_binding to the SB". It will be 0 for mac_bindings learned from >> unicast replies. >> > > Right, done. > > >> >>> + long long created; >>> + uint32_t delay; >> >> Let's add the units please, something like: >> >> long long created_ms; >> uint32_t delay_ms; >> >> Or maybe 'commit_delay_ms'? >> > > Done > > >> >>> }; >>> >>> void ovn_mac_bindings_init(struct hmap *mac_bindings); >>> void ovn_mac_bindings_flush(struct hmap *mac_bindings); >>> void ovn_mac_bindings_destroy(struct hmap *mac_bindings); >>> +long long ovn_mac_binding_shortest_delay(struct hmap *mac_bindings); >>> >>> struct mac_binding *ovn_mac_binding_add(struct hmap *mac_bindings, >>> uint32_t dp_key, uint32_t >> port_key, >>> struct in6_addr *ip, >>> - struct eth_addr mac); >>> + struct eth_addr mac, bool >> is_delayed); >>> >>> >>> >>> diff --git a/controller/pinctrl.c b/controller/pinctrl.c >>> index f954362b7..9c615ebd5 100644 >>> --- a/controller/pinctrl.c >>> +++ b/controller/pinctrl.c >>> @@ -4134,9 +4134,13 @@ pinctrl_handle_put_mac_binding(const struct flow >> *md, >>> memcpy(&ip_key, &ip6, sizeof ip_key); >>> } >>> >>> + /* If the ARP reply was unicast we should not delay it, >>> + * there won't be any race. */ >>> + bool is_multicast = eth_addr_is_multicast(headers->dl_dst); >>> struct mac_binding *mb = ovn_mac_binding_add(&put_mac_bindings, >> dp_key, >>> port_key, &ip_key, >>> - headers->dl_src); >>> + headers->dl_src, >>> + is_multicast); >> >> It's mainly a personal preference but I'd rewrite this as: >> >> uint32_t max_delay_ms = eth_addr_is_multicast(headers->dl_dst) >> ? MAX_MAC_BINDING_DELAY_MSEC >> : 0; >> truct mac_binding *mb = ovn_mac_binding_add(&put_mac_bindings, dp_key, >> port_key, &ip_key, >> headers->dl_src, >> max_delay_ms); >> >> > Same as the above, I would not expose the max_delay unless there is benefit > of doing that. > > Ok, I'd at least change 'bool is_multicast' to 'bool is_unicast' to match the comment above it. It could be: bool is_unicast = !eth_addr_is_multicast(headers->dl_dst); > >>> if (!mb) { >>> COVERAGE_INC(pinctrl_drop_put_mac_binding); >>> return; >>> @@ -4296,14 +4300,23 @@ run_put_mac_bindings(struct ovsdb_idl_txn >> *ovnsb_idl_txn, >>> return; >>> } >>> >>> - const struct mac_binding *mb; >>> - HMAP_FOR_EACH (mb, hmap_node, &put_mac_bindings) { >>> - run_put_mac_binding(ovnsb_idl_txn, >> sbrec_datapath_binding_by_key, >>> - sbrec_port_binding_by_key, >>> - sbrec_mac_binding_by_lport_ip, >>> - mb); >>> + long long now = time_msec(); >>> + >>> + struct mac_binding *mb; >>> + HMAP_FOR_EACH_SAFE (mb, hmap_node, &put_mac_bindings) { >>> + long long elapsed = now - mb->created; >>> + if (elapsed >= mb->delay) { >>> + run_put_mac_binding(ovnsb_idl_txn, >>> + sbrec_datapath_binding_by_key, >>> + sbrec_port_binding_by_key, >>> + sbrec_mac_binding_by_lport_ip, >>> + mb); >>> + hmap_remove(&put_mac_bindings, &mb->hmap_node); >>> + free(mb); >> >> We used to call ovn_mac_bindings_flush() to remove all elements, now we >> remove them in the caller. I think this breaks a bit the encapsulation >> that mac-learn.[hc] tried to add. Can we add a new api, e.g., >> ovn_mac_binding_remove(), that will remove the entry from the map and >> free it? >> > > Makes sense, done. > > >> >> Now that we're here, what if the SB commit fails for some other >> unrelated reason? We'll force a new ARP request for future packets. It >> used to be like that before your patch too. I'm just wondering if we >> need to flush 'put_mac_bindings' only after we know for sure that the SB >> record was created. Definitely something that should be addressed (if >> we go that way) in a different patch though. >> > > We could investigate that, however I am afraid that it would be hard to > propagate > as the transaction error happens in ovsdb_idl_loop_commit_and_wait, in the > main context. > Ok, we can revisit this in the future if needed. > >> >>> + } else { >>> + mb->delay -= elapsed; >> >> Hmm, isn't this wrong? We don't need the "else", do we? >> > > We actually do, consider having 2 mb waiting to be added one with 2 ms, > second with 5 ms. > Without removing what was elapsed the second would wait 7 ms instead, some > of them could > wait way longer than the max delay. > Hmm, ok, but then you need to update created_ms too. Otherwise, next time we run this loop we will be potentially committing mac bindings too early. What if instead of storing the commit_delay_ms we store the expected time to commit? And instead of using poll_timer_wait() we could use poll_timer_wait_until(). Thanks, Dumitru _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
