On Fri, Aug 11, 2023 at 6:23 AM Dumitru Ceara <dce...@redhat.com> wrote: > > On 8/1/23 12:51, Han Zhou wrote: > > On Tue, Jul 18, 2023 at 6:16 AM Dumitru Ceara <dce...@redhat.com> wrote: > >> > >> On 7/18/23 12:14, Han Zhou wrote: > >>> On Mon, Jul 17, 2023 at 9:51 PM Dumitru Ceara <dce...@redhat.com> wrote: > >>>> > >>>> On 7/14/23 08:41, Ales Musil wrote: > >>>>> On Thu, Jul 13, 2023 at 4:39 PM Dumitru Ceara <dce...@redhat.com> > > wrote: > >>>>> > >>>>>> Regular dp-hash is not a canonical L4 hash (at least with the netlink > >>>>>> datapath). If the datapath supports l4 symmetrical dp-hash use that > >>> one > >>>>>> instead. > >>>>>> > >>>>>> Reported-at: https://github.com/ovn-org/ovn/issues/112 > >>>>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2188679 > >>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com> > >>>>>> > >>>>> > >>>>> Hi Dumitru, > >>>>> > >>>>> > >>>>>> --- > >>>>>> include/ovn/features.h | 2 ++ > >>>>>> lib/actions.c | 6 ++++++ > >>>>>> lib/features.c | 49 > > +++++++++++++++++++++++++++++++++--------- > >>>>>> 3 files changed, 47 insertions(+), 10 deletions(-) > >>>>>> > >>>>>> diff --git a/include/ovn/features.h b/include/ovn/features.h > >>>>>> index de8f1f5489..3bf536127f 100644 > >>>>>> --- a/include/ovn/features.h > >>>>>> +++ b/include/ovn/features.h > >>>>>> @@ -34,12 +34,14 @@ enum ovs_feature_support_bits { > >>>>>> OVS_CT_ZERO_SNAT_SUPPORT_BIT, > >>>>>> OVS_DP_METER_SUPPORT_BIT, > >>>>>> OVS_CT_TUPLE_FLUSH_BIT, > >>>>>> + OVS_DP_HASH_L4_SYM_BIT, > >>>>>> }; > >>>>>> > >>>>>> enum ovs_feature_value { > >>>>>> OVS_CT_ZERO_SNAT_SUPPORT = (1 << OVS_CT_ZERO_SNAT_SUPPORT_BIT), > >>>>>> OVS_DP_METER_SUPPORT = (1 << OVS_DP_METER_SUPPORT_BIT), > >>>>>> OVS_CT_TUPLE_FLUSH_SUPPORT = (1 << OVS_CT_TUPLE_FLUSH_BIT), > >>>>>> + OVS_DP_HASH_L4_SYM_SUPPORT = (1 << OVS_DP_HASH_L4_SYM_BIT), > >>>>>> }; > >>>>>> > >>>>>> void ovs_feature_support_destroy(void); > >>>>>> diff --git a/lib/actions.c b/lib/actions.c > >>>>>> index 037172e606..9d52cb75a8 100644 > >>>>>> --- a/lib/actions.c > >>>>>> +++ b/lib/actions.c > >>>>>> @@ -1625,6 +1625,12 @@ encode_SELECT(const struct ovnact_select > >>> *select, > >>>>>> struct ds ds = DS_EMPTY_INITIALIZER; > >>>>>> ds_put_format(&ds, "type=select,selection_method=dp_hash"); > >>>>>> > >>>>>> + if (ovs_feature_is_supported(OVS_DP_HASH_L4_SYM_SUPPORT)) { > >>>>>> + /* Select dp-hash l4_symmetric by setting the upper 32bits > > of > >>>>>> + * selection_method_param to 1: */ > >>>>>> > >>>>> > >>>>> This comment is a bit unfortunate, because it reads like you want to > > set > >>>>> all bits of the upper half > >>>>> to 1 e.g. 0xffffffff. Maybe change it to: "selection_method_param to > >>> value > >>>>> 1 (1 << 32)." during merge, wdyt? > >>>>> > >>>>> > >>>> > >>>> Makes sense, thanks for the suggestion! I changed it accordingly. > >>>> > >>>>>> + ds_put_cstr(&ds, ",selection_method_param=0x100000000"); > >>>>>> + } > >>>>>> + > >>>>>> struct mf_subfield sf = expr_resolve_field(&select->res_field); > >>>>>> > >>>>>> for (size_t bucket_id = 0; bucket_id < select->n_dsts; > >>> bucket_id++) { > >>>>>> diff --git a/lib/features.c b/lib/features.c > >>>>>> index 97c9c86f00..d24e8f6c5c 100644 > >>>>>> --- a/lib/features.c > >>>>>> +++ b/lib/features.c > >>>>>> @@ -21,6 +21,7 @@ > >>>>>> #include "lib/dirs.h" > >>>>>> #include "socket-util.h" > >>>>>> #include "lib/vswitch-idl.h" > >>>>>> +#include "odp-netlink.h" > >>>>>> #include "openvswitch/vlog.h" > >>>>>> #include "openvswitch/ofpbuf.h" > >>>>>> #include "openvswitch/rconn.h" > >>>>>> @@ -33,20 +34,48 @@ VLOG_DEFINE_THIS_MODULE(features); > >>>>>> > >>>>>> #define FEATURES_DEFAULT_PROBE_INTERVAL_SEC 5 > >>>>>> > >>>>>> +/* Parses 'cap_name' from 'ovs_capabilities' and returns whether the > >>>>>> + * type of capability is supported or not. */ > >>>>>> +typedef bool ovs_feature_parse_func(const struct smap > >>> *ovs_capabilities, > >>>>>> + const char *cap_name); > >>>>>> + > >>>>>> struct ovs_feature { > >>>>>> enum ovs_feature_value value; > >>>>>> const char *name; > >>>>>> + ovs_feature_parse_func *parse; > >>>>>> }; > >>>>>> > >>>>>> +static bool > >>>>>> +bool_parser(const struct smap *ovs_capabilities, const char > > *cap_name) > >>>>>> +{ > >>>>>> + return smap_get_bool(ovs_capabilities, cap_name, false); > >>>>>> +} > >>>>>> + > >>>>>> +static bool > >>>>>> +dp_hash_l4_sym_support_parser(const struct smap *ovs_capabilities, > >>>>>> + const char *cap_name OVS_UNUSED) > >>>>>> +{ > >>>>>> + int max_hash_alg = smap_get_int(ovs_capabilities, > > "max_hash_alg", > >>> 0); > >>>>>> + > >>>>>> + return max_hash_alg == OVS_HASH_ALG_SYM_L4; > >>>>>> +} > >>>>>> + > >>>>>> static struct ovs_feature all_ovs_features[] = { > >>>>>> { > >>>>>> .value = OVS_CT_ZERO_SNAT_SUPPORT, > >>>>>> - .name = "ct_zero_snat" > >>>>>> + .name = "ct_zero_snat", > >>>>>> + .parse = bool_parser, > >>>>>> }, > >>>>>> { > >>>>>> .value = OVS_CT_TUPLE_FLUSH_SUPPORT, > >>>>>> - .name = "ct_flush" > >>>>>> - } > >>>>>> + .name = "ct_flush", > >>>>>> + .parse = bool_parser, > >>>>>> + }, > >>>>>> + { > >>>>>> + .value = OVS_DP_HASH_L4_SYM_SUPPORT, > >>>>>> + .name = "dp_hash_l4_sym_support", > >>>>>> + .parse = dp_hash_l4_sym_support_parser, > >>>>>> + }, > >>>>>> }; > >>>>>> > >>>>>> /* A bitmap of OVS features that have been detected as 'supported'. > > */ > >>>>>> @@ -65,6 +94,7 @@ ovs_feature_is_valid(enum ovs_feature_value > > feature) > >>>>>> case OVS_CT_ZERO_SNAT_SUPPORT: > >>>>>> case OVS_DP_METER_SUPPORT: > >>>>>> case OVS_CT_TUPLE_FLUSH_SUPPORT: > >>>>>> + case OVS_DP_HASH_L4_SYM_SUPPORT: > >>>>>> return true; > >>>>>> default: > >>>>>> return false; > >>>>>> @@ -183,18 +213,17 @@ ovs_feature_support_run(const struct smap > >>>>>> *ovs_capabilities, > >>>>>> } > >>>>>> > >>>>>> for (size_t i = 0; i < ARRAY_SIZE(all_ovs_features); i++) { > >>>>>> - enum ovs_feature_value value = all_ovs_features[i].value; > >>>>>> - const char *name = all_ovs_features[i].name; > >>>>>> - bool old_state = supported_ovs_features & value; > >>>>>> - bool new_state = smap_get_bool(ovs_capabilities, name, > > false); > >>>>>> + struct ovs_feature *feature = &all_ovs_features[i]; > >>>>>> + bool old_state = supported_ovs_features & feature->value; > >>>>>> + bool new_state = feature->parse(ovs_capabilities, > >>> feature->name); > >>>>>> if (new_state != old_state) { > >>>>>> updated = true; > >>>>>> if (new_state) { > >>>>>> - supported_ovs_features |= value; > >>>>>> + supported_ovs_features |= feature->value; > >>>>>> } else { > >>>>>> - supported_ovs_features &= ~value; > >>>>>> + supported_ovs_features &= ~feature->value; > >>>>>> } > >>>>>> - VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", name, > >>>>>> + VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", > >>> feature->name, > >>>>>> new_state ? "supported" : "not supported"); > >>>>>> } > >>>>>> } > >>>>>> -- > >>>>>> 2.31.1 > >>>>>> > >>>>>> _______________________________________________ > >>>>>> dev mailing list > >>>>>> d...@openvswitch.org > >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>>>>> > >>>>>> > >>>>> Other than that it looks good. > >>>>> > >>>>> Acked-by: Ales Musil <amu...@redhat.com> > >>>>> > >>>> > >>>> Thanks, I applied this to main. > >>>> > >>>> I was wondering if we should backport this though. It's not exactly a > >>>> bug fix but it does avoid using a broken datapath feature (when > >>>> possible) so it avoids a broken behavior. > >>>> > >>>> The changes are quite contained. > >>>> > >>>> CC-ing some of the maintainers explicitly to get some more eyes on > > this. > >>>> > >>> Thanks Dumitru. Sorry that I was following the reported issue. I tried > > to > >>> look for some more details, but I am not authorized to access bug > > 2175716. > >> > >> Sorry about that, that bug used to be public but some internal policies > >> made it private. > >> > >>> It is not quite clear to me what the impact of this problem is. The OVN > >>> issue (112) seems to be a very common configuration of OVN ECMP and I > >>> assume the problem is not easily reproduced because I have been > >>> testing/deploying the feature and it worked as expected. Do you have > > more > >>> details about the trigger? Are there any related fixes in OVS and > > kernel? > >>> > >> > >> The original bug report came from OpenShift: > >> https://issues.redhat.com/browse/OCPBUGS-7406 > >> > >> CC-ing Patryk as he root caused the original issue. > >> > >> The simplified topology is: > >> +--- GW1 ---+ > >> client ---- cluster-router ---+ +--- (out of OVN) --- server > >> +--- GW2 ---+ > >> > >> On cluster-router there's an ECMP route to reach 'server' via: > >> - GW1 > >> - GW2 > >> > >> On GW1 traffic going out of OVN is SNATed to IP1. > >> On GW2 traffic going out of OVN is SNATed to IP2. > >> > >> The client initiates a TCP session towards the server; a path is chosen > >> based on dp-hash, e.g. GW2. > >> > >> From this point on, during normal operation, traffic on this session > >> flows fine via GW2. > >> > >> If the client closes the connection, then for the closing connection > >> there's an exception when computing the dp-hash if the socket is in > >> state TIME_WAIT, substate TCP_FIN_WAIT_2: > >> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L2218 > >> In this last case we call: > >> > >> tcp_v4_timewait_ack(sk, skb) > >> | > >> tcp_v4_send_ack(sk, skb, ...) > >> > >> The latter uses a global, per-cpu "ctl_sk" to build the last ACK packet: > >> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L925 > >> > >> ctl_sk = this_cpu_read(ipv4_tcp_sk); > >> sock_net_set(ctl_sk, net); > >> ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ? > >> inet_twsk(sk)->tw_mark : sk->sk_mark; > >> ctl_sk->sk_priority = (sk->sk_state == TCP_TIME_WAIT) ? > >> inet_twsk(sk)->tw_priority : sk->sk_priority; > >> transmit_time = tcp_transmit_time(sk); > >> ip_send_unicast_reply(ctl_sk, > >> skb, &TCP_SKB_CB(skb)->header.h4.opt, > >> ip_hdr(skb)->saddr, ip_hdr(skb)->daddr, > >> &arg, arg.iov[0].iov_len, > >> transmit_time); > >> > >> As ctl_sk's sk_txhash is 0, when transmitting this last ack, its hash is > >> computed from the packet contents: > >> > > https://elixir.bootlin.com/linux/v6.2.1/source/include/linux/skbuff.h#L1540 > >> > >> static inline __u32 skb_get_hash(struct sk_buff *skb) > >> { > >> if (!skb->l4_hash && !skb->sw_hash) > >> __skb_get_hash(skb); > >> > >> return skb->hash; > >> } > >> [...] > >> void __skb_get_hash(struct sk_buff *skb) > >> { > >> struct flow_keys keys; > >> u32 hash; > >> > >> __flow_hash_secret_init(); > >> > >> hash = ___skb_get_hash(skb, &keys, &hashrnd); > >> > >> __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys)); > >> } > >> > >> But the original socket's txhash was actually a random hash: > >> https://elixir.bootlin.com/linux/v6.2.1/source/include/net/sock.h#L2128 > >> > >> static inline void sk_set_txhash(struct sock *sk) > >> { > >> /* This pairs with READ_ONCE() in skb_set_hash_from_sk() */ > >> WRITE_ONCE(sk->sk_txhash, net_tx_rndhash()); > >> } > >> > >> This means that the last ack might enter the ovs datapath with a > >> different dp_hash than all other packets sent on that session in that > >> direction. It's possible that this last ack gets hashed to a different > >> path of the ECMP route. > >> > >> This specific issue was fixed in the kernel through: > >> > >> > > https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#mc84997c4cc755354e5527b41b7a15869536a21f4 > >> > >> However, while debugging we realized the problem is actually visible > >> with _any_ retransmit. That's because a retransmit will trigger a rehash: > >> > >> https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c#L4124 > >> > >> int tcp_rtx_synack(const struct sock *sk, struct request_sock *req) > >> { > >> const struct tcp_request_sock_ops *af_ops = > > tcp_rsk(req)->af_specific; > >> struct flowi fl; > >> int res; > >> > >> /* Paired with WRITE_ONCE() in sock_setsockopt() */ > >> if (READ_ONCE(sk->sk_txrehash) == SOCK_TXREHASH_ENABLED) > >> tcp_rsk(req)->txhash = net_tx_rndhash(); > >> res = af_ops->send_synack(sk, NULL, &fl, req, NULL, > > TCP_SYNACK_NORMAL, > >> NULL); > >> [...] > >> > >> So suddenly all subsequent packets (skbs) will have a different 'hash' > >> value and might get hashed to a different path. > >> > >> This last part was not fixed. After a discussion on the netdev mailing > >> list the conclusion was to not change the hash behavior for locally > >> terminated TCP sessions: > >> > >> > > https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#m1fd8da6d912d5acead66c4c397396e6ac4b160dd > >> > >> Eventually the documentation change patch was dropped and a v2 of the > >> other fixes was accepted: > >> https://www.spinics.net/lists/netdev/msg906252.html > >> > >> The rehash on retransmit problem is still present though making the > >> kernel's dp-hash implementation unusable. > >> > >> During some internal discussions Aaron (in CC) pointed out that OVS > >> supports L4 symmetric hashing and that should be a stable L4 hash. OVS > >> already reports the maximum supported hashing algorithm through the > >> Datapath.Capabilities.max_hash_alg ovsdb field. The only problem was > >> that the kernel datapath didn't support L4 symmetric hashing. But Aaron > >> posted a patch to implement that and the patch got accepted: > >> > >> https://www.spinics.net/lists/netdev/msg911045.html > >> > >> Then we moved OVN to use L4 symmetric hashing if available. That's to > >> ensure correctness of the packet processing. We did run performance > >> tests and the hit due to the extra hashing was acceptable, max 3% > >> throughput degradation in OVN scenarios that simulate common traffic > >> patterns (like ovn-kubernetes ones) - in most cases the performance > >> degradation was around 2%. > >> > >> Sorry for the long story but I hope it provides all the required > >> context. Thinking more about it I should've probably added some of this > >> to the original commit log, I apologize. > >> > >> Regards, > >> Dumitru > >> > > > > Thanks Dumitru for the details! It is now quite clear and I am not against > > backporting it. > > But I have a question regarding upgrading. For the existing traffic, does > > this mean the packets may choose a different path after OVN upgrading? > > Should this be called out? > > > > Hi Han, > > That's a good question. They might and this affects all traffic that's > not locally originated for the kernel datapath and (potentially) all > traffic for the userspace datapath. I can post a follow up patch to > update the NEWS file. Did you have anything else in mind? > Sounds good. I don't have anything else.
Thanks, Han > Regards, > Dumitru > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev