On 8/15/23 12:39, Dumitru Ceara wrote: > On 8/14/23 15:57, Dumitru Ceara wrote: >> On 8/12/23 07:19, Han Zhou wrote: >>> On Fri, Aug 11, 2023 at 6:23 AM Dumitru Ceara <dce...@redhat.com> wrote: >>>> >>>> On 8/1/23 12:51, Han Zhou wrote: >>>>> On Tue, Jul 18, 2023 at 6:16 AM Dumitru Ceara <dce...@redhat.com> wrote: >>>>>> >>>>>> On 7/18/23 12:14, Han Zhou wrote: >>>>>>> On Mon, Jul 17, 2023 at 9:51 PM Dumitru Ceara <dce...@redhat.com> >>> wrote: >>>>>>>> >>>>>>>> On 7/14/23 08:41, Ales Musil wrote: >>>>>>>>> On Thu, Jul 13, 2023 at 4:39 PM Dumitru Ceara <dce...@redhat.com> >>>>> wrote: >>>>>>>>> >>>>>>>>>> Regular dp-hash is not a canonical L4 hash (at least with the >>> netlink >>>>>>>>>> datapath). If the datapath supports l4 symmetrical dp-hash use >>> that >>>>>>> one >>>>>>>>>> instead. >>>>>>>>>> >>>>>>>>>> Reported-at: https://github.com/ovn-org/ovn/issues/112 >>>>>>>>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2188679 >>>>>>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com> >>>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Dumitru, >>>>>>>>> >>>>>>>>> >>>>>>>>>> --- >>>>>>>>>> include/ovn/features.h | 2 ++ >>>>>>>>>> lib/actions.c | 6 ++++++ >>>>>>>>>> lib/features.c | 49 >>>>> +++++++++++++++++++++++++++++++++--------- >>>>>>>>>> 3 files changed, 47 insertions(+), 10 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/include/ovn/features.h b/include/ovn/features.h >>>>>>>>>> index de8f1f5489..3bf536127f 100644 >>>>>>>>>> --- a/include/ovn/features.h >>>>>>>>>> +++ b/include/ovn/features.h >>>>>>>>>> @@ -34,12 +34,14 @@ enum ovs_feature_support_bits { >>>>>>>>>> OVS_CT_ZERO_SNAT_SUPPORT_BIT, >>>>>>>>>> OVS_DP_METER_SUPPORT_BIT, >>>>>>>>>> OVS_CT_TUPLE_FLUSH_BIT, >>>>>>>>>> + OVS_DP_HASH_L4_SYM_BIT, >>>>>>>>>> }; >>>>>>>>>> >>>>>>>>>> enum ovs_feature_value { >>>>>>>>>> OVS_CT_ZERO_SNAT_SUPPORT = (1 << >>> OVS_CT_ZERO_SNAT_SUPPORT_BIT), >>>>>>>>>> OVS_DP_METER_SUPPORT = (1 << OVS_DP_METER_SUPPORT_BIT), >>>>>>>>>> OVS_CT_TUPLE_FLUSH_SUPPORT = (1 << OVS_CT_TUPLE_FLUSH_BIT), >>>>>>>>>> + OVS_DP_HASH_L4_SYM_SUPPORT = (1 << OVS_DP_HASH_L4_SYM_BIT), >>>>>>>>>> }; >>>>>>>>>> >>>>>>>>>> void ovs_feature_support_destroy(void); >>>>>>>>>> diff --git a/lib/actions.c b/lib/actions.c >>>>>>>>>> index 037172e606..9d52cb75a8 100644 >>>>>>>>>> --- a/lib/actions.c >>>>>>>>>> +++ b/lib/actions.c >>>>>>>>>> @@ -1625,6 +1625,12 @@ encode_SELECT(const struct ovnact_select >>>>>>> *select, >>>>>>>>>> struct ds ds = DS_EMPTY_INITIALIZER; >>>>>>>>>> ds_put_format(&ds, "type=select,selection_method=dp_hash"); >>>>>>>>>> >>>>>>>>>> + if (ovs_feature_is_supported(OVS_DP_HASH_L4_SYM_SUPPORT)) { >>>>>>>>>> + /* Select dp-hash l4_symmetric by setting the upper 32bits >>>>> of >>>>>>>>>> + * selection_method_param to 1: */ >>>>>>>>>> >>>>>>>>> >>>>>>>>> This comment is a bit unfortunate, because it reads like you want to >>>>> set >>>>>>>>> all bits of the upper half >>>>>>>>> to 1 e.g. 0xffffffff. Maybe change it to: "selection_method_param to >>>>>>> value >>>>>>>>> 1 (1 << 32)." during merge, wdyt? >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> Makes sense, thanks for the suggestion! I changed it accordingly. >>>>>>>> >>>>>>>>>> + ds_put_cstr(&ds, ",selection_method_param=0x100000000"); >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> struct mf_subfield sf = >>> expr_resolve_field(&select->res_field); >>>>>>>>>> >>>>>>>>>> for (size_t bucket_id = 0; bucket_id < select->n_dsts; >>>>>>> bucket_id++) { >>>>>>>>>> diff --git a/lib/features.c b/lib/features.c >>>>>>>>>> index 97c9c86f00..d24e8f6c5c 100644 >>>>>>>>>> --- a/lib/features.c >>>>>>>>>> +++ b/lib/features.c >>>>>>>>>> @@ -21,6 +21,7 @@ >>>>>>>>>> #include "lib/dirs.h" >>>>>>>>>> #include "socket-util.h" >>>>>>>>>> #include "lib/vswitch-idl.h" >>>>>>>>>> +#include "odp-netlink.h" >>>>>>>>>> #include "openvswitch/vlog.h" >>>>>>>>>> #include "openvswitch/ofpbuf.h" >>>>>>>>>> #include "openvswitch/rconn.h" >>>>>>>>>> @@ -33,20 +34,48 @@ VLOG_DEFINE_THIS_MODULE(features); >>>>>>>>>> >>>>>>>>>> #define FEATURES_DEFAULT_PROBE_INTERVAL_SEC 5 >>>>>>>>>> >>>>>>>>>> +/* Parses 'cap_name' from 'ovs_capabilities' and returns whether >>> the >>>>>>>>>> + * type of capability is supported or not. */ >>>>>>>>>> +typedef bool ovs_feature_parse_func(const struct smap >>>>>>> *ovs_capabilities, >>>>>>>>>> + const char *cap_name); >>>>>>>>>> + >>>>>>>>>> struct ovs_feature { >>>>>>>>>> enum ovs_feature_value value; >>>>>>>>>> const char *name; >>>>>>>>>> + ovs_feature_parse_func *parse; >>>>>>>>>> }; >>>>>>>>>> >>>>>>>>>> +static bool >>>>>>>>>> +bool_parser(const struct smap *ovs_capabilities, const char >>>>> *cap_name) >>>>>>>>>> +{ >>>>>>>>>> + return smap_get_bool(ovs_capabilities, cap_name, false); >>>>>>>>>> +} >>>>>>>>>> + >>>>>>>>>> +static bool >>>>>>>>>> +dp_hash_l4_sym_support_parser(const struct smap *ovs_capabilities, >>>>>>>>>> + const char *cap_name OVS_UNUSED) >>>>>>>>>> +{ >>>>>>>>>> + int max_hash_alg = smap_get_int(ovs_capabilities, >>>>> "max_hash_alg", >>>>>>> 0); >>>>>>>>>> + >>>>>>>>>> + return max_hash_alg == OVS_HASH_ALG_SYM_L4; >>>>>>>>>> +} >>>>>>>>>> + >>>>>>>>>> static struct ovs_feature all_ovs_features[] = { >>>>>>>>>> { >>>>>>>>>> .value = OVS_CT_ZERO_SNAT_SUPPORT, >>>>>>>>>> - .name = "ct_zero_snat" >>>>>>>>>> + .name = "ct_zero_snat", >>>>>>>>>> + .parse = bool_parser, >>>>>>>>>> }, >>>>>>>>>> { >>>>>>>>>> .value = OVS_CT_TUPLE_FLUSH_SUPPORT, >>>>>>>>>> - .name = "ct_flush" >>>>>>>>>> - } >>>>>>>>>> + .name = "ct_flush", >>>>>>>>>> + .parse = bool_parser, >>>>>>>>>> + }, >>>>>>>>>> + { >>>>>>>>>> + .value = OVS_DP_HASH_L4_SYM_SUPPORT, >>>>>>>>>> + .name = "dp_hash_l4_sym_support", >>>>>>>>>> + .parse = dp_hash_l4_sym_support_parser, >>>>>>>>>> + }, >>>>>>>>>> }; >>>>>>>>>> >>>>>>>>>> /* A bitmap of OVS features that have been detected as >>> 'supported'. >>>>> */ >>>>>>>>>> @@ -65,6 +94,7 @@ ovs_feature_is_valid(enum ovs_feature_value >>>>> feature) >>>>>>>>>> case OVS_CT_ZERO_SNAT_SUPPORT: >>>>>>>>>> case OVS_DP_METER_SUPPORT: >>>>>>>>>> case OVS_CT_TUPLE_FLUSH_SUPPORT: >>>>>>>>>> + case OVS_DP_HASH_L4_SYM_SUPPORT: >>>>>>>>>> return true; >>>>>>>>>> default: >>>>>>>>>> return false; >>>>>>>>>> @@ -183,18 +213,17 @@ ovs_feature_support_run(const struct smap >>>>>>>>>> *ovs_capabilities, >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> for (size_t i = 0; i < ARRAY_SIZE(all_ovs_features); i++) { >>>>>>>>>> - enum ovs_feature_value value = all_ovs_features[i].value; >>>>>>>>>> - const char *name = all_ovs_features[i].name; >>>>>>>>>> - bool old_state = supported_ovs_features & value; >>>>>>>>>> - bool new_state = smap_get_bool(ovs_capabilities, name, >>>>> false); >>>>>>>>>> + struct ovs_feature *feature = &all_ovs_features[i]; >>>>>>>>>> + bool old_state = supported_ovs_features & feature->value; >>>>>>>>>> + bool new_state = feature->parse(ovs_capabilities, >>>>>>> feature->name); >>>>>>>>>> if (new_state != old_state) { >>>>>>>>>> updated = true; >>>>>>>>>> if (new_state) { >>>>>>>>>> - supported_ovs_features |= value; >>>>>>>>>> + supported_ovs_features |= feature->value; >>>>>>>>>> } else { >>>>>>>>>> - supported_ovs_features &= ~value; >>>>>>>>>> + supported_ovs_features &= ~feature->value; >>>>>>>>>> } >>>>>>>>>> - VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", name, >>>>>>>>>> + VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", >>>>>>> feature->name, >>>>>>>>>> new_state ? "supported" : "not >>> supported"); >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> -- >>>>>>>>>> 2.31.1 >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> dev mailing list >>>>>>>>>> d...@openvswitch.org >>>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>>>>>>>>> >>>>>>>>>> >>>>>>>>> Other than that it looks good. >>>>>>>>> >>>>>>>>> Acked-by: Ales Musil <amu...@redhat.com> >>>>>>>>> >>>>>>>> >>>>>>>> Thanks, I applied this to main. >>>>>>>> >>>>>>>> I was wondering if we should backport this though. It's not exactly >>> a >>>>>>>> bug fix but it does avoid using a broken datapath feature (when >>>>>>>> possible) so it avoids a broken behavior. >>>>>>>> >>>>>>>> The changes are quite contained. >>>>>>>> >>>>>>>> CC-ing some of the maintainers explicitly to get some more eyes on >>>>> this. >>>>>>>> >>>>>>> Thanks Dumitru. Sorry that I was following the reported issue. I tried >>>>> to >>>>>>> look for some more details, but I am not authorized to access bug >>>>> 2175716. >>>>>> >>>>>> Sorry about that, that bug used to be public but some internal policies >>>>>> made it private. >>>>>> >>>>>>> It is not quite clear to me what the impact of this problem is. The >>> OVN >>>>>>> issue (112) seems to be a very common configuration of OVN ECMP and I >>>>>>> assume the problem is not easily reproduced because I have been >>>>>>> testing/deploying the feature and it worked as expected. Do you have >>>>> more >>>>>>> details about the trigger? Are there any related fixes in OVS and >>>>> kernel? >>>>>>> >>>>>> >>>>>> The original bug report came from OpenShift: >>>>>> https://issues.redhat.com/browse/OCPBUGS-7406 >>>>>> >>>>>> CC-ing Patryk as he root caused the original issue. >>>>>> >>>>>> The simplified topology is: >>>>>> +--- GW1 ---+ >>>>>> client ---- cluster-router ---+ +--- (out of OVN) --- server >>>>>> +--- GW2 ---+ >>>>>> >>>>>> On cluster-router there's an ECMP route to reach 'server' via: >>>>>> - GW1 >>>>>> - GW2 >>>>>> >>>>>> On GW1 traffic going out of OVN is SNATed to IP1. >>>>>> On GW2 traffic going out of OVN is SNATed to IP2. >>>>>> >>>>>> The client initiates a TCP session towards the server; a path is chosen >>>>>> based on dp-hash, e.g. GW2. >>>>>> >>>>>> From this point on, during normal operation, traffic on this session >>>>>> flows fine via GW2. >>>>>> >>>>>> If the client closes the connection, then for the closing connection >>>>>> there's an exception when computing the dp-hash if the socket is in >>>>>> state TIME_WAIT, substate TCP_FIN_WAIT_2: >>>>>> >>> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L2218 >>>>>> In this last case we call: >>>>>> >>>>>> tcp_v4_timewait_ack(sk, skb) >>>>>> | >>>>>> tcp_v4_send_ack(sk, skb, ...) >>>>>> >>>>>> The latter uses a global, per-cpu "ctl_sk" to build the last ACK >>> packet: >>>>>> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L925 >>>>>> >>>>>> ctl_sk = this_cpu_read(ipv4_tcp_sk); >>>>>> sock_net_set(ctl_sk, net); >>>>>> ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ? >>>>>> inet_twsk(sk)->tw_mark : sk->sk_mark; >>>>>> ctl_sk->sk_priority = (sk->sk_state == TCP_TIME_WAIT) ? >>>>>> inet_twsk(sk)->tw_priority : >>> sk->sk_priority; >>>>>> transmit_time = tcp_transmit_time(sk); >>>>>> ip_send_unicast_reply(ctl_sk, >>>>>> skb, &TCP_SKB_CB(skb)->header.h4.opt, >>>>>> ip_hdr(skb)->saddr, ip_hdr(skb)->daddr, >>>>>> &arg, arg.iov[0].iov_len, >>>>>> transmit_time); >>>>>> >>>>>> As ctl_sk's sk_txhash is 0, when transmitting this last ack, its hash >>> is >>>>>> computed from the packet contents: >>>>>> >>>>> >>> https://elixir.bootlin.com/linux/v6.2.1/source/include/linux/skbuff.h#L1540 >>>>>> >>>>>> static inline __u32 skb_get_hash(struct sk_buff *skb) >>>>>> { >>>>>> if (!skb->l4_hash && !skb->sw_hash) >>>>>> __skb_get_hash(skb); >>>>>> >>>>>> return skb->hash; >>>>>> } >>>>>> [...] >>>>>> void __skb_get_hash(struct sk_buff *skb) >>>>>> { >>>>>> struct flow_keys keys; >>>>>> u32 hash; >>>>>> >>>>>> __flow_hash_secret_init(); >>>>>> >>>>>> hash = ___skb_get_hash(skb, &keys, &hashrnd); >>>>>> >>>>>> __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys)); >>>>>> } >>>>>> >>>>>> But the original socket's txhash was actually a random hash: >>>>>> https://elixir.bootlin.com/linux/v6.2.1/source/include/net/sock.h#L2128 >>>>>> >>>>>> static inline void sk_set_txhash(struct sock *sk) >>>>>> { >>>>>> /* This pairs with READ_ONCE() in skb_set_hash_from_sk() */ >>>>>> WRITE_ONCE(sk->sk_txhash, net_tx_rndhash()); >>>>>> } >>>>>> >>>>>> This means that the last ack might enter the ovs datapath with a >>>>>> different dp_hash than all other packets sent on that session in that >>>>>> direction. It's possible that this last ack gets hashed to a different >>>>>> path of the ECMP route. >>>>>> >>>>>> This specific issue was fixed in the kernel through: >>>>>> >>>>>> >>>>> >>> https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#mc84997c4cc755354e5527b41b7a15869536a21f4 >>>>>> >>>>>> However, while debugging we realized the problem is actually visible >>>>>> with _any_ retransmit. That's because a retransmit will trigger a >>> rehash: >>>>>> >>>>>> >>> https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c#L4124 >>>>>> >>>>>> int tcp_rtx_synack(const struct sock *sk, struct request_sock *req) >>>>>> { >>>>>> const struct tcp_request_sock_ops *af_ops = >>>>> tcp_rsk(req)->af_specific; >>>>>> struct flowi fl; >>>>>> int res; >>>>>> >>>>>> /* Paired with WRITE_ONCE() in sock_setsockopt() */ >>>>>> if (READ_ONCE(sk->sk_txrehash) == SOCK_TXREHASH_ENABLED) >>>>>> tcp_rsk(req)->txhash = net_tx_rndhash(); >>>>>> res = af_ops->send_synack(sk, NULL, &fl, req, NULL, >>>>> TCP_SYNACK_NORMAL, >>>>>> NULL); >>>>>> [...] >>>>>> >>>>>> So suddenly all subsequent packets (skbs) will have a different 'hash' >>>>>> value and might get hashed to a different path. >>>>>> >>>>>> This last part was not fixed. After a discussion on the netdev mailing >>>>>> list the conclusion was to not change the hash behavior for locally >>>>>> terminated TCP sessions: >>>>>> >>>>>> >>>>> >>> https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#m1fd8da6d912d5acead66c4c397396e6ac4b160dd >>>>>> >>>>>> Eventually the documentation change patch was dropped and a v2 of the >>>>>> other fixes was accepted: >>>>>> https://www.spinics.net/lists/netdev/msg906252.html >>>>>> >>>>>> The rehash on retransmit problem is still present though making the >>>>>> kernel's dp-hash implementation unusable. >>>>>> >>>>>> During some internal discussions Aaron (in CC) pointed out that OVS >>>>>> supports L4 symmetric hashing and that should be a stable L4 hash. OVS >>>>>> already reports the maximum supported hashing algorithm through the >>>>>> Datapath.Capabilities.max_hash_alg ovsdb field. The only problem was >>>>>> that the kernel datapath didn't support L4 symmetric hashing. But >>> Aaron >>>>>> posted a patch to implement that and the patch got accepted: >>>>>> >>>>>> https://www.spinics.net/lists/netdev/msg911045.html >>>>>> >>>>>> Then we moved OVN to use L4 symmetric hashing if available. That's to >>>>>> ensure correctness of the packet processing. We did run performance >>>>>> tests and the hit due to the extra hashing was acceptable, max 3% >>>>>> throughput degradation in OVN scenarios that simulate common traffic >>>>>> patterns (like ovn-kubernetes ones) - in most cases the performance >>>>>> degradation was around 2%. >>>>>> >>>>>> Sorry for the long story but I hope it provides all the required >>>>>> context. Thinking more about it I should've probably added some of >>> this >>>>>> to the original commit log, I apologize. >>>>>> >>>>>> Regards, >>>>>> Dumitru >>>>>> >>>>> >>>>> Thanks Dumitru for the details! It is now quite clear and I am not >>> against >>>>> backporting it. >>>>> But I have a question regarding upgrading. For the existing traffic, >>> does >>>>> this mean the packets may choose a different path after OVN upgrading? >>>>> Should this be called out? >>>>> >>>> >>>> Hi Han, >>>> >>>> That's a good question. They might and this affects all traffic that's >>>> not locally originated for the kernel datapath and (potentially) all >>>> traffic for the userspace datapath. I can post a follow up patch to >>>> update the NEWS file. Did you have anything else in mind? >>>> >>> Sounds good. I don't have anything else. >>> >> >> I posted >> https://patchwork.ozlabs.org/project/ovn/patch/20230814135401.2677054-1-dce...@redhat.com/ >> > > I backported the commit and the NEWS update to branches 23.06 and 23.03. > It didn't apply cleanly on older branches but I think that's fine for > now. We can backport it further later if users request it. >
We actually got an internal request to backport this to 22.12 so I resolved the conflicts (nothing major) and pushed these commits to branch-22.12 too. Regards, Dumitru _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev