On 8/12/23 07:19, Han Zhou wrote:
> On Fri, Aug 11, 2023 at 6:23 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> On 8/1/23 12:51, Han Zhou wrote:
>>> On Tue, Jul 18, 2023 at 6:16 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>
>>>> On 7/18/23 12:14, Han Zhou wrote:
>>>>> On Mon, Jul 17, 2023 at 9:51 PM Dumitru Ceara <dce...@redhat.com>
> wrote:
>>>>>>
>>>>>> On 7/14/23 08:41, Ales Musil wrote:
>>>>>>> On Thu, Jul 13, 2023 at 4:39 PM Dumitru Ceara <dce...@redhat.com>
>>> wrote:
>>>>>>>
>>>>>>>> Regular dp-hash is not a canonical L4 hash (at least with the
> netlink
>>>>>>>> datapath).  If the datapath supports l4 symmetrical dp-hash use
> that
>>>>> one
>>>>>>>> instead.
>>>>>>>>
>>>>>>>> Reported-at: https://github.com/ovn-org/ovn/issues/112
>>>>>>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2188679
>>>>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
>>>>>>>>
>>>>>>>
>>>>>>> Hi Dumitru,
>>>>>>>
>>>>>>>
>>>>>>>> ---
>>>>>>>>  include/ovn/features.h |  2 ++
>>>>>>>>  lib/actions.c          |  6 ++++++
>>>>>>>>  lib/features.c         | 49
>>> +++++++++++++++++++++++++++++++++---------
>>>>>>>>  3 files changed, 47 insertions(+), 10 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/ovn/features.h b/include/ovn/features.h
>>>>>>>> index de8f1f5489..3bf536127f 100644
>>>>>>>> --- a/include/ovn/features.h
>>>>>>>> +++ b/include/ovn/features.h
>>>>>>>> @@ -34,12 +34,14 @@ enum ovs_feature_support_bits {
>>>>>>>>      OVS_CT_ZERO_SNAT_SUPPORT_BIT,
>>>>>>>>      OVS_DP_METER_SUPPORT_BIT,
>>>>>>>>      OVS_CT_TUPLE_FLUSH_BIT,
>>>>>>>> +    OVS_DP_HASH_L4_SYM_BIT,
>>>>>>>>  };
>>>>>>>>
>>>>>>>>  enum ovs_feature_value {
>>>>>>>>      OVS_CT_ZERO_SNAT_SUPPORT = (1 <<
> OVS_CT_ZERO_SNAT_SUPPORT_BIT),
>>>>>>>>      OVS_DP_METER_SUPPORT = (1 << OVS_DP_METER_SUPPORT_BIT),
>>>>>>>>      OVS_CT_TUPLE_FLUSH_SUPPORT = (1 << OVS_CT_TUPLE_FLUSH_BIT),
>>>>>>>> +    OVS_DP_HASH_L4_SYM_SUPPORT = (1 << OVS_DP_HASH_L4_SYM_BIT),
>>>>>>>>  };
>>>>>>>>
>>>>>>>>  void ovs_feature_support_destroy(void);
>>>>>>>> diff --git a/lib/actions.c b/lib/actions.c
>>>>>>>> index 037172e606..9d52cb75a8 100644
>>>>>>>> --- a/lib/actions.c
>>>>>>>> +++ b/lib/actions.c
>>>>>>>> @@ -1625,6 +1625,12 @@ encode_SELECT(const struct ovnact_select
>>>>> *select,
>>>>>>>>      struct ds ds = DS_EMPTY_INITIALIZER;
>>>>>>>>      ds_put_format(&ds, "type=select,selection_method=dp_hash");
>>>>>>>>
>>>>>>>> +    if (ovs_feature_is_supported(OVS_DP_HASH_L4_SYM_SUPPORT)) {
>>>>>>>> +        /* Select dp-hash l4_symmetric by setting the upper 32bits
>>> of
>>>>>>>> +         * selection_method_param to 1: */
>>>>>>>>
>>>>>>>
>>>>>>> This comment is a bit unfortunate, because it reads like you want to
>>> set
>>>>>>> all bits of the upper half
>>>>>>> to 1 e.g. 0xffffffff. Maybe change it to: "selection_method_param to
>>>>> value
>>>>>>> 1 (1 << 32)." during merge, wdyt?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Makes sense, thanks for the suggestion!  I changed it accordingly.
>>>>>>
>>>>>>>> +        ds_put_cstr(&ds, ",selection_method_param=0x100000000");
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>      struct mf_subfield sf =
> expr_resolve_field(&select->res_field);
>>>>>>>>
>>>>>>>>      for (size_t bucket_id = 0; bucket_id < select->n_dsts;
>>>>> bucket_id++) {
>>>>>>>> diff --git a/lib/features.c b/lib/features.c
>>>>>>>> index 97c9c86f00..d24e8f6c5c 100644
>>>>>>>> --- a/lib/features.c
>>>>>>>> +++ b/lib/features.c
>>>>>>>> @@ -21,6 +21,7 @@
>>>>>>>>  #include "lib/dirs.h"
>>>>>>>>  #include "socket-util.h"
>>>>>>>>  #include "lib/vswitch-idl.h"
>>>>>>>> +#include "odp-netlink.h"
>>>>>>>>  #include "openvswitch/vlog.h"
>>>>>>>>  #include "openvswitch/ofpbuf.h"
>>>>>>>>  #include "openvswitch/rconn.h"
>>>>>>>> @@ -33,20 +34,48 @@ VLOG_DEFINE_THIS_MODULE(features);
>>>>>>>>
>>>>>>>>  #define FEATURES_DEFAULT_PROBE_INTERVAL_SEC 5
>>>>>>>>
>>>>>>>> +/* Parses 'cap_name' from 'ovs_capabilities' and returns whether
> the
>>>>>>>> + * type of capability is supported or not. */
>>>>>>>> +typedef bool ovs_feature_parse_func(const struct smap
>>>>> *ovs_capabilities,
>>>>>>>> +                                    const char *cap_name);
>>>>>>>> +
>>>>>>>>  struct ovs_feature {
>>>>>>>>      enum ovs_feature_value value;
>>>>>>>>      const char *name;
>>>>>>>> +    ovs_feature_parse_func *parse;
>>>>>>>>  };
>>>>>>>>
>>>>>>>> +static bool
>>>>>>>> +bool_parser(const struct smap *ovs_capabilities, const char
>>> *cap_name)
>>>>>>>> +{
>>>>>>>> +    return smap_get_bool(ovs_capabilities, cap_name, false);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static bool
>>>>>>>> +dp_hash_l4_sym_support_parser(const struct smap *ovs_capabilities,
>>>>>>>> +                              const char *cap_name OVS_UNUSED)
>>>>>>>> +{
>>>>>>>> +    int max_hash_alg = smap_get_int(ovs_capabilities,
>>> "max_hash_alg",
>>>>> 0);
>>>>>>>> +
>>>>>>>> +    return max_hash_alg == OVS_HASH_ALG_SYM_L4;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static struct ovs_feature all_ovs_features[] = {
>>>>>>>>      {
>>>>>>>>          .value = OVS_CT_ZERO_SNAT_SUPPORT,
>>>>>>>> -        .name = "ct_zero_snat"
>>>>>>>> +        .name = "ct_zero_snat",
>>>>>>>> +        .parse = bool_parser,
>>>>>>>>      },
>>>>>>>>      {
>>>>>>>>          .value = OVS_CT_TUPLE_FLUSH_SUPPORT,
>>>>>>>> -        .name = "ct_flush"
>>>>>>>> -    }
>>>>>>>> +        .name = "ct_flush",
>>>>>>>> +        .parse = bool_parser,
>>>>>>>> +    },
>>>>>>>> +    {
>>>>>>>> +        .value = OVS_DP_HASH_L4_SYM_SUPPORT,
>>>>>>>> +        .name = "dp_hash_l4_sym_support",
>>>>>>>> +        .parse = dp_hash_l4_sym_support_parser,
>>>>>>>> +    },
>>>>>>>>  };
>>>>>>>>
>>>>>>>>  /* A bitmap of OVS features that have been detected as
> 'supported'.
>>> */
>>>>>>>> @@ -65,6 +94,7 @@ ovs_feature_is_valid(enum ovs_feature_value
>>> feature)
>>>>>>>>      case OVS_CT_ZERO_SNAT_SUPPORT:
>>>>>>>>      case OVS_DP_METER_SUPPORT:
>>>>>>>>      case OVS_CT_TUPLE_FLUSH_SUPPORT:
>>>>>>>> +    case OVS_DP_HASH_L4_SYM_SUPPORT:
>>>>>>>>          return true;
>>>>>>>>      default:
>>>>>>>>          return false;
>>>>>>>> @@ -183,18 +213,17 @@ ovs_feature_support_run(const struct smap
>>>>>>>> *ovs_capabilities,
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      for (size_t i = 0; i < ARRAY_SIZE(all_ovs_features); i++) {
>>>>>>>> -        enum ovs_feature_value value = all_ovs_features[i].value;
>>>>>>>> -        const char *name = all_ovs_features[i].name;
>>>>>>>> -        bool old_state = supported_ovs_features & value;
>>>>>>>> -        bool new_state = smap_get_bool(ovs_capabilities, name,
>>> false);
>>>>>>>> +        struct ovs_feature *feature = &all_ovs_features[i];
>>>>>>>> +        bool old_state = supported_ovs_features & feature->value;
>>>>>>>> +        bool new_state = feature->parse(ovs_capabilities,
>>>>> feature->name);
>>>>>>>>          if (new_state != old_state) {
>>>>>>>>              updated = true;
>>>>>>>>              if (new_state) {
>>>>>>>> -                supported_ovs_features |= value;
>>>>>>>> +                supported_ovs_features |= feature->value;
>>>>>>>>              } else {
>>>>>>>> -                supported_ovs_features &= ~value;
>>>>>>>> +                supported_ovs_features &= ~feature->value;
>>>>>>>>              }
>>>>>>>> -            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", name,
>>>>>>>> +            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s",
>>>>> feature->name,
>>>>>>>>                           new_state ? "supported" : "not
> supported");
>>>>>>>>          }
>>>>>>>>      }
>>>>>>>> --
>>>>>>>> 2.31.1
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> dev mailing list
>>>>>>>> d...@openvswitch.org
>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>>>>
>>>>>>>>
>>>>>>> Other than that it looks good.
>>>>>>>
>>>>>>> Acked-by: Ales Musil <amu...@redhat.com>
>>>>>>>
>>>>>>
>>>>>> Thanks, I applied this to main.
>>>>>>
>>>>>> I was wondering if we should backport this though.  It's not exactly
> a
>>>>>> bug fix but it does avoid using a broken datapath feature (when
>>>>>> possible) so it avoids a broken behavior.
>>>>>>
>>>>>> The changes are quite contained.
>>>>>>
>>>>>> CC-ing some of the maintainers explicitly to get some more eyes on
>>> this.
>>>>>>
>>>>> Thanks Dumitru. Sorry that I was following the reported issue. I tried
>>> to
>>>>> look for some more details, but I am not authorized to access bug
>>> 2175716.
>>>>
>>>> Sorry about that, that bug used to be public but some internal policies
>>>> made it private.
>>>>
>>>>> It is not quite clear to me what the impact of this problem is. The
> OVN
>>>>> issue (112) seems to be a very common configuration of OVN ECMP and I
>>>>> assume the problem is not easily reproduced because I have been
>>>>> testing/deploying the feature and it worked as expected. Do you have
>>> more
>>>>> details about the trigger? Are there any related fixes in OVS and
>>> kernel?
>>>>>
>>>>
>>>> The original bug report came from OpenShift:
>>>> https://issues.redhat.com/browse/OCPBUGS-7406
>>>>
>>>> CC-ing Patryk as he root caused the original issue.
>>>>
>>>> The simplified topology is:
>>>>                               +--- GW1 ---+
>>>> client ---- cluster-router ---+           +--- (out of OVN) --- server
>>>>                               +--- GW2 ---+
>>>>
>>>> On cluster-router there's an ECMP route to reach 'server' via:
>>>> - GW1
>>>> - GW2
>>>>
>>>> On GW1 traffic going out of OVN is SNATed to IP1.
>>>> On GW2 traffic going out of OVN is SNATed to IP2.
>>>>
>>>> The client initiates a TCP session towards the server; a path is chosen
>>>> based on dp-hash, e.g. GW2.
>>>>
>>>> From this point on, during normal operation, traffic on this session
>>>> flows fine via GW2.
>>>>
>>>> If the client closes the connection, then for the closing connection
>>>> there's an exception when computing the dp-hash if the socket is in
>>>> state TIME_WAIT, substate TCP_FIN_WAIT_2:
>>>>
> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L2218
>>>> In this last case we call:
>>>>
>>>> tcp_v4_timewait_ack(sk, skb)
>>>> |
>>>> tcp_v4_send_ack(sk, skb, ...)
>>>>
>>>> The latter uses a global, per-cpu "ctl_sk" to build the last ACK
> packet:
>>>> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L925
>>>>
>>>>         ctl_sk = this_cpu_read(ipv4_tcp_sk);
>>>>         sock_net_set(ctl_sk, net);
>>>>         ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
>>>>                            inet_twsk(sk)->tw_mark : sk->sk_mark;
>>>>         ctl_sk->sk_priority = (sk->sk_state == TCP_TIME_WAIT) ?
>>>>                            inet_twsk(sk)->tw_priority :
> sk->sk_priority;
>>>>         transmit_time = tcp_transmit_time(sk);
>>>>         ip_send_unicast_reply(ctl_sk,
>>>>                               skb, &TCP_SKB_CB(skb)->header.h4.opt,
>>>>                               ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>>>>                               &arg, arg.iov[0].iov_len,
>>>>                               transmit_time);
>>>>
>>>> As ctl_sk's sk_txhash is 0, when transmitting this last ack, its hash
> is
>>>> computed from the packet contents:
>>>>
>>>
> https://elixir.bootlin.com/linux/v6.2.1/source/include/linux/skbuff.h#L1540
>>>>
>>>> static inline __u32 skb_get_hash(struct sk_buff *skb)
>>>> {
>>>>         if (!skb->l4_hash && !skb->sw_hash)
>>>>                 __skb_get_hash(skb);
>>>>
>>>>         return skb->hash;
>>>> }
>>>> [...]
>>>> void __skb_get_hash(struct sk_buff *skb)
>>>> {
>>>>         struct flow_keys keys;
>>>>         u32 hash;
>>>>
>>>>         __flow_hash_secret_init();
>>>>
>>>>         hash = ___skb_get_hash(skb, &keys, &hashrnd);
>>>>
>>>>         __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
>>>> }
>>>>
>>>> But the original socket's txhash was actually a random hash:
>>>> https://elixir.bootlin.com/linux/v6.2.1/source/include/net/sock.h#L2128
>>>>
>>>> static inline void sk_set_txhash(struct sock *sk)
>>>> {
>>>>         /* This pairs with READ_ONCE() in skb_set_hash_from_sk() */
>>>>         WRITE_ONCE(sk->sk_txhash, net_tx_rndhash());
>>>> }
>>>>
>>>> This means that the last ack might enter the ovs datapath with a
>>>> different dp_hash than all other packets sent on that session in that
>>>> direction. It's possible that this last ack gets hashed to a different
>>>> path of the ECMP route.
>>>>
>>>> This specific issue was fixed in the kernel through:
>>>>
>>>>
>>>
> https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#mc84997c4cc755354e5527b41b7a15869536a21f4
>>>>
>>>> However, while debugging we realized the problem is actually visible
>>>> with _any_ retransmit.  That's because a retransmit will trigger a
> rehash:
>>>>
>>>>
> https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c#L4124
>>>>
>>>> int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
>>>> {
>>>>         const struct tcp_request_sock_ops *af_ops =
>>> tcp_rsk(req)->af_specific;
>>>>         struct flowi fl;
>>>>         int res;
>>>>
>>>>         /* Paired with WRITE_ONCE() in sock_setsockopt() */
>>>>         if (READ_ONCE(sk->sk_txrehash) == SOCK_TXREHASH_ENABLED)
>>>>                 tcp_rsk(req)->txhash = net_tx_rndhash();
>>>>         res = af_ops->send_synack(sk, NULL, &fl, req, NULL,
>>> TCP_SYNACK_NORMAL,
>>>>                                   NULL);
>>>> [...]
>>>>
>>>> So suddenly all subsequent packets (skbs) will have a different 'hash'
>>>> value and might get hashed to a different path.
>>>>
>>>> This last part was not fixed.  After a discussion on the netdev mailing
>>>> list the conclusion was to not change the hash behavior for locally
>>>> terminated TCP sessions:
>>>>
>>>>
>>>
> https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#m1fd8da6d912d5acead66c4c397396e6ac4b160dd
>>>>
>>>> Eventually the documentation change patch was dropped and a v2 of the
>>>> other fixes was accepted:
>>>> https://www.spinics.net/lists/netdev/msg906252.html
>>>>
>>>> The rehash on retransmit problem is still present though making the
>>>> kernel's dp-hash implementation unusable.
>>>>
>>>> During some internal discussions Aaron (in CC) pointed out that OVS
>>>> supports L4 symmetric hashing and that should be a stable L4 hash.  OVS
>>>> already reports the maximum supported hashing algorithm through the
>>>> Datapath.Capabilities.max_hash_alg ovsdb field.  The only problem was
>>>> that the kernel datapath didn't support L4 symmetric hashing.  But
> Aaron
>>>> posted a patch to implement that and the patch got accepted:
>>>>
>>>> https://www.spinics.net/lists/netdev/msg911045.html
>>>>
>>>> Then we moved OVN to use L4 symmetric hashing if available.  That's to
>>>> ensure correctness of the packet processing.  We did run performance
>>>> tests and the hit due to the extra hashing was acceptable, max 3%
>>>> throughput degradation in OVN scenarios that simulate common traffic
>>>> patterns (like ovn-kubernetes ones) - in most cases the performance
>>>> degradation was around 2%.
>>>>
>>>> Sorry for the long story but I hope it provides all the required
>>>> context.  Thinking more about it I should've probably added some of
> this
>>>> to the original commit log, I apologize.
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>
>>> Thanks Dumitru for the details! It is now quite clear and I am not
> against
>>> backporting it.
>>> But I have a question regarding upgrading. For the existing traffic,
> does
>>> this mean the packets may choose a different path after OVN upgrading?
>>> Should this be called out?
>>>
>>
>> Hi Han,
>>
>> That's a good question.  They might and this affects all traffic that's
>> not locally originated for the kernel datapath and (potentially) all
>> traffic for the userspace datapath.  I can post a follow up patch to
>> update the NEWS file.  Did you have anything else in mind?
>>
> Sounds good. I don't have anything else.
> 

I posted
https://patchwork.ozlabs.org/project/ovn/patch/20230814135401.2677054-1-dce...@redhat.com/

Thanks,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to