On 8/14/23 15:57, Dumitru Ceara wrote:
> On 8/12/23 07:19, Han Zhou wrote:
>> On Fri, Aug 11, 2023 at 6:23 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>
>>> On 8/1/23 12:51, Han Zhou wrote:
>>>> On Tue, Jul 18, 2023 at 6:16 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>
>>>>> On 7/18/23 12:14, Han Zhou wrote:
>>>>>> On Mon, Jul 17, 2023 at 9:51 PM Dumitru Ceara <dce...@redhat.com>
>> wrote:
>>>>>>>
>>>>>>> On 7/14/23 08:41, Ales Musil wrote:
>>>>>>>> On Thu, Jul 13, 2023 at 4:39 PM Dumitru Ceara <dce...@redhat.com>
>>>> wrote:
>>>>>>>>
>>>>>>>>> Regular dp-hash is not a canonical L4 hash (at least with the
>> netlink
>>>>>>>>> datapath).  If the datapath supports l4 symmetrical dp-hash use
>> that
>>>>>> one
>>>>>>>>> instead.
>>>>>>>>>
>>>>>>>>> Reported-at: https://github.com/ovn-org/ovn/issues/112
>>>>>>>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2188679
>>>>>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Dumitru,
>>>>>>>>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>  include/ovn/features.h |  2 ++
>>>>>>>>>  lib/actions.c          |  6 ++++++
>>>>>>>>>  lib/features.c         | 49
>>>> +++++++++++++++++++++++++++++++++---------
>>>>>>>>>  3 files changed, 47 insertions(+), 10 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/ovn/features.h b/include/ovn/features.h
>>>>>>>>> index de8f1f5489..3bf536127f 100644
>>>>>>>>> --- a/include/ovn/features.h
>>>>>>>>> +++ b/include/ovn/features.h
>>>>>>>>> @@ -34,12 +34,14 @@ enum ovs_feature_support_bits {
>>>>>>>>>      OVS_CT_ZERO_SNAT_SUPPORT_BIT,
>>>>>>>>>      OVS_DP_METER_SUPPORT_BIT,
>>>>>>>>>      OVS_CT_TUPLE_FLUSH_BIT,
>>>>>>>>> +    OVS_DP_HASH_L4_SYM_BIT,
>>>>>>>>>  };
>>>>>>>>>
>>>>>>>>>  enum ovs_feature_value {
>>>>>>>>>      OVS_CT_ZERO_SNAT_SUPPORT = (1 <<
>> OVS_CT_ZERO_SNAT_SUPPORT_BIT),
>>>>>>>>>      OVS_DP_METER_SUPPORT = (1 << OVS_DP_METER_SUPPORT_BIT),
>>>>>>>>>      OVS_CT_TUPLE_FLUSH_SUPPORT = (1 << OVS_CT_TUPLE_FLUSH_BIT),
>>>>>>>>> +    OVS_DP_HASH_L4_SYM_SUPPORT = (1 << OVS_DP_HASH_L4_SYM_BIT),
>>>>>>>>>  };
>>>>>>>>>
>>>>>>>>>  void ovs_feature_support_destroy(void);
>>>>>>>>> diff --git a/lib/actions.c b/lib/actions.c
>>>>>>>>> index 037172e606..9d52cb75a8 100644
>>>>>>>>> --- a/lib/actions.c
>>>>>>>>> +++ b/lib/actions.c
>>>>>>>>> @@ -1625,6 +1625,12 @@ encode_SELECT(const struct ovnact_select
>>>>>> *select,
>>>>>>>>>      struct ds ds = DS_EMPTY_INITIALIZER;
>>>>>>>>>      ds_put_format(&ds, "type=select,selection_method=dp_hash");
>>>>>>>>>
>>>>>>>>> +    if (ovs_feature_is_supported(OVS_DP_HASH_L4_SYM_SUPPORT)) {
>>>>>>>>> +        /* Select dp-hash l4_symmetric by setting the upper 32bits
>>>> of
>>>>>>>>> +         * selection_method_param to 1: */
>>>>>>>>>
>>>>>>>>
>>>>>>>> This comment is a bit unfortunate, because it reads like you want to
>>>> set
>>>>>>>> all bits of the upper half
>>>>>>>> to 1 e.g. 0xffffffff. Maybe change it to: "selection_method_param to
>>>>>> value
>>>>>>>> 1 (1 << 32)." during merge, wdyt?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Makes sense, thanks for the suggestion!  I changed it accordingly.
>>>>>>>
>>>>>>>>> +        ds_put_cstr(&ds, ",selection_method_param=0x100000000");
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>>      struct mf_subfield sf =
>> expr_resolve_field(&select->res_field);
>>>>>>>>>
>>>>>>>>>      for (size_t bucket_id = 0; bucket_id < select->n_dsts;
>>>>>> bucket_id++) {
>>>>>>>>> diff --git a/lib/features.c b/lib/features.c
>>>>>>>>> index 97c9c86f00..d24e8f6c5c 100644
>>>>>>>>> --- a/lib/features.c
>>>>>>>>> +++ b/lib/features.c
>>>>>>>>> @@ -21,6 +21,7 @@
>>>>>>>>>  #include "lib/dirs.h"
>>>>>>>>>  #include "socket-util.h"
>>>>>>>>>  #include "lib/vswitch-idl.h"
>>>>>>>>> +#include "odp-netlink.h"
>>>>>>>>>  #include "openvswitch/vlog.h"
>>>>>>>>>  #include "openvswitch/ofpbuf.h"
>>>>>>>>>  #include "openvswitch/rconn.h"
>>>>>>>>> @@ -33,20 +34,48 @@ VLOG_DEFINE_THIS_MODULE(features);
>>>>>>>>>
>>>>>>>>>  #define FEATURES_DEFAULT_PROBE_INTERVAL_SEC 5
>>>>>>>>>
>>>>>>>>> +/* Parses 'cap_name' from 'ovs_capabilities' and returns whether
>> the
>>>>>>>>> + * type of capability is supported or not. */
>>>>>>>>> +typedef bool ovs_feature_parse_func(const struct smap
>>>>>> *ovs_capabilities,
>>>>>>>>> +                                    const char *cap_name);
>>>>>>>>> +
>>>>>>>>>  struct ovs_feature {
>>>>>>>>>      enum ovs_feature_value value;
>>>>>>>>>      const char *name;
>>>>>>>>> +    ovs_feature_parse_func *parse;
>>>>>>>>>  };
>>>>>>>>>
>>>>>>>>> +static bool
>>>>>>>>> +bool_parser(const struct smap *ovs_capabilities, const char
>>>> *cap_name)
>>>>>>>>> +{
>>>>>>>>> +    return smap_get_bool(ovs_capabilities, cap_name, false);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static bool
>>>>>>>>> +dp_hash_l4_sym_support_parser(const struct smap *ovs_capabilities,
>>>>>>>>> +                              const char *cap_name OVS_UNUSED)
>>>>>>>>> +{
>>>>>>>>> +    int max_hash_alg = smap_get_int(ovs_capabilities,
>>>> "max_hash_alg",
>>>>>> 0);
>>>>>>>>> +
>>>>>>>>> +    return max_hash_alg == OVS_HASH_ALG_SYM_L4;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  static struct ovs_feature all_ovs_features[] = {
>>>>>>>>>      {
>>>>>>>>>          .value = OVS_CT_ZERO_SNAT_SUPPORT,
>>>>>>>>> -        .name = "ct_zero_snat"
>>>>>>>>> +        .name = "ct_zero_snat",
>>>>>>>>> +        .parse = bool_parser,
>>>>>>>>>      },
>>>>>>>>>      {
>>>>>>>>>          .value = OVS_CT_TUPLE_FLUSH_SUPPORT,
>>>>>>>>> -        .name = "ct_flush"
>>>>>>>>> -    }
>>>>>>>>> +        .name = "ct_flush",
>>>>>>>>> +        .parse = bool_parser,
>>>>>>>>> +    },
>>>>>>>>> +    {
>>>>>>>>> +        .value = OVS_DP_HASH_L4_SYM_SUPPORT,
>>>>>>>>> +        .name = "dp_hash_l4_sym_support",
>>>>>>>>> +        .parse = dp_hash_l4_sym_support_parser,
>>>>>>>>> +    },
>>>>>>>>>  };
>>>>>>>>>
>>>>>>>>>  /* A bitmap of OVS features that have been detected as
>> 'supported'.
>>>> */
>>>>>>>>> @@ -65,6 +94,7 @@ ovs_feature_is_valid(enum ovs_feature_value
>>>> feature)
>>>>>>>>>      case OVS_CT_ZERO_SNAT_SUPPORT:
>>>>>>>>>      case OVS_DP_METER_SUPPORT:
>>>>>>>>>      case OVS_CT_TUPLE_FLUSH_SUPPORT:
>>>>>>>>> +    case OVS_DP_HASH_L4_SYM_SUPPORT:
>>>>>>>>>          return true;
>>>>>>>>>      default:
>>>>>>>>>          return false;
>>>>>>>>> @@ -183,18 +213,17 @@ ovs_feature_support_run(const struct smap
>>>>>>>>> *ovs_capabilities,
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>>      for (size_t i = 0; i < ARRAY_SIZE(all_ovs_features); i++) {
>>>>>>>>> -        enum ovs_feature_value value = all_ovs_features[i].value;
>>>>>>>>> -        const char *name = all_ovs_features[i].name;
>>>>>>>>> -        bool old_state = supported_ovs_features & value;
>>>>>>>>> -        bool new_state = smap_get_bool(ovs_capabilities, name,
>>>> false);
>>>>>>>>> +        struct ovs_feature *feature = &all_ovs_features[i];
>>>>>>>>> +        bool old_state = supported_ovs_features & feature->value;
>>>>>>>>> +        bool new_state = feature->parse(ovs_capabilities,
>>>>>> feature->name);
>>>>>>>>>          if (new_state != old_state) {
>>>>>>>>>              updated = true;
>>>>>>>>>              if (new_state) {
>>>>>>>>> -                supported_ovs_features |= value;
>>>>>>>>> +                supported_ovs_features |= feature->value;
>>>>>>>>>              } else {
>>>>>>>>> -                supported_ovs_features &= ~value;
>>>>>>>>> +                supported_ovs_features &= ~feature->value;
>>>>>>>>>              }
>>>>>>>>> -            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", name,
>>>>>>>>> +            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s",
>>>>>> feature->name,
>>>>>>>>>                           new_state ? "supported" : "not
>> supported");
>>>>>>>>>          }
>>>>>>>>>      }
>>>>>>>>> --
>>>>>>>>> 2.31.1
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> dev mailing list
>>>>>>>>> d...@openvswitch.org
>>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Other than that it looks good.
>>>>>>>>
>>>>>>>> Acked-by: Ales Musil <amu...@redhat.com>
>>>>>>>>
>>>>>>>
>>>>>>> Thanks, I applied this to main.
>>>>>>>
>>>>>>> I was wondering if we should backport this though.  It's not exactly
>> a
>>>>>>> bug fix but it does avoid using a broken datapath feature (when
>>>>>>> possible) so it avoids a broken behavior.
>>>>>>>
>>>>>>> The changes are quite contained.
>>>>>>>
>>>>>>> CC-ing some of the maintainers explicitly to get some more eyes on
>>>> this.
>>>>>>>
>>>>>> Thanks Dumitru. Sorry that I was following the reported issue. I tried
>>>> to
>>>>>> look for some more details, but I am not authorized to access bug
>>>> 2175716.
>>>>>
>>>>> Sorry about that, that bug used to be public but some internal policies
>>>>> made it private.
>>>>>
>>>>>> It is not quite clear to me what the impact of this problem is. The
>> OVN
>>>>>> issue (112) seems to be a very common configuration of OVN ECMP and I
>>>>>> assume the problem is not easily reproduced because I have been
>>>>>> testing/deploying the feature and it worked as expected. Do you have
>>>> more
>>>>>> details about the trigger? Are there any related fixes in OVS and
>>>> kernel?
>>>>>>
>>>>>
>>>>> The original bug report came from OpenShift:
>>>>> https://issues.redhat.com/browse/OCPBUGS-7406
>>>>>
>>>>> CC-ing Patryk as he root caused the original issue.
>>>>>
>>>>> The simplified topology is:
>>>>>                               +--- GW1 ---+
>>>>> client ---- cluster-router ---+           +--- (out of OVN) --- server
>>>>>                               +--- GW2 ---+
>>>>>
>>>>> On cluster-router there's an ECMP route to reach 'server' via:
>>>>> - GW1
>>>>> - GW2
>>>>>
>>>>> On GW1 traffic going out of OVN is SNATed to IP1.
>>>>> On GW2 traffic going out of OVN is SNATed to IP2.
>>>>>
>>>>> The client initiates a TCP session towards the server; a path is chosen
>>>>> based on dp-hash, e.g. GW2.
>>>>>
>>>>> From this point on, during normal operation, traffic on this session
>>>>> flows fine via GW2.
>>>>>
>>>>> If the client closes the connection, then for the closing connection
>>>>> there's an exception when computing the dp-hash if the socket is in
>>>>> state TIME_WAIT, substate TCP_FIN_WAIT_2:
>>>>>
>> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L2218
>>>>> In this last case we call:
>>>>>
>>>>> tcp_v4_timewait_ack(sk, skb)
>>>>> |
>>>>> tcp_v4_send_ack(sk, skb, ...)
>>>>>
>>>>> The latter uses a global, per-cpu "ctl_sk" to build the last ACK
>> packet:
>>>>> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L925
>>>>>
>>>>>         ctl_sk = this_cpu_read(ipv4_tcp_sk);
>>>>>         sock_net_set(ctl_sk, net);
>>>>>         ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
>>>>>                            inet_twsk(sk)->tw_mark : sk->sk_mark;
>>>>>         ctl_sk->sk_priority = (sk->sk_state == TCP_TIME_WAIT) ?
>>>>>                            inet_twsk(sk)->tw_priority :
>> sk->sk_priority;
>>>>>         transmit_time = tcp_transmit_time(sk);
>>>>>         ip_send_unicast_reply(ctl_sk,
>>>>>                               skb, &TCP_SKB_CB(skb)->header.h4.opt,
>>>>>                               ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>>>>>                               &arg, arg.iov[0].iov_len,
>>>>>                               transmit_time);
>>>>>
>>>>> As ctl_sk's sk_txhash is 0, when transmitting this last ack, its hash
>> is
>>>>> computed from the packet contents:
>>>>>
>>>>
>> https://elixir.bootlin.com/linux/v6.2.1/source/include/linux/skbuff.h#L1540
>>>>>
>>>>> static inline __u32 skb_get_hash(struct sk_buff *skb)
>>>>> {
>>>>>         if (!skb->l4_hash && !skb->sw_hash)
>>>>>                 __skb_get_hash(skb);
>>>>>
>>>>>         return skb->hash;
>>>>> }
>>>>> [...]
>>>>> void __skb_get_hash(struct sk_buff *skb)
>>>>> {
>>>>>         struct flow_keys keys;
>>>>>         u32 hash;
>>>>>
>>>>>         __flow_hash_secret_init();
>>>>>
>>>>>         hash = ___skb_get_hash(skb, &keys, &hashrnd);
>>>>>
>>>>>         __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
>>>>> }
>>>>>
>>>>> But the original socket's txhash was actually a random hash:
>>>>> https://elixir.bootlin.com/linux/v6.2.1/source/include/net/sock.h#L2128
>>>>>
>>>>> static inline void sk_set_txhash(struct sock *sk)
>>>>> {
>>>>>         /* This pairs with READ_ONCE() in skb_set_hash_from_sk() */
>>>>>         WRITE_ONCE(sk->sk_txhash, net_tx_rndhash());
>>>>> }
>>>>>
>>>>> This means that the last ack might enter the ovs datapath with a
>>>>> different dp_hash than all other packets sent on that session in that
>>>>> direction. It's possible that this last ack gets hashed to a different
>>>>> path of the ECMP route.
>>>>>
>>>>> This specific issue was fixed in the kernel through:
>>>>>
>>>>>
>>>>
>> https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#mc84997c4cc755354e5527b41b7a15869536a21f4
>>>>>
>>>>> However, while debugging we realized the problem is actually visible
>>>>> with _any_ retransmit.  That's because a retransmit will trigger a
>> rehash:
>>>>>
>>>>>
>> https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c#L4124
>>>>>
>>>>> int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
>>>>> {
>>>>>         const struct tcp_request_sock_ops *af_ops =
>>>> tcp_rsk(req)->af_specific;
>>>>>         struct flowi fl;
>>>>>         int res;
>>>>>
>>>>>         /* Paired with WRITE_ONCE() in sock_setsockopt() */
>>>>>         if (READ_ONCE(sk->sk_txrehash) == SOCK_TXREHASH_ENABLED)
>>>>>                 tcp_rsk(req)->txhash = net_tx_rndhash();
>>>>>         res = af_ops->send_synack(sk, NULL, &fl, req, NULL,
>>>> TCP_SYNACK_NORMAL,
>>>>>                                   NULL);
>>>>> [...]
>>>>>
>>>>> So suddenly all subsequent packets (skbs) will have a different 'hash'
>>>>> value and might get hashed to a different path.
>>>>>
>>>>> This last part was not fixed.  After a discussion on the netdev mailing
>>>>> list the conclusion was to not change the hash behavior for locally
>>>>> terminated TCP sessions:
>>>>>
>>>>>
>>>>
>> https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#m1fd8da6d912d5acead66c4c397396e6ac4b160dd
>>>>>
>>>>> Eventually the documentation change patch was dropped and a v2 of the
>>>>> other fixes was accepted:
>>>>> https://www.spinics.net/lists/netdev/msg906252.html
>>>>>
>>>>> The rehash on retransmit problem is still present though making the
>>>>> kernel's dp-hash implementation unusable.
>>>>>
>>>>> During some internal discussions Aaron (in CC) pointed out that OVS
>>>>> supports L4 symmetric hashing and that should be a stable L4 hash.  OVS
>>>>> already reports the maximum supported hashing algorithm through the
>>>>> Datapath.Capabilities.max_hash_alg ovsdb field.  The only problem was
>>>>> that the kernel datapath didn't support L4 symmetric hashing.  But
>> Aaron
>>>>> posted a patch to implement that and the patch got accepted:
>>>>>
>>>>> https://www.spinics.net/lists/netdev/msg911045.html
>>>>>
>>>>> Then we moved OVN to use L4 symmetric hashing if available.  That's to
>>>>> ensure correctness of the packet processing.  We did run performance
>>>>> tests and the hit due to the extra hashing was acceptable, max 3%
>>>>> throughput degradation in OVN scenarios that simulate common traffic
>>>>> patterns (like ovn-kubernetes ones) - in most cases the performance
>>>>> degradation was around 2%.
>>>>>
>>>>> Sorry for the long story but I hope it provides all the required
>>>>> context.  Thinking more about it I should've probably added some of
>> this
>>>>> to the original commit log, I apologize.
>>>>>
>>>>> Regards,
>>>>> Dumitru
>>>>>
>>>>
>>>> Thanks Dumitru for the details! It is now quite clear and I am not
>> against
>>>> backporting it.
>>>> But I have a question regarding upgrading. For the existing traffic,
>> does
>>>> this mean the packets may choose a different path after OVN upgrading?
>>>> Should this be called out?
>>>>
>>>
>>> Hi Han,
>>>
>>> That's a good question.  They might and this affects all traffic that's
>>> not locally originated for the kernel datapath and (potentially) all
>>> traffic for the userspace datapath.  I can post a follow up patch to
>>> update the NEWS file.  Did you have anything else in mind?
>>>
>> Sounds good. I don't have anything else.
>>
> 
> I posted
> https://patchwork.ozlabs.org/project/ovn/patch/20230814135401.2677054-1-dce...@redhat.com/
> 

I backported the commit and the NEWS update to branches 23.06 and 23.03.
 It didn't apply cleanly on older branches but I think that's fine for
now.  We can backport it further later if users request it.

Regards,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to