On Fri, Aug 11, 2023 at 6:23 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 8/1/23 12:51, Han Zhou wrote:
> > On Tue, Jul 18, 2023 at 6:16 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>
> >> On 7/18/23 12:14, Han Zhou wrote:
> >>> On Mon, Jul 17, 2023 at 9:51 PM Dumitru Ceara <dce...@redhat.com>
wrote:
> >>>>
> >>>> On 7/14/23 08:41, Ales Musil wrote:
> >>>>> On Thu, Jul 13, 2023 at 4:39 PM Dumitru Ceara <dce...@redhat.com>
> > wrote:
> >>>>>
> >>>>>> Regular dp-hash is not a canonical L4 hash (at least with the
netlink
> >>>>>> datapath).  If the datapath supports l4 symmetrical dp-hash use
that
> >>> one
> >>>>>> instead.
> >>>>>>
> >>>>>> Reported-at: https://github.com/ovn-org/ovn/issues/112
> >>>>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2188679
> >>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
> >>>>>>
> >>>>>
> >>>>> Hi Dumitru,
> >>>>>
> >>>>>
> >>>>>> ---
> >>>>>>  include/ovn/features.h |  2 ++
> >>>>>>  lib/actions.c          |  6 ++++++
> >>>>>>  lib/features.c         | 49
> > +++++++++++++++++++++++++++++++++---------
> >>>>>>  3 files changed, 47 insertions(+), 10 deletions(-)
> >>>>>>
> >>>>>> diff --git a/include/ovn/features.h b/include/ovn/features.h
> >>>>>> index de8f1f5489..3bf536127f 100644
> >>>>>> --- a/include/ovn/features.h
> >>>>>> +++ b/include/ovn/features.h
> >>>>>> @@ -34,12 +34,14 @@ enum ovs_feature_support_bits {
> >>>>>>      OVS_CT_ZERO_SNAT_SUPPORT_BIT,
> >>>>>>      OVS_DP_METER_SUPPORT_BIT,
> >>>>>>      OVS_CT_TUPLE_FLUSH_BIT,
> >>>>>> +    OVS_DP_HASH_L4_SYM_BIT,
> >>>>>>  };
> >>>>>>
> >>>>>>  enum ovs_feature_value {
> >>>>>>      OVS_CT_ZERO_SNAT_SUPPORT = (1 <<
OVS_CT_ZERO_SNAT_SUPPORT_BIT),
> >>>>>>      OVS_DP_METER_SUPPORT = (1 << OVS_DP_METER_SUPPORT_BIT),
> >>>>>>      OVS_CT_TUPLE_FLUSH_SUPPORT = (1 << OVS_CT_TUPLE_FLUSH_BIT),
> >>>>>> +    OVS_DP_HASH_L4_SYM_SUPPORT = (1 << OVS_DP_HASH_L4_SYM_BIT),
> >>>>>>  };
> >>>>>>
> >>>>>>  void ovs_feature_support_destroy(void);
> >>>>>> diff --git a/lib/actions.c b/lib/actions.c
> >>>>>> index 037172e606..9d52cb75a8 100644
> >>>>>> --- a/lib/actions.c
> >>>>>> +++ b/lib/actions.c
> >>>>>> @@ -1625,6 +1625,12 @@ encode_SELECT(const struct ovnact_select
> >>> *select,
> >>>>>>      struct ds ds = DS_EMPTY_INITIALIZER;
> >>>>>>      ds_put_format(&ds, "type=select,selection_method=dp_hash");
> >>>>>>
> >>>>>> +    if (ovs_feature_is_supported(OVS_DP_HASH_L4_SYM_SUPPORT)) {
> >>>>>> +        /* Select dp-hash l4_symmetric by setting the upper 32bits
> > of
> >>>>>> +         * selection_method_param to 1: */
> >>>>>>
> >>>>>
> >>>>> This comment is a bit unfortunate, because it reads like you want to
> > set
> >>>>> all bits of the upper half
> >>>>> to 1 e.g. 0xffffffff. Maybe change it to: "selection_method_param to
> >>> value
> >>>>> 1 (1 << 32)." during merge, wdyt?
> >>>>>
> >>>>>
> >>>>
> >>>> Makes sense, thanks for the suggestion!  I changed it accordingly.
> >>>>
> >>>>>> +        ds_put_cstr(&ds, ",selection_method_param=0x100000000");
> >>>>>> +    }
> >>>>>> +
> >>>>>>      struct mf_subfield sf =
expr_resolve_field(&select->res_field);
> >>>>>>
> >>>>>>      for (size_t bucket_id = 0; bucket_id < select->n_dsts;
> >>> bucket_id++) {
> >>>>>> diff --git a/lib/features.c b/lib/features.c
> >>>>>> index 97c9c86f00..d24e8f6c5c 100644
> >>>>>> --- a/lib/features.c
> >>>>>> +++ b/lib/features.c
> >>>>>> @@ -21,6 +21,7 @@
> >>>>>>  #include "lib/dirs.h"
> >>>>>>  #include "socket-util.h"
> >>>>>>  #include "lib/vswitch-idl.h"
> >>>>>> +#include "odp-netlink.h"
> >>>>>>  #include "openvswitch/vlog.h"
> >>>>>>  #include "openvswitch/ofpbuf.h"
> >>>>>>  #include "openvswitch/rconn.h"
> >>>>>> @@ -33,20 +34,48 @@ VLOG_DEFINE_THIS_MODULE(features);
> >>>>>>
> >>>>>>  #define FEATURES_DEFAULT_PROBE_INTERVAL_SEC 5
> >>>>>>
> >>>>>> +/* Parses 'cap_name' from 'ovs_capabilities' and returns whether
the
> >>>>>> + * type of capability is supported or not. */
> >>>>>> +typedef bool ovs_feature_parse_func(const struct smap
> >>> *ovs_capabilities,
> >>>>>> +                                    const char *cap_name);
> >>>>>> +
> >>>>>>  struct ovs_feature {
> >>>>>>      enum ovs_feature_value value;
> >>>>>>      const char *name;
> >>>>>> +    ovs_feature_parse_func *parse;
> >>>>>>  };
> >>>>>>
> >>>>>> +static bool
> >>>>>> +bool_parser(const struct smap *ovs_capabilities, const char
> > *cap_name)
> >>>>>> +{
> >>>>>> +    return smap_get_bool(ovs_capabilities, cap_name, false);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static bool
> >>>>>> +dp_hash_l4_sym_support_parser(const struct smap *ovs_capabilities,
> >>>>>> +                              const char *cap_name OVS_UNUSED)
> >>>>>> +{
> >>>>>> +    int max_hash_alg = smap_get_int(ovs_capabilities,
> > "max_hash_alg",
> >>> 0);
> >>>>>> +
> >>>>>> +    return max_hash_alg == OVS_HASH_ALG_SYM_L4;
> >>>>>> +}
> >>>>>> +
> >>>>>>  static struct ovs_feature all_ovs_features[] = {
> >>>>>>      {
> >>>>>>          .value = OVS_CT_ZERO_SNAT_SUPPORT,
> >>>>>> -        .name = "ct_zero_snat"
> >>>>>> +        .name = "ct_zero_snat",
> >>>>>> +        .parse = bool_parser,
> >>>>>>      },
> >>>>>>      {
> >>>>>>          .value = OVS_CT_TUPLE_FLUSH_SUPPORT,
> >>>>>> -        .name = "ct_flush"
> >>>>>> -    }
> >>>>>> +        .name = "ct_flush",
> >>>>>> +        .parse = bool_parser,
> >>>>>> +    },
> >>>>>> +    {
> >>>>>> +        .value = OVS_DP_HASH_L4_SYM_SUPPORT,
> >>>>>> +        .name = "dp_hash_l4_sym_support",
> >>>>>> +        .parse = dp_hash_l4_sym_support_parser,
> >>>>>> +    },
> >>>>>>  };
> >>>>>>
> >>>>>>  /* A bitmap of OVS features that have been detected as
'supported'.
> > */
> >>>>>> @@ -65,6 +94,7 @@ ovs_feature_is_valid(enum ovs_feature_value
> > feature)
> >>>>>>      case OVS_CT_ZERO_SNAT_SUPPORT:
> >>>>>>      case OVS_DP_METER_SUPPORT:
> >>>>>>      case OVS_CT_TUPLE_FLUSH_SUPPORT:
> >>>>>> +    case OVS_DP_HASH_L4_SYM_SUPPORT:
> >>>>>>          return true;
> >>>>>>      default:
> >>>>>>          return false;
> >>>>>> @@ -183,18 +213,17 @@ ovs_feature_support_run(const struct smap
> >>>>>> *ovs_capabilities,
> >>>>>>      }
> >>>>>>
> >>>>>>      for (size_t i = 0; i < ARRAY_SIZE(all_ovs_features); i++) {
> >>>>>> -        enum ovs_feature_value value = all_ovs_features[i].value;
> >>>>>> -        const char *name = all_ovs_features[i].name;
> >>>>>> -        bool old_state = supported_ovs_features & value;
> >>>>>> -        bool new_state = smap_get_bool(ovs_capabilities, name,
> > false);
> >>>>>> +        struct ovs_feature *feature = &all_ovs_features[i];
> >>>>>> +        bool old_state = supported_ovs_features & feature->value;
> >>>>>> +        bool new_state = feature->parse(ovs_capabilities,
> >>> feature->name);
> >>>>>>          if (new_state != old_state) {
> >>>>>>              updated = true;
> >>>>>>              if (new_state) {
> >>>>>> -                supported_ovs_features |= value;
> >>>>>> +                supported_ovs_features |= feature->value;
> >>>>>>              } else {
> >>>>>> -                supported_ovs_features &= ~value;
> >>>>>> +                supported_ovs_features &= ~feature->value;
> >>>>>>              }
> >>>>>> -            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", name,
> >>>>>> +            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s",
> >>> feature->name,
> >>>>>>                           new_state ? "supported" : "not
supported");
> >>>>>>          }
> >>>>>>      }
> >>>>>> --
> >>>>>> 2.31.1
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> dev mailing list
> >>>>>> d...@openvswitch.org
> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>>>>>
> >>>>>>
> >>>>> Other than that it looks good.
> >>>>>
> >>>>> Acked-by: Ales Musil <amu...@redhat.com>
> >>>>>
> >>>>
> >>>> Thanks, I applied this to main.
> >>>>
> >>>> I was wondering if we should backport this though.  It's not exactly
a
> >>>> bug fix but it does avoid using a broken datapath feature (when
> >>>> possible) so it avoids a broken behavior.
> >>>>
> >>>> The changes are quite contained.
> >>>>
> >>>> CC-ing some of the maintainers explicitly to get some more eyes on
> > this.
> >>>>
> >>> Thanks Dumitru. Sorry that I was following the reported issue. I tried
> > to
> >>> look for some more details, but I am not authorized to access bug
> > 2175716.
> >>
> >> Sorry about that, that bug used to be public but some internal policies
> >> made it private.
> >>
> >>> It is not quite clear to me what the impact of this problem is. The
OVN
> >>> issue (112) seems to be a very common configuration of OVN ECMP and I
> >>> assume the problem is not easily reproduced because I have been
> >>> testing/deploying the feature and it worked as expected. Do you have
> > more
> >>> details about the trigger? Are there any related fixes in OVS and
> > kernel?
> >>>
> >>
> >> The original bug report came from OpenShift:
> >> https://issues.redhat.com/browse/OCPBUGS-7406
> >>
> >> CC-ing Patryk as he root caused the original issue.
> >>
> >> The simplified topology is:
> >>                               +--- GW1 ---+
> >> client ---- cluster-router ---+           +--- (out of OVN) --- server
> >>                               +--- GW2 ---+
> >>
> >> On cluster-router there's an ECMP route to reach 'server' via:
> >> - GW1
> >> - GW2
> >>
> >> On GW1 traffic going out of OVN is SNATed to IP1.
> >> On GW2 traffic going out of OVN is SNATed to IP2.
> >>
> >> The client initiates a TCP session towards the server; a path is chosen
> >> based on dp-hash, e.g. GW2.
> >>
> >> From this point on, during normal operation, traffic on this session
> >> flows fine via GW2.
> >>
> >> If the client closes the connection, then for the closing connection
> >> there's an exception when computing the dp-hash if the socket is in
> >> state TIME_WAIT, substate TCP_FIN_WAIT_2:
> >>
https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L2218
> >> In this last case we call:
> >>
> >> tcp_v4_timewait_ack(sk, skb)
> >> |
> >> tcp_v4_send_ack(sk, skb, ...)
> >>
> >> The latter uses a global, per-cpu "ctl_sk" to build the last ACK
packet:
> >> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L925
> >>
> >>         ctl_sk = this_cpu_read(ipv4_tcp_sk);
> >>         sock_net_set(ctl_sk, net);
> >>         ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
> >>                            inet_twsk(sk)->tw_mark : sk->sk_mark;
> >>         ctl_sk->sk_priority = (sk->sk_state == TCP_TIME_WAIT) ?
> >>                            inet_twsk(sk)->tw_priority :
sk->sk_priority;
> >>         transmit_time = tcp_transmit_time(sk);
> >>         ip_send_unicast_reply(ctl_sk,
> >>                               skb, &TCP_SKB_CB(skb)->header.h4.opt,
> >>                               ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
> >>                               &arg, arg.iov[0].iov_len,
> >>                               transmit_time);
> >>
> >> As ctl_sk's sk_txhash is 0, when transmitting this last ack, its hash
is
> >> computed from the packet contents:
> >>
> >
https://elixir.bootlin.com/linux/v6.2.1/source/include/linux/skbuff.h#L1540
> >>
> >> static inline __u32 skb_get_hash(struct sk_buff *skb)
> >> {
> >>         if (!skb->l4_hash && !skb->sw_hash)
> >>                 __skb_get_hash(skb);
> >>
> >>         return skb->hash;
> >> }
> >> [...]
> >> void __skb_get_hash(struct sk_buff *skb)
> >> {
> >>         struct flow_keys keys;
> >>         u32 hash;
> >>
> >>         __flow_hash_secret_init();
> >>
> >>         hash = ___skb_get_hash(skb, &keys, &hashrnd);
> >>
> >>         __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
> >> }
> >>
> >> But the original socket's txhash was actually a random hash:
> >> https://elixir.bootlin.com/linux/v6.2.1/source/include/net/sock.h#L2128
> >>
> >> static inline void sk_set_txhash(struct sock *sk)
> >> {
> >>         /* This pairs with READ_ONCE() in skb_set_hash_from_sk() */
> >>         WRITE_ONCE(sk->sk_txhash, net_tx_rndhash());
> >> }
> >>
> >> This means that the last ack might enter the ovs datapath with a
> >> different dp_hash than all other packets sent on that session in that
> >> direction. It's possible that this last ack gets hashed to a different
> >> path of the ECMP route.
> >>
> >> This specific issue was fixed in the kernel through:
> >>
> >>
> >
https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#mc84997c4cc755354e5527b41b7a15869536a21f4
> >>
> >> However, while debugging we realized the problem is actually visible
> >> with _any_ retransmit.  That's because a retransmit will trigger a
rehash:
> >>
> >>
https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c#L4124
> >>
> >> int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
> >> {
> >>         const struct tcp_request_sock_ops *af_ops =
> > tcp_rsk(req)->af_specific;
> >>         struct flowi fl;
> >>         int res;
> >>
> >>         /* Paired with WRITE_ONCE() in sock_setsockopt() */
> >>         if (READ_ONCE(sk->sk_txrehash) == SOCK_TXREHASH_ENABLED)
> >>                 tcp_rsk(req)->txhash = net_tx_rndhash();
> >>         res = af_ops->send_synack(sk, NULL, &fl, req, NULL,
> > TCP_SYNACK_NORMAL,
> >>                                   NULL);
> >> [...]
> >>
> >> So suddenly all subsequent packets (skbs) will have a different 'hash'
> >> value and might get hashed to a different path.
> >>
> >> This last part was not fixed.  After a discussion on the netdev mailing
> >> list the conclusion was to not change the hash behavior for locally
> >> terminated TCP sessions:
> >>
> >>
> >
https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#m1fd8da6d912d5acead66c4c397396e6ac4b160dd
> >>
> >> Eventually the documentation change patch was dropped and a v2 of the
> >> other fixes was accepted:
> >> https://www.spinics.net/lists/netdev/msg906252.html
> >>
> >> The rehash on retransmit problem is still present though making the
> >> kernel's dp-hash implementation unusable.
> >>
> >> During some internal discussions Aaron (in CC) pointed out that OVS
> >> supports L4 symmetric hashing and that should be a stable L4 hash.  OVS
> >> already reports the maximum supported hashing algorithm through the
> >> Datapath.Capabilities.max_hash_alg ovsdb field.  The only problem was
> >> that the kernel datapath didn't support L4 symmetric hashing.  But
Aaron
> >> posted a patch to implement that and the patch got accepted:
> >>
> >> https://www.spinics.net/lists/netdev/msg911045.html
> >>
> >> Then we moved OVN to use L4 symmetric hashing if available.  That's to
> >> ensure correctness of the packet processing.  We did run performance
> >> tests and the hit due to the extra hashing was acceptable, max 3%
> >> throughput degradation in OVN scenarios that simulate common traffic
> >> patterns (like ovn-kubernetes ones) - in most cases the performance
> >> degradation was around 2%.
> >>
> >> Sorry for the long story but I hope it provides all the required
> >> context.  Thinking more about it I should've probably added some of
this
> >> to the original commit log, I apologize.
> >>
> >> Regards,
> >> Dumitru
> >>
> >
> > Thanks Dumitru for the details! It is now quite clear and I am not
against
> > backporting it.
> > But I have a question regarding upgrading. For the existing traffic,
does
> > this mean the packets may choose a different path after OVN upgrading?
> > Should this be called out?
> >
>
> Hi Han,
>
> That's a good question.  They might and this affects all traffic that's
> not locally originated for the kernel datapath and (potentially) all
> traffic for the userspace datapath.  I can post a follow up patch to
> update the NEWS file.  Did you have anything else in mind?
>
Sounds good. I don't have anything else.

Thanks,
Han

> Regards,
> Dumitru
>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to