Hello, Thanks for working on this.
The change for setting the selection fields for Load_Balancer has been merged upstream [1]. This change helped a lot and OVN Octavia provider driver (master) is now able to specify those fields [2] and as effect of it - is able to be properly tested with tempest tests [3]. The OVN Octavia bug described here [4], which is solved by the selection fields, is still valid for stable releases and the backport will help. The original behaviour wasn't changed, because if the selection fields in Load_Balancer row are not set, then it will still use 5-tuple hash as previously. Can I ask You to cherry-pick that change to the latest stable branch-20.03? [1] https://github.com/ovn-org/ovn/commit/5af304e7478adcf5ac50ed41e96a55bebebff3e8 [2] https://review.opendev.org/#/c/726787 [3] https://review.opendev.org/#/c/714004/ [4] https://bugs.launchpad.net/neutron/+bug/1871239 On Mon, May 4, 2020 at 3:40 AM Han Zhou <zhou...@gmail.com> wrote: > > > On Thu, Apr 30, 2020 at 4:12 AM Tonghao Zhang <xiangxia.m....@gmail.com> > wrote: > > > > On Thu, Apr 30, 2020 at 2:58 PM Han Zhou <zhou...@gmail.com> wrote: > > > > > > Thanks Maciej for testing. > > > > > > On Tue, Apr 28, 2020 at 5:36 AM Maciej Jozefczyk <mjoze...@redhat.com> > wrote: > > > > Here are my findings. > > > > > > > > 1) Test LB and sleep 1 second between calls: > > > > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 1 > > > > > > > > result: http://paste.openstack.org/show/792818/ > > > > Different backeds are selected and different buckets are being hit > in group_id=3. Sometimes the bucket1 is hit, sometimes bucket0. > > > > Output from groups dumps during the test: > http://paste.openstack.org/show/792820/ > > > > > > > > > > > > 2) Test LB and sleep 60 second between calls: > > > > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 60 > > > > > > > > Result: http://paste.openstack.org/show/792822/ > > > > Output from group stats: http://paste.openstack.org/show/792823/ > > > > Always one bucket is hit (bucket0) and requests are pointed to the > same backend. > > > > > > > > > > This test result proved our earlier analysis: different hash values > generated between kernel fastpath and userspace slowpath was the culprit. > > > > > > > > > > > > > > > On Fri, Apr 24, 2020 at 6:09 PM Ilya Maximets <i.maxim...@ovn.org> > wrote: > > > >> > > > >> On 4/24/20 3:19 AM, Han Zhou wrote: > > > >> > Based on the discussion in OVN meeting today I did some more > testing, and here are my findings. > > > >> > > > > >> > - With ICMP (ping) between same source and destination it is > always same bucket selected by dp_hash. > > > >> > - With "nc" specifying same TCP 5-tuples, the packets can end up > into different buckets. This is similar to what Numan and Maciej observed. > > > >> > > > > >> > However, I was using the OVN ECMP feature to test instead of LB. > Since ECMP feature doesn't use conntrack, here are some more findings. The > bucket selection changes only between 2 buckets, and the change happens > when the packet datapath changes between userspace and kernel datapath. > Let's say the first packet of a flow (megaflow) goes to userspace, it hits > bucket1, and then if I send the second and more packets immediately they > will all hit bucket2, but if I wait for a while until the flow disappears > from the megaflow cache and then send the next packet, it will hit bucket1 > again. This behavior is consistent. > > > >> > > > > >> > So I think it might be because of the different implementation of > dp_hash in userspace and kernel datapath, the different buckets were > selected (thanks Ilya for this hint in today's meeting). > > > >> > Numan/Maciej, in your tests did you see more than 2 buckets hit > for same 5-tuples? If the above theory is right, you should see at most 2 > buckets hit. For LB, since it uses CT and only the first packet uses the > group, all packets of the same flow would always be forwarded to same LB > backend. I guess if you wait long enough between the tests, you should see > all tests hitting same backend. It would be great if you could confirm this. > > > >> > > > > >> > For ECMP, this behavior will cause occasional packets out of > order even for a single flow (for a burst of packets after some idle time), > because CT is not used (and we can't use it because when peered with > physical ECMP router groups we can't ensure the return traffic from > physical routers hits same LR). > > > >> > > > > >> > For LB it causes the unexpected behavior that is reported in this > thread. > > > >> > > > > >> > For the fix, I think we should figure out how to make sure > dp_hash always uses same hash algorithm for both userspace and kernel > implementation, if possible. > > > >> > I am ok with the patch from Numan for the capability of > configuring the desired hash method instead of always using default. > However, using "hash" may be a huge performance sacrifice since the packets > are always handled in slowpath, especially for ECMP. Even though LB uses > CT, for short-lived flow scenario this is still a big performance penalty > (for long lived flows of LB it may be ok since the majority of packets are > still in fastpath). > > > >> > > > > >> > I am not familiar with the dp_hash implementation. I will do some > more study, but any idea on how to ensure the consistency of dp_hash is > highly appreciated! > > > >> > > > >> I had an impression that packet hash is routed from the datapath to > userspace > > > >> and back, but it turned out that it's really recent change. Seems > like following > > > >> changes are required: > > > >> > > > >> 1. Linux kernel: bd1903b7c459 ("net: openvswitch: add hash info to > upcall") > > > >> This is avaialble starting from upstream kernel v5.5. > > > >> > > > >> 2. OVS: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute > back to datapath.") > > > >> This is available on branch-2.13. > > > >> > > > >> With above two patches first and subsequent packets should have > same dp_hash > > > >> calculated by the kernel datapath. > > > >> > > > > > > Thanks Ilya for this information! Do you know why the kernel fix (1) > was not backported to OVS repo? Was it missed or on purpose? cc Tonghao > Zhang. > > > I had a test using ovn-fake-multinode, with OVS master, the problem > disappeared. However I am using an old kernel module and I don't think the > patch (1) is there. > > My most patches were applied in linux kernel upstream, and I don't > > know who backports that, but I am fine to backport it after holiday. > > Thanks. > > Thanks Tonghao! I backported your kernel patch and related fix here, > please take a look: > https://patchwork.ozlabs.org/project/openvswitch/patch/1588554154-30608-1-git-send-email-hz...@ovn.org/ > > I tested it on the same environment where I could reproduce the issue with > OVN ECMP. Here is the new behavior: > > - With a single connection (e.g. SSH), all packets are hitting the same > bucket now, now matter if datapath megaflow exist or not. This solves the > potential out of order problem for ECMP. > > - With "nc" to test same 5-tuple but different connections, still > different buckets were selected. I think it is probably because the hash > used now by dp_hash is generated by the socket, which is random, as > mentioned by the commit message of the kernel patch. It is not a problem > for ECMP, but it would not solve the problem initially brought up by this > thread, for the requirement of the special LB use cases. > > To solve the LB problem with dp_hash (as a further improvement), can we > support calculating hash in datapath instead of from skb, while using the > same mechanism used by this patch to ensure the same hash value is used for > the upcall handling? > > P.S. another problem for LB use case is that there could be many backends > (more than 64). The current dp_hash implementation has a limitation that if > there are more than 64 hash values required, it falls back to the original > "hash" algorithm, which satisfies the same 5-tuple, same bucket need but > could have performance problem. The limitation mainly comes from the > requirement of supporting different weight of the bucket, using the Webster > method. However, the OVN LB doesn't even use the weight (they are all > equal). > > Thanks, > Han > > > > >> > > > >> > > > > >> > Thanks, > > > >> > Han > > > >> > > > > >> > On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez < > dalva...@redhat.com <mailto:dalva...@redhat.com>> wrote: > > > >> >> > > > >> >> Thanks Numan for the investigation and the great explanation! > > > >> >> > > > >> >> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <num...@ovn.org > <mailto:num...@ovn.org>> wrote: > > > >> >>> > > > >> >>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <zhou...@gmail.com > <mailto:zhou...@gmail.com>> wrote: > > > >> >>> > > > > >> >>> > > > > >> >>> > > > > >> >>> > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk < > mjoze...@redhat.com <mailto:mjoze...@redhat.com>> wrote: > > > >> >>> > > > > > >> >>> > > Hello! > > > >> >>> > > > > > >> >>> > > I would like to ask you to clarify how the OVN Load > balancing algorithm works. > > > >> >>> > > > > > >> >>> > > Based on the action [1]: > > > >> >>> > > 1) If connection is alive the same 'backend' will be chosen, > > > >> >>> > > > > > >> >>> > > 2) If it is a new connection the backend will be chosen > based on selection_method=dp_hash [2]. > > > >> >>> > > Based on changelog the dp_hash uses '5 tuple hash' [3]. > > > >> >>> > > The hash is calculated based on values: source and > destination IP, source port, protocol and arbitrary value - 42. [4] > > > >> >>> > > Based on that information we could name it SOURCE_IP_PORT. > > > >> >>> > > > > > >> >>> > > Unfortunately we recently got a bug report in OVN Octavia > provider driver project, that the Load Balancing in OVN > > > >> >>> > > works differently [5]. The report shows even when the test > uses the same source ip and port, but new TCP connection, > > > >> >>> > > traffic is randomly distributed, but based on [2] it > shouldn't? > > > >> >>> > > > > > >> >>> > > Is it a bug? Is something else taken to account while > creating a hash? Can it be fixed in OVS/OVN? > > > >> >>> > > > > > >> >>> > > > > > >> >>> > > > > > >> >>> > > Thanks, > > > >> >>> > > Maciej > > > >> >>> > > > > > >> >>> > > > > > >> >>> > > [1] > https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017 > > > >> >>> > > [2] > https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059 > > > >> >>> > > [3] > https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371 > > > >> >>> > > [4] > https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217 > > > >> >>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239 > > > >> >>> > > > > > >> >>> > > -- > > > >> >>> > > Best regards, > > > >> >>> > > Maciej Józefczyk > > > >> >>> > > _______________________________________________ > > > >> >>> > > discuss mailing list > > > >> >>> > > disc...@openvswitch.org <mailto:disc...@openvswitch.org> > > > >> >>> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss > > > >> >>> > > > > >> >>> > Hi Maciej, > > > >> >>> > > > > >> >>> > Thanks for reporting. It is definitely strange that same > 5-tuple flow resulted in hitting different backends. I didn't observed such > behavior before (maybe I should try again myself to confirm). Can you make > sure during the testing the group bucket didn't change? You can do so by: > > > >> >>> > # ovs-ofctl dump-groups br-int > > > >> >>> > and also check the group stats and see if multiple buckets > has counter increased during the test > > > >> >>> > # ovs-ofctl dump-group-stats br-int [group] > > > >> >>> > > > > >> >>> > For the 5-tuple hash function you are seeing > flow_hash_5tuple(), it is using all the 5-tuples. It adds both ports (src > and dst) at once: > > > >> >>> > /* Add both ports at once. */ > > > >> >>> > hash = hash_add(hash, > > > >> >>> > ((const uint32_t > *)flow)[offsetof(struct flow, tp_src) > > > >> >>> > / > sizeof(uint32_t)]); > > > >> >>> > > > > >> >>> > The tp_src is the start of the offset, and the size is 32, > meaning both src and dst, each is 16 bits. (Although I am not sure if > dp_hash method is using this function or not. Need to check more code) > > > >> >>> > > > > >> >>> > BTW, I am not sure why Neutron give it the name > SOURCE_IP_PORT. Shall it be called just 5-TUPLE, since protocol, > destination IP and PORT are also considered in the hash. > > > >> >>> > > > > >> >>> > > > >> >>> > > > >> >>> Hi Maciej and Han, > > > >> >>> > > > >> >>> I did some testing and I can confirm as you're saying. OVN is > not > > > >> >>> choosing the same backend with the src ip, src port fixed. > > > >> >>> > > > >> >>> I think there is an issue with OVN on how it is programming the > group > > > >> >>> flows. OVN is setting the selection_method as dp_hash. > > > >> >>> But when ovs-vswitchd receives the GROUP_MOD openflow message, > I > > > >> >>> noticed that the selection_method is not set. > > > >> >>> From the code I see that selection_method will be encoded only > if > > > >> >>> ovn-controller uses openflow version 1.5 [1] > > > >> >>> > > > >> >>> Since selection_method is NULL, vswitchd uses the dp_hash > method [2]. > > > >> >>> dp_hash means it uses the hash calculated by > > > >> >>> the datapath. In the case of kernel datapath, from what I > understand > > > >> >>> it uses skb_get_hash(). > > > >> >>> > > > >> >>> I modified the vswitchd code to use the selection_method "hash" > if > > > >> >>> selection_method is not set. In this case the load balancer > > > >> >>> works as expected. For a fixed src ip, src port, dst ip and dst > port, > > > >> >>> the group action is selecting the same bucket always. [3] > > > >> >>> > > > >> >>> I think we need to fix a few issues in OVN > > > >> >>> - Use openflow 1.5 so that ovn can set selection_method > > > >> >>> - Use "hash" method if dp_hash is not choosing the same > bucket for > > > >> >>> 5-tuple hash. > > > >> >>> - May be provide the option for the CMS to choose an > algorithm i.e. > > > >> >>> to use dp_hash or hash. > > > >> >>> > > > >> >> I'd rather not expose this to the CMS as it depends on the > datapath implementation as per [0] but maybe it makes sense to eventually > abstract it to the CMS in a more LB-ish way (common algorithm names used in > load balancing) in the case at some point the LB feature is enhanced > somehow to support more algorithms. > > > >> >> > > > >> >> I believe that for OVN LB users, using OF 1.5 to force the use > of 'hash' would be the best solution now. > > > >> >> > > > >> >> My 2 cents as I'm not an LB expert. > > > >> >> > > > >> >> I also recall that we tested this in the past and seemed to be > working. I have been checking further in the doc [0] and found this > paragraph: > > > >> >> > > > >> >> "If no selection method is specified, Open vSwitch up to release > 2.9 applies the hash method with default fields. From 2.10 onwards Open > vSwitch defaults to the dp_hash method with symmetric L3/L4 hash algorithm, > unless the weighted group buck‐ ets cannot be mapped to a maximum of 64 > dp_hash values with sufficient accuracy. In those rare cases Open vSwitch > 2.10 and later fall back to the hash method with the default set of hash > fields." > > > >> >> > > > >> >> The explanation seems to be that when we tested the feature we > relied on OVS 2.9 and hence the confusion. > > > >> >> > > > >> >> Thanks a lot again! > > > >> >> Daniel > > > >> >> > > > >> >> [0] > http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html > > > >> >> > > > >> >>> > > > >> >>> I'll look into it on how to support this. > > > >> >>> > > > >> >>> [1] - > https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120 > > > >> >>> > https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082 > > > >> >>> > > > >> >>> [2] - > https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108 > > > >> >>> [3] - > https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553 > > > >> >>> > > > >> >>> > > > >> >>> Thanks > > > >> >>> Numan > > > >> >>> > > > >> >>> > > > >> >>> > Thanks, > > > >> >>> > Han > > > >> >>> > _______________________________________________ > > > >> >>> > discuss mailing list > > > >> >>> > disc...@openvswitch.org <mailto:disc...@openvswitch.org> > > > >> >>> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss > > > >> >>> _______________________________________________ > > > >> >>> discuss mailing list > > > >> >>> disc...@openvswitch.org <mailto:disc...@openvswitch.org> > > > >> >>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss > > > >> > > > >> _______________________________________________ > > > >> discuss mailing list > > > >> disc...@openvswitch.org > > > >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss > > > > > > > > > > > > > > > > -- > > > > Best regards, > > > > Maciej Józefczyk > > > > > > > > -- > > Best regards, Tonghao > -- Best regards, Maciej Józefczyk
_______________________________________________ discuss mailing list disc...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-discuss