On Thu, Apr 30, 2020 at 2:58 PM Han Zhou <[email protected]> wrote:
>
> Thanks Maciej for testing.
>
> On Tue, Apr 28, 2020 at 5:36 AM Maciej Jozefczyk <[email protected]> wrote:
> > Here are my findings.
> >
> > 1) Test LB and sleep 1 second between calls:
> > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 1
> >
> > result: http://paste.openstack.org/show/792818/
> > Different backends are selected and different buckets are being hit in group_id=3. Sometimes bucket1 is hit, sometimes bucket0.
> > Output from the group dumps during the test:
> > http://paste.openstack.org/show/792820/
> >
> > 2) Test LB and sleep 60 seconds between calls:
> > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 60
> >
> > Result: http://paste.openstack.org/show/792822/
> > Output from group stats: http://paste.openstack.org/show/792823/
> > Always one bucket is hit (bucket0) and requests are directed to the same backend.
>
> This test result proved our earlier analysis: the different hash values generated by the kernel fastpath and the userspace slowpath were the culprit.
>
> > On Fri, Apr 24, 2020 at 6:09 PM Ilya Maximets <[email protected]> wrote:
> >>
> >> On 4/24/20 3:19 AM, Han Zhou wrote:
> >> > Based on the discussion in the OVN meeting today I did some more testing, and here are my findings.
> >> >
> >> > - With ICMP (ping) between the same source and destination, it is always the same bucket that is selected by dp_hash.
> >> > - With "nc" specifying the same TCP 5-tuple, the packets can end up in different buckets. This is similar to what Numan and Maciej observed.
> >> >
> >> > However, I was using the OVN ECMP feature to test instead of LB. Since the ECMP feature doesn't use conntrack, here are some more findings. The bucket selection changes only between 2 buckets, and the change happens when the packet datapath changes between the userspace and kernel datapaths. Let's say the first packet of a flow (megaflow) goes to userspace and hits bucket1; if I send the second and further packets immediately, they will all hit bucket2, but if I wait until the flow disappears from the megaflow cache and then send the next packet, it will hit bucket1 again. This behavior is consistent.
> >> >
> >> > So I think it is because of the different implementations of dp_hash in the userspace and kernel datapaths that different buckets were selected (thanks Ilya for this hint in today's meeting).
> >> > Numan/Maciej, in your tests did you see more than 2 buckets hit for the same 5-tuple? If the above theory is right, you should see at most 2 buckets hit. For LB, since it uses CT and only the first packet uses the group, all packets of the same flow would always be forwarded to the same LB backend. I guess if you wait long enough between the tests, you should see all tests hitting the same backend. It would be great if you could confirm this.
> >> >
> >> > For ECMP, this behavior will cause occasional out-of-order packets even for a single flow (for a burst of packets after some idle time), because CT is not used (and we can't use it, because when peered with physical ECMP router groups we can't ensure that the return traffic from the physical routers hits the same LR).
> >> >
> >> > For LB it causes the unexpected behavior that is reported in this thread.
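Just to illustrate the effect Han describes above, here is a toy sketch. It is not the actual OVS bucket-selection logic; the two hash functions and the plain modulo mapping are invented for the example. The point is only that if the slowpath and the fastpath compute different hashes for the same 5-tuple, a select group can map the very same flow to different buckets depending on which path handled the packet:

    /* Toy illustration only: not OVS code. */
    #include <stdint.h>
    #include <stdio.h>

    /* Invented stand-in for the hash used on the userspace slowpath. */
    static uint32_t userspace_hash(uint32_t five_tuple)
    {
        return five_tuple ^ (five_tuple >> 16);
    }

    /* Invented stand-in for the kernel fastpath hash (e.g. skb_get_hash()). */
    static uint32_t kernel_hash(uint32_t five_tuple)
    {
        return five_tuple * 2654435761u;
    }

    int main(void)
    {
        uint32_t five_tuple = 0x00010000; /* same 5-tuple, folded into one word */
        unsigned int n_buckets = 2;

        /* First packet: upcall, bucket derived from the userspace hash. */
        printf("slowpath bucket: %u\n", userspace_hash(five_tuple) % n_buckets);
        /* Later packets: megaflow hit, bucket derived from the kernel hash. */
        printf("fastpath bucket: %u\n", kernel_hash(five_tuple) % n_buckets);
        return 0;
    }

With the two changes Ilya lists below (kernel bd1903b7c459 plus OVS 0442bfb11d6c), the hash computed by the kernel datapath is carried with the upcall and echoed back, so both paths should end up using the same dp_hash value.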
> >> >
> >> > For the fix, I think we should figure out how to make sure dp_hash always uses the same hash algorithm in both the userspace and kernel implementations, if possible.
> >> > I am ok with the patch from Numan for the capability of configuring the desired hash method instead of always using the default. However, using "hash" may be a huge performance sacrifice since the packets are always handled in the slowpath, especially for ECMP. Even though LB uses CT, for the short-lived-flow scenario this is still a big performance penalty (for long-lived LB flows it may be ok since the majority of packets are still in the fastpath).
> >> >
> >> > I am not familiar with the dp_hash implementation. I will do some more study, but any idea on how to ensure the consistency of dp_hash is highly appreciated!
> >>
> >> I had the impression that the packet hash is routed from the datapath to userspace and back, but it turned out that it's a really recent change. It seems the following changes are required:
> >>
> >> 1. Linux kernel: bd1903b7c459 ("net: openvswitch: add hash info to upcall")
> >>    This is available starting from upstream kernel v5.5.
> >>
> >> 2. OVS: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute back to datapath.")
> >>    This is available on branch-2.13.
> >>
> >> With the above two patches the first and subsequent packets should have the same dp_hash, calculated by the kernel datapath.
> >>
>
> Thanks Ilya for this information! Do you know why the kernel fix (1) was not backported to the OVS repo? Was it missed or on purpose? cc Tonghao Zhang.
> I ran a test using ovn-fake-multinode with OVS master, and the problem disappeared. However, I am using an old kernel module and I don't think patch (1) is there.

Most of my patches were applied in the upstream Linux kernel, and I don't know who backports them, but I am fine to backport it after the holiday. Thanks.

> >> >
> >> > Thanks,
> >> > Han
> >> >
> >> > On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez <[email protected]> wrote:
> >> >>
> >> >> Thanks Numan for the investigation and the great explanation!
> >> >>
> >> >> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <[email protected]> wrote:
> >> >>>
> >> >>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <[email protected]> wrote:
> >> >>> >
> >> >>> > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk <[email protected]> wrote:
> >> >>> > >
> >> >>> > > Hello!
> >> >>> > >
> >> >>> > > I would like to ask you to clarify how the OVN load balancing algorithm works.
> >> >>> > >
> >> >>> > > Based on the action [1]:
> >> >>> > > 1) If the connection is alive, the same 'backend' will be chosen.
> >> >>> > >
> >> >>> > > 2) If it is a new connection, the backend will be chosen based on selection_method=dp_hash [2].
> >> >>> > > Based on the changelog, dp_hash uses a '5-tuple hash' [3].
> >> >>> > > The hash is calculated based on these values: source and destination IP, source port, protocol, and an arbitrary value - 42 [4].
> >> >>> > > Based on that information we could name it SOURCE_IP_PORT.
> >> >>> > >
> >> >>> > > Unfortunately we recently got a bug report in the OVN Octavia provider driver project that the load balancing in OVN works differently [5].
> >> >>> > > The report shows that even when the test uses the same source IP and port, but a new TCP connection, traffic is randomly distributed, whereas based on [2] it shouldn't be.
> >> >>> > >
> >> >>> > > Is it a bug? Is something else taken into account while creating the hash? Can it be fixed in OVS/OVN?
> >> >>> > >
> >> >>> > > Thanks,
> >> >>> > > Maciej
> >> >>> > >
> >> >>> > > [1] https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017
> >> >>> > > [2] https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059
> >> >>> > > [3] https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371
> >> >>> > > [4] https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217
> >> >>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239
> >> >>> > >
> >> >>> > > --
> >> >>> > > Best regards,
> >> >>> > > Maciej Józefczyk
> >> >>> >
> >> >>> > Hi Maciej,
> >> >>> >
> >> >>> > Thanks for reporting. It is definitely strange that the same 5-tuple flow resulted in hitting different backends. I didn't observe such behavior before (maybe I should try again myself to confirm). Can you make sure that the group buckets didn't change during the testing? You can do so by:
> >> >>> > # ovs-ofctl dump-groups br-int
> >> >>> > and also check the group stats and see whether multiple buckets have their counters increased during the test:
> >> >>> > # ovs-ofctl dump-group-stats br-int [group]
> >> >>> >
> >> >>> > For the 5-tuple hash function you are seeing, flow_hash_5tuple(), it is using all of the 5-tuple. It adds both ports at once:
> >> >>> >     /* Add both ports at once. */
> >> >>> >     hash = hash_add(hash,
> >> >>> >                     ((const uint32_t *)flow)[offsetof(struct flow, tp_src)
> >> >>> >                                              / sizeof(uint32_t)]);
> >> >>> >
> >> >>> > The tp_src is the start of the offset, and the size is 32 bits, meaning both src and dst ports, each 16 bits. (Although I am not sure whether the dp_hash method is using this function or not; I need to check more code.)
> >> >>> >
> >> >>> > BTW, I am not sure why Neutron gave it the name SOURCE_IP_PORT. Shouldn't it be called just 5-TUPLE, since the protocol, destination IP and port are also considered in the hash?
> >> >>> >
> >> >>>
> >> >>> Hi Maciej and Han,
> >> >>>
> >> >>> I did some testing and I can confirm what you're saying. OVN is not choosing the same backend with the src IP and src port fixed.
> >> >>>
> >> >>> I think there is an issue with how OVN is programming the group flows. OVN is setting the selection_method as dp_hash, but when ovs-vswitchd receives the GROUP_MOD OpenFlow message, I noticed that the selection_method is not set.
> >> >>> From the code I see that selection_method will be encoded only if ovn-controller uses OpenFlow version 1.5 [1].
> >> >>>
> >> >>> Since selection_method is NULL, vswitchd uses the dp_hash method [2]. dp_hash means it uses the hash calculated by the datapath.
> >> >>> In the case of the kernel datapath, from what I understand, it uses skb_get_hash().
> >> >>>
> >> >>> I modified the vswitchd code to use the selection_method "hash" if selection_method is not set. In this case the load balancer works as expected. For a fixed src IP, src port, dst IP and dst port, the group action always selects the same bucket [3].
> >> >>>
> >> >>> I think we need to fix a few issues in OVN:
> >> >>> - Use OpenFlow 1.5 so that OVN can set selection_method.
> >> >>> - Use the "hash" method if dp_hash is not choosing the same bucket for a given 5-tuple.
> >> >>> - Maybe provide an option for the CMS to choose the algorithm, i.e. whether to use dp_hash or hash.
> >> >>>
> >> >> I'd rather not expose this to the CMS, as it depends on the datapath implementation as per [0], but maybe it makes sense to eventually abstract it to the CMS in a more LB-ish way (common algorithm names used in load balancing) in case at some point the LB feature is enhanced to support more algorithms.
> >> >>
> >> >> I believe that for OVN LB users, using OF 1.5 to force the use of 'hash' would be the best solution now.
> >> >>
> >> >> My 2 cents, as I'm not an LB expert.
> >> >>
> >> >> I also recall that we tested this in the past and it seemed to be working. I have been checking further in the doc [0] and found this paragraph:
> >> >>
> >> >> "If no selection method is specified, Open vSwitch up to release 2.9 applies the hash method with default fields. From 2.10 onwards Open vSwitch defaults to the dp_hash method with symmetric L3/L4 hash algorithm, unless the weighted group buckets cannot be mapped to a maximum of 64 dp_hash values with sufficient accuracy. In those rare cases Open vSwitch 2.10 and later fall back to the hash method with the default set of hash fields."
> >> >>
> >> >> The explanation seems to be that when we tested the feature we relied on OVS 2.9, hence the confusion.
> >> >>
> >> >> Thanks a lot again!
> >> >> Daniel
> >> >>
> >> >> [0] http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html
> >> >>
> >> >>> I'll look into how to support this.
> >> >>>
> >> >>> [1] - https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120
> >> >>>       https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082
> >> >>> [2] - https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108
> >> >>> [3] - https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553
> >> >>>
> >> >>> Thanks
> >> >>> Numan
> >> >>>
> >> >>> > Thanks,
> >> >>> > Han
> >
> > --
> > Best regards,
> > Maciej Józefczyk
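By the way, for anyone who wants to experiment with the selection_method approach Numan describes above without patching OVN, a select group with an explicit hash method can be programmed by hand. The command below is only an illustration (the group id, hash fields and bucket actions are made up, and selection_method needs OpenFlow 1.5 on the wire), roughly:

    ovs-ofctl -O OpenFlow15 mod-group br-int 'group_id=3,type=select,selection_method=hash,fields(ip_src,ip_dst,tcp_src,tcp_dst),bucket=output:10,bucket=output:11'

After that, ovs-ofctl dump-groups should show selection_method=hash on the group, and for a fixed 5-tuple ovs-ofctl dump-group-stats should show only one bucket's counters increasing, matching the behavior Numan saw with his vswitchd change.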
--
Best regards,
Tonghao
