Thanks Maciej for testing.

On Tue, Apr 28, 2020 at 5:36 AM Maciej Jozefczyk <[email protected]>
wrote:
> Here are my findings.
>
> 1) Test LB and sleep 1 second between calls:
> ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 1
>
> result: http://paste.openstack.org/show/792818/
> Different backends are selected and different buckets are being hit in
group_id=3. Sometimes bucket1 is hit, sometimes bucket0.
> Output from the group dumps during the test:
http://paste.openstack.org/show/792820/
>
>
> 2) Test LB and sleep 60 second between calls:
> ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 60
>
> Result: http://paste.openstack.org/show/792822/
> Output from group stats: http://paste.openstack.org/show/792823/
> Always the same bucket is hit (bucket0) and requests are directed to the same
backend.
>

This test result confirms our earlier analysis: the different hash values
generated by the kernel fastpath and the userspace slowpath were the culprit.
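
Just to illustrate the failure mode with a toy model (this is purely
illustrative code, not the real OVS or kernel hash functions): if the
slowpath and the fastpath compute the packet hash with different functions,
the same 5-tuple can map to different buckets of the select group, e.g.:

/* Toy illustration only: two stand-in hash functions for the userspace
 * slowpath and the kernel fastpath (skb_get_hash()).  The real
 * implementations are different; the point is just that two independent
 * hash computations over the same 5-tuple need not agree, so the bucket
 * chosen for the first (slowpath) packet can differ from the bucket
 * chosen for subsequent (fastpath) packets. */
#include <stdint.h>
#include <stdio.h>

struct tuple5 {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
};

/* Stand-in for the hash used on the first (slowpath) packet. */
static uint32_t
toy_slowpath_hash(const struct tuple5 *t)
{
    uint32_t h = 42;                               /* arbitrary basis */
    h = (h ^ t->src_ip) * 0x9e3779b1u;
    h = (h ^ t->dst_ip) * 0x9e3779b1u;
    h = (h ^ (((uint32_t) t->src_port << 16) | t->dst_port)) * 0x9e3779b1u;
    h = (h ^ t->proto) * 0x9e3779b1u;
    return h;
}

/* Stand-in for the hash used on subsequent (fastpath) packets. */
static uint32_t
toy_fastpath_hash(const struct tuple5 *t)
{
    uint32_t h = 0;                                /* different seed/mixing */
    h ^= t->src_ip * 31u + t->dst_ip * 131u;
    h ^= (((uint32_t) t->dst_port << 16) | t->src_port) * 2654435761u;
    h ^= t->proto;
    return h;
}

int
main(void)
{
    /* e.g. 10.0.0.5:54321 -> 172.24.4.58:60092, TCP */
    struct tuple5 t = { 0x0a000005, 0xac18043a, 54321, 60092, 6 };
    unsigned int n_buckets = 2;

    /* Same 5-tuple, two independent hashes -> the buckets may disagree. */
    printf("slowpath picks bucket %u\n", toy_slowpath_hash(&t) % n_buckets);
    printf("fastpath picks bucket %u\n", toy_fastpath_hash(&t) % n_buckets);
    return 0;
}

The real dp_hash bucket mapping is more involved than a simple modulo, but
the mismatch problem is the same.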

>
>
> On Fri, Apr 24, 2020 at 6:09 PM Ilya Maximets <[email protected]> wrote:
>>
>> On 4/24/20 3:19 AM, Han Zhou wrote:
>> > Based on the discussion in the OVN meeting today I did some more testing,
and here are my findings.
>> >
>> > - With ICMP (ping) between the same source and destination, dp_hash always
selects the same bucket.
>> > - With "nc" specifying same TCP 5-tuples, the packets can end up into
different buckets. This is similar to what Numan and Maciej observed.
>> >
>> > However, I was using the OVN ECMP feature to test instead of LB. Since the
ECMP feature doesn't use conntrack, here are some more findings. The bucket
selection alternates between only 2 buckets, and the change happens when the
packet's path switches between the userspace and kernel datapaths. Say the
first packet of a flow (megaflow) goes to userspace and hits bucket1; if I
send the second and subsequent packets immediately they all hit bucket2, but
if I wait until the flow disappears from the megaflow cache and then send the
next packet, it hits bucket1 again. This behavior is consistent.
>> >
>> > So I think the different buckets might have been selected because of the
differing implementations of dp_hash in the userspace and kernel datapaths
(thanks Ilya for this hint in today's meeting).
>> > Numan/Maciej, in your tests did you see more than 2 buckets hit for the
same 5-tuple? If the above theory is right, you should see at most 2
buckets hit. For LB, since it uses CT and only the first packet uses the
group, all packets of the same flow would always be forwarded to the same LB
backend. I guess if you wait long enough between the tests, you should see
all tests hitting the same backend. It would be great if you could confirm this.
>> >
>> > For ECMP, this behavior will cause occasional out-of-order packets
even for a single flow (for a burst of packets after some idle time),
because CT is not used (and we can't use it, because when peered with
physical ECMP router groups we can't ensure the return traffic from the
physical routers hits the same LR).
>> >
>> > For LB it causes the unexpected behavior that is reported in this
thread.
>> >
>> > For the fix, I think we should figure out how to make sure dp_hash
always uses the same hash algorithm in both the userspace and kernel
implementations, if possible.
>> > I am ok with the patch from Numan that adds the capability of configuring
the desired hash method instead of always using the default. However, using
"hash" may be a huge performance sacrifice since the packets are always
handled in the slowpath, especially for ECMP. Even though LB uses CT, for
short-lived flows this is still a big performance penalty (for long-lived
LB flows it may be ok since the majority of packets are still handled in the
fastpath).
>> >
>> > I am not familiar with the dp_hash implementation. I will study it some
more, but any ideas on how to ensure the consistency of dp_hash are highly
appreciated!
>>
>> I had an impression that the packet hash is routed from the datapath to
userspace
>> and back, but it turned out that it's a really recent change.  It seems the
following
>> changes are required:
>>
>> 1. Linux kernel: bd1903b7c459 ("net: openvswitch: add hash info to
upcall")
>>    This is available starting from upstream kernel v5.5.
>>
>> 2. OVS: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute back to
datapath.")
>>    This is available on branch-2.13.
>>
>> With the above two patches the first and subsequent packets should have the same
dp_hash
>> calculated by the kernel datapath.
>>

Thanks Ilya for this information! Do you know why the kernel fix (1) was
not backported to the OVS repo? Was it missed, or left out on purpose? cc Tonghao Zhang.
I ran a test using ovn-fake-multinode with OVS master, and the problem
disappeared. However, I am using an old kernel module and I don't think
patch (1) is there.
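
For anyone following along, here is my rough reading of those two commits,
sketched as toy code (none of the names below are real kernel or OVS
symbols): the kernel includes the skb hash in the upcall, and userspace
echoes the same value back instead of recomputing its own, so the slowpath
packet picks its bucket from the same dp_hash that the fastpath will use:

/* Illustrative sketch only -- not real kernel/OVS code or APIs. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_upcall {
    uint32_t skb_hash;      /* hash as computed by the kernel datapath    */
    bool     has_skb_hash;  /* only present with a new enough kernel (1)  */
};

/* Stand-in for a userspace-computed hash that may differ from the kernel's. */
static uint32_t
toy_userspace_hash(void)
{
    return 0xdeadbeef;
}

/* Hash used to pick the select-group bucket for the slowpath packet. */
static uint32_t
toy_pick_dp_hash(const struct toy_upcall *upcall)
{
    if (upcall->has_skb_hash) {
        /* With (1) + (2): reuse the kernel's hash, so the bucket chosen in
         * the slowpath matches the one the fastpath will choose later. */
        return upcall->skb_hash;
    }
    /* Without them: recompute in userspace; the value (and therefore the
     * selected bucket) can differ from what the kernel uses afterwards. */
    return toy_userspace_hash();
}

int
main(void)
{
    struct toy_upcall old_kernel = { 0x00000000, false };
    struct toy_upcall new_kernel = { 0x1234abcd, true };

    printf("old kernel:   dp_hash %#x\n", toy_pick_dp_hash(&old_kernel));
    printf("fixed stack:  dp_hash %#x\n", toy_pick_dp_hash(&new_kernel));
    return 0;
}

With that round trip in place, the "first packet hits bucket1, later packets
hit bucket2" behavior should disappear.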

>>
>> >
>> > Thanks,
>> > Han
>> >
>> > On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez <
[email protected]> wrote:
>> >>
>> >> Thanks Numan for the investigation and the great explanation!
>> >>
>> >> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <[email protected]> wrote:
>> >>>
>> >>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <[email protected]> wrote:
>> >>> >
>> >>> >
>> >>> >
>> >>> > > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk <[email protected]> wrote:
>> >>> > >
>> >>> > > Hello!
>> >>> > >
>> >>> > > I would like to ask you to clarify how the OVN Load balancing
algorithm works.
>> >>> > >
>> >>> > > Based on the action [1]:
>> >>> > > 1) If the connection is alive, the same 'backend' will be chosen,
>> >>> > >
>> >>> > > 2) If it is a new connection, the backend will be chosen based on
selection_method=dp_hash [2].
>> >>> > > Based on the changelog, dp_hash uses a '5 tuple hash' [3].
>> >>> > > The hash is calculated based on these values: source and destination
IP, source port, protocol and an arbitrary value, 42. [4]
>> >>> > > Based on that information we could name it SOURCE_IP_PORT.
>> >>> > >
>> >>> > > Unfortunately we recently got a bug report in the OVN Octavia
provider driver project that load balancing in OVN
>> >>> > > works differently [5]. The report shows that even when the test uses
the same source IP and port, but a new TCP connection,
>> >>> > > traffic is randomly distributed, though based on [2] it shouldn't be.
>> >>> > >
>> >>> > > Is it a bug?  Is something else taken into account while calculating
the hash? Can it be fixed in OVS/OVN?
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > > Thanks,
>> >>> > > Maciej
>> >>> > >
>> >>> > >
>> >>> > > [1]
https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017
>> >>> > > [2]
https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059
>> >>> > > [3]
https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371
>> >>> > > [4]
https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217
>> >>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239
>> >>> > >
>> >>> > > --
>> >>> > > Best regards,
>> >>> > > Maciej Józefczyk
>> >>> >
>> >>> > Hi Maciej,
>> >>> >
>> >>> > Thanks for reporting. It is definitely strange that the same 5-tuple
flow resulted in hitting different backends. I didn't observe such
behavior before (maybe I should try again myself to confirm). Can you make
sure the group buckets didn't change during the testing? You can do so by:
>> >>> > # ovs-ofctl dump-groups br-int
>> >>> > and also check the group stats to see if multiple buckets have their
counters increased during the test:
>> >>> > # ovs-ofctl dump-group-stats br-int [group]
>> >>> >
>> >>> > The 5-tuple hash function you are seeing is flow_hash_5tuple(),
and it uses all five fields of the 5-tuple. It adds both ports (src and dst) at once:
>> >>> >         /* Add both ports at once. */
>> >>> >         hash = hash_add(hash,
>> >>> >                         ((const uint32_t *)flow)[offsetof(struct flow, tp_src)
>> >>> >                                                  / sizeof(uint32_t)]);
>> >>> >
>> >>> > The tp_src is the starting offset, and the load is 32 bits wide, covering
both the src and dst ports, 16 bits each. (Although I am not sure whether the dp_hash
method uses this function or not; I need to check the code some more.)
>> >>> >
>> >>> > BTW, I am not sure why Neutron gave it the name SOURCE_IP_PORT.
Shouldn't it be called just 5-TUPLE, since the protocol, destination IP and port
are also considered in the hash?
>> >>> >
>> >>>
>> >>>
>> >>> Hi Maciej and Han,
>> >>>
>> >>> I did some testing and I can confirm what you're saying. OVN is not
>> >>> choosing the same backend with the src IP and src port fixed.
>> >>>
>> >>> I think there is an issue with how OVN is programming the group
>> >>> flows.  OVN is setting the selection_method to dp_hash.
>> >>> But when ovs-vswitchd receives the GROUP_MOD OpenFlow message, I
>> >>> noticed that the selection_method is not set.
>> >>> From the code I see that selection_method will be encoded only if
>> >>> ovn-controller uses OpenFlow version 1.5 [1].
>> >>>
>> >>> Since selection_method is NULL, vswitchd uses the dp_hash method [2].
>> >>> dp_hash means it uses the hash calculated by
>> >>> the datapath. In the case of the kernel datapath, from what I understand,
>> >>> it uses skb_get_hash().
>> >>>
>> >>> I modified the vswitchd code to use the selection method "hash" if
>> >>> selection_method is not set. In this case the load balancer
>> >>> works as expected. For a fixed src IP, src port, dst IP and dst port,
>> >>> the group action always selects the same bucket. [3]
>> >>>
>> >>> I think we need to fix a few issues in OVN:
>> >>>   - Use OpenFlow 1.5 so that OVN can set selection_method.
>> >>>   - Use the "hash" method if dp_hash is not choosing the same bucket for
>> >>> the same 5-tuple.
>> >>>   - Maybe provide an option for the CMS to choose the algorithm, i.e.
>> >>> to use dp_hash or hash.
>> >>>
>> >> I'd rather not expose this to the CMS as it depends on the datapath
implementation as per [0], but maybe it makes sense to eventually abstract
it for the CMS in a more LB-ish way (common algorithm names used in load
balancing) in case the LB feature is at some point enhanced to
support more algorithms.
>> >>
>> >> I believe that for OVN LB users, using OF 1.5 to force the use of
'hash' would be the best solution now.
>> >>
>> >> My 2 cents as I'm not an LB expert.
>> >>
>> >> I also recall that we tested this in the past and it seemed to be
working. I have been checking further in the doc [0] and found this
paragraph:
>> >>
>> >> "If no selection method is specified, Open vSwitch up to release 2.9
applies the hash method with default fields. From 2.10 onwards Open vSwitch
defaults to the dp_hash method with symmetric L3/L4 hash algorithm, unless
the weighted group buckets cannot be mapped to a maximum of 64 dp_hash
values with sufficient accuracy. In those rare cases Open vSwitch 2.10 and
later fall back to the hash method with the default set of hash fields."
>> >>
>> >> The explanation seems to be that when we tested the feature we relied
on OVS 2.9 and hence the confusion.
>> >>
>> >> Thanks a lot again!
>> >> Daniel
>> >>
>> >> [0] http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html
>> >>
>> >>>
>> >>> I'll look into how to support this.
>> >>>
>> >>> [1] -
https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120
>> >>>
https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082
>> >>>
>> >>> [2] -
https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108
>> >>> [3] -
https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553
>> >>>
>> >>>
>> >>> Thanks
>> >>> Numan
>> >>>
>> >>>
>> >>> > Thanks,
>> >>> > Han
>>
>
>
>
> --
> Best regards,
> Maciej Józefczyk
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
