On 7/7/22 18:21, Han Zhou wrote:
> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara <[email protected]> wrote:
>>
>> On 7/7/22 13:45, Dumitru Ceara wrote:
>>> On 7/7/22 00:08, Han Zhou wrote:
>>>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <[email protected]> wrote:
>>>>>
>>>>> Hi Han,
>>>>>
>>>>> On 7/6/22 00:41, Han Zhou wrote:
>>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port to
>>>>>> registers is causing a critical dataplane performance impact to
>>>>>> short-lived connections, because it unwildcards megaflows with exact
>>>>>> match on dst IP and L4 ports. Any new connections with a different
>>>>>> client side L4 port will encounter datapath flow miss and upcall to
>>>>>> ovs-vswitchd.
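>>>>>>
>>>>>> For illustration only (a rough sketch; the table number, register
>>>>>> assignments and addresses below are made up and may not match the
>>>>>> exact northd output), the flow in question is roughly:
>>>>>>
>>>>>>   table=6 (ls_in_pre_stateful), priority=120,
>>>>>>     match=(reg0[2] == 1 && ip4 && tcp),
>>>>>>     action=(reg1 = ip4.dst; reg2[0..15] = tcp.dst; ct_lb;)
>>>>>>
>>>>>> Because the actions read ip4.dst and tcp.dst, the generated megaflow
>>>>>> for server -> client return traffic ends up exact-matching on the
>>>>>> client's ephemeral port, e.g. something like (hypothetical):
>>>>>>
>>>>>>   ipv4(dst=10.244.1.3,proto=6,...), tcp(dst=45678), actions:...
>>>>>>
>>>>>> and the next connection, with tcp dst 45679, misses again.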
>>>>>>
>>>>>> These fields (dst IP and port) were saved to registers to solve a
>>>>>> problem of the LB hairpin use case when different VIPs are sharing
>>>>>> overlapping backend+port [0]. The change [0] might not have had as
>>>>>> wide a performance impact at the time, because back then one of the
>>>>>> match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only for
>>>>>> established and natted traffic, while now the impact is more obvious
>>>>>> because REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is
>>>>>> configured on the LS) since commit [1], after several other indirectly
>>>>>> related optimizations and refactors.
>>>>>>
>>>>>> Since the changes that introduced the performance problem had their
>>>>>> own value (they fix problems or optimize performance), we don't want
>>>>>> to revert any of them (and it is also not straightforward to revert
>>>>>> any of them because there have been lots of changes and refactors on
>>>>>> top of them).
>>>>>>
>>>>>> Change [0] itself added an alternative way to solve the overlapping
>>>>>> backends problem, which utilizes ct fields instead of saving dst IP
>>>>>> and port to registers. This patch switches to that approach and
>>>>>> removes the flows/actions that save the dst IP and port, to avoid the
>>>>>> dataplane performance problem for short-lived connections.
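>>>>>>
>>>>>> As a rough sketch (the exact hairpin flows installed by ovn-controller
>>>>>> differ), the idea is to match on OVS' conntrack original-direction
>>>>>> tuple fields instead of on reg1/reg2, e.g.:
>>>>>>
>>>>>>   ct_state=+trk+dnat, ct_nw_proto=6, ct_nw_dst=<orig VIP>,
>>>>>>   ct_tp_dst=<orig VIP port>, ...
>>>>>>
>>>>>> so the original VIP and port are recovered from conntrack rather than
>>>>>> carried in registers.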
>>>>>>
>>>>>> (With this approach, the ct_state DNAT is not HW offload friendly, so
>>>>>> it may result in those flows not being offloaded, which is supposed
>>>>>> to be solved in a follow-up patch.)
>>>>>>
>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
>>>>>>     shared backends.")
>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
>>>>>>     traffic.")
>>>>>>
>>>>>> Signed-off-by: Han Zhou <[email protected]>
>>>>>> ---
>>>>>
>>>>> I think the main concern I have is that this forces us to choose
>>>>> between:
>>>>> a. non HWOL-friendly flows (reduced performance)
>>>>> b. less functionality (with the knob in patch 3/3 set to false).
>>>>>
>>>> Thanks Dumitru for the comments! I agree the solution is not ideal, but
>>>> if we look at it from a different angle, even with a), for most
>>>> pod->service traffic the performance is still much better than it is
>>>> today (a not-offloaded kernel datapath is still much better than the
>>>> userspace slowpath). And *hopefully* b) is ok for most use cases to get
>>>> HW-offload capability.
>>>>
>>
>> Just a note on this item.  I'm a bit confused about why all traffic
>> would be slowpath-ed?  It's just the first packet that goes to vswitchd
>> as an upcall, right?
>>
> 
> It is about all traffic for *short-lived* connections. Any client ->
> service traffic with the pattern:
> 1. TCP connection setup
> 2. Send API request, receive response
> 3. Close TCP connection
> It can be tested with netperf TCP_CRR. The client side TCP port is
> different every time, and since the server -> client DP flow includes the
> client TCP port, each such transaction incurs at least one DP flow miss
> and an upcall to userspace. The application latency for such transactions
> becomes very high. In addition, it causes OVS handler CPU usage to spike,
> which further impacts the dataplane performance of the system.
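>
> For example, something along these lines reproduces the pattern (the VIP
> and option values are made up, just for illustration):
>
>   netperf -H 172.30.0.10 -t TCP_CRR -l 30 -- -r 64,1024
>
> Each request/response transaction opens and closes a new TCP connection
> from a new client port, so each one can trigger a fresh upcall.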
> 
>> Once the megaflow (even if it's more specific than ideal) is installed
>> all following traffic in that session should be forwarded in fast path
>> (kernel).
>>
>> Also, I'm not sure I follow why the same behavior wouldn't happen with
>> your changes too for pod->service.  The datapath flow includes the
>> dp_hash() match, and that's likely different for different connections.
>>
> 
> With the change it is not going to happen, because the match is on the
> server side port only.
> For dp_hash(), from what I remember, there are at most as many megaflows
> as there are hash buckets (the masked hash value).
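>
> For example, a megaflow installed for the dp_hash-based group looks
> roughly like this (values made up, format simplified):
>
>   dp_hash(0x5/0xf), ipv4(dst=172.30.0.10,proto=6,...), tcp(dst=80), ...
>
> so with 16 buckets there can be at most 16 such megaflows per VIP,
> regardless of the client port.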
> 

OK, that explains it, thanks.

>> Or am I missing something?
>>
>>>>> Change [0] was added to address the case when a service in kubernetes
>>>>> is exposed via two different k8s service objects that share the same
>>>>> endpoints.  That translates in ovn-k8s to two different OVN load
>>>>> balancer VIPs that share the same backends.  For such cases, if the
>>>>> service is being accessed by one of its own backends we need to be
>>>>> able to differentiate based on the VIP address it used to connect to.
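>>>>>
>>>>> For illustration (names and addresses made up), such a configuration
>>>>> corresponds to something like:
>>>>>
>>>>>   # two services backed by the same pod 10.244.0.5
>>>>>   ovn-nbctl lb-add svc-a 172.30.0.10:80 10.244.0.5:8080 tcp
>>>>>   ovn-nbctl lb-add svc-b 172.30.0.11:80 10.244.0.5:8080 tcp
>>>>>
>>>>> When the pod 10.244.0.5 itself connects to either VIP (hairpin), the
>>>>> reply must be un-DNATted back to the VIP that was actually used, so
>>>>> the two cases have to be told apart.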
>>>>>
>>>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side
>>>>> on how common it is that an OVN-networked pod accesses two (or more)
>>>>> services that might have the pod itself as a backend.
>>>>>
>>>>
>>>> Yes, we definitely need input from the ovn-k8s side. The information we
>>>> got so far: the change [0] was to fix a bug [2] reported by Tim.
>>>> However, the bug description didn't mention anything about two VIPs
>>>> sharing the same backend. Tim also mentioned in the ovn-k8s meeting
>>>> last week that the original user bug report for [2] was [3], and [3]
>>>> was in fact a completely different problem (although it is related to
>>>> hairpin, too). So, I am under
>>>
>>> I am not completely sure about the link between [3] and [2], maybe Tim
>>> remembers more.
>>>
>>>> the impression that "an OVN-networked pod accesses two (or more)
>>>> services that might have the pod itself as a backend" might be a very
>>>> rare use case, if it exists at all.
>>>>
>>>
>>> I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob
>>> to "false" and pushed it to my fork to run the ovn-kubernetes CI.  This
>>> runs a subset of the kubernetes conformance tests (AFAICT) and some
>>> specific e2e ovn-kubernetes tests.
>>>
>>> The results are here:
>>>
>>> https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true
>>>
>>> Focusing on the conformance failures:
>>>
>>> 2022-07-07T10:31:24.7228157Z [Fail] [sig-network] Networking Granular
>>> Checks: Services [It] should function for endpoint-Service: http
>>> 2022-07-07T10:31:24.7228940Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
>>> ...
>>> 2022-07-07T10:31:24.7240313Z [Fail] [sig-network] Networking Granular
>>> Checks: Services [It] should function for multiple endpoint-Services
>>> with same selector
>>> 2022-07-07T10:31:24.7240819Z vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
>>> ...
>>>
>>> Checking how these tests are defined:
>>>
>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283
>>> https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236
>>>
> Thanks for the test and information! Really need input from k8s folks to
> understand more.
> 
> Thanks,
> Han
>

Like I said below, these are kubernetes conformance tests so I'll let
k8s folks confirm if such failures can be ignored/worked around.

>>> It seems to me that they're testing explicitly for a  "pod that accesses
>>> two services that might have the pod itself as a backend".
>>>
>>> So, if I'm not wrong, we'd become non-compliant in this case.
>>>
>>>>> If this turns out to be mandatory I guess we might want to also look
>>>>> into alternatives like:
>>>>> - getting help from the HW to offload matches like ct_tuple()
>>>>
>>>> I believe this is going to happen in the future. HWOL is continuously
>>>> enhanced.
>>>>
>>>
>>> That would make things simpler.
>>>
>>>>> - limiting the impact of "a." only to some load balancers (e.g., would
>>>>>   it help to use different hairpin lookup tables for such load
>>>>>   balancers?)
>>>>
>>>> I am not sure if this would work, and not sure if this is a good
>>>> approach, either. In general, I believe it is possible to solve the
>>>> problem with more complex pipelines, but we need to keep in mind it is
>>>> quite easy to introduce other performance problems (either control
>>>> plane or data plane) - many of the changes leading to the current
>>>> implementation were performance optimizations, some for control plane,
>>>> some for HWOL, and some for reducing recirculations. I'd avoid
>>>> complexity unless it is really necessary. Let's get more input for the
>>>> problem, and based on that we can decide if we want to move to a more
>>>> complex solution.
>>>>
>>>
>>> Sure, I agree we need to find the best solution.
>>>
>>> In my opinion OVN should be HW-agnostic.  We did try to adjust the way
>>> OVN generates OpenFlow flows in order to make them more "HWOL-friendly",
>>> but that shouldn't come at the cost of breaking CMS features (if that's
>>> the case here).
>>>
>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599
>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651
>>>>
>>>> Thanks,
>>>> Han
>>>>
>>>
>>> Thanks,
>>> Dumitru
>>
> 
