On 7/7/22 18:21, Han Zhou wrote: > On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara <[email protected]> wrote: >> >> On 7/7/22 13:45, Dumitru Ceara wrote: >>> On 7/7/22 00:08, Han Zhou wrote: >>>> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara <[email protected]> wrote: >>>>> >>>>> Hi Han, >>>>> >>>>> On 7/6/22 00:41, Han Zhou wrote: >>>>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port > to >>>>>> registers is causing a critical dataplane performance impact to >>>>>> short-lived connections, because it unwildcards megaflows with exact >>>>>> match on dst IP and L4 ports. Any new connections with a different >>>>>> client side L4 port will encounter datapath flow miss and upcall to >>>>>> ovs-vswitchd. >>>>>> >>>>>> These fields (dst IP and port) were saved to registers to solve a >>>>>> problem of LB hairpin use case when different VIPs are sharing >>>>>> overlapping backend+port [0]. The change [0] might not have as wide >>>>>> performance impact as it is now because at that time one of the match >>>>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established > and >>>>>> natted traffic, while now the impact is more obvious because >>>>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP >>>>>> configured on the LS) since commit [1], after several other > indirectly >>>>>> related optimizations and refactors. >>>>>> >>>>>> Since the changes that introduced the performance problem had their >>>>>> own values (fixes problems or optimizes performance), so we don't > want >>>>>> to revert any of the changes (and it is also not straightforward to >>>>>> revert any of them because there have been lots of changes and > refactors >>>>>> on top of them). >>>>>> >>>>>> Change [0] itself has added an alternative way to solve the > overlapping >>>>>> backends problem, which utilizes ct fields instead of saving dst IP > and >>>>>> port to registers. This patch forces to that approach and removes the >>>>>> flows/actions that saves the dst IP and port to avoid the dataplane >>>>>> performance problem for short-lived connections. >>>>>> >>>>>> (With this approach, the ct_state DNAT is not HW offload friendly, > so it >>>>>> may result in those flows not being offloaded, which is supposed to > be >>>>>> solved in a follow-up patch) >>>>>> >>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with > shared >>>> backends.") >>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer > traffic.") >>>>>> >>>>>> Signed-off-by: Han Zhou <[email protected]> >>>>>> --- >>>>> >>>>> I think the main concern I have is that this forces us to choose > between: >>>>> a. non hwol friendly flows (reduced performance) >>>>> b. less functionality (with the knob in patch 3/3 set to false). >>>>> >>>> Thanks Dumitru for the comments! I agree the solution is not ideal, > but if >>>> we look at it from a different angle, even with a), for most > pod->service >>>> traffic the performance is still much better than how it is today (not >>>> offloaded kernel datapath is still much better than userspace > slowpath). >>>> And *hopefully* b) is ok for most use cases to get HW-offload > capability. >>>> >> >> Just a note on this item. I'm a bit confused about why all traffic >> would be slowpath-ed? It's just the first packet that goes to vswitchd >> as an upcall, right? >> > > It is about all traffic for *short-lived* connections. Any clients -> > service traffic with the pattern: > 1. TCP connection setup > 2. Set API request, receives response > 3. Close TCP connection > It can be tested with netperf TCP_CRR. Every time the client side TCP port > is different, but since the server -> client DP flow includes the client > TCP port, for each such transaction there is going to be at least a DP flow > miss and goes to userspace. Such application latency would be very high. In > addition, it causes the OVS handler CPU spikes very high which would > further impact the dataplane performance of the system. > >> Once the megaflow (even if it's more specific than ideal) is installed >> all following traffic in that session should be forwarded in fast path >> (kernel). >> >> Also, I'm not sure I follow why the same behavior wouldn't happen with >> your changes too for pod->service. The datapath flow includes the >> dp_hash() match, and that's likely different for different connections. >> > > With the change it is not going to happen, because the match is for server > side port only. > For dp_hash(), for what I remembered, there are as many as the number of > megaflows as the number of buckets (the masked hash value) at most. >
OK, that explains it, thanks. >> Or am I missing something? >> >>>>> Change [0] was added to address the case when a service in kubernetes > is >>>>> exposed via two different k8s services objects that share the same >>>>> endpoints. That translates in ovn-k8s to two different OVN load >>>>> balancer VIPs that share the same backends. For such cases, if the >>>>> service is being accessed by one of its own backends we need to be > able >>>>> to differentiate based on the VIP address it used to connect to. >>>>> >>>>> CC: Tim Rozet, Dan Williams for some more input from the ovn-k8s side > on >>>>> how common it is that an OVN-networked pod accesses two (or more) >>>>> services that might have the pod itself as a backend. >>>>> >>>> >>>> Yes, we definitely need input from ovn-k8s side. The information we > got so >>>> far: the change [0] was to fix a bug [2] reported by Tim. However, the > bug >>>> description didn't mention anything about two VIPs sharing the same >>>> backend. Tim also mentioned in the ovn-k8s meeting last week that the >>>> original user bug report for [2] was [3], and [3] was in fact a > completely >>>> different problem (although it is related to hairpin, too). So, I am > under >>> >>> I am not completely sure about the link between [3] and [2], maybe Tim >>> remembers more. >>> >>>> the impression that "an OVN-networked pod accesses two (or more) > services >>>> that might have the pod itself as a backend" might be a very rare use > case, >>>> if it exists at all. >>>> >>> >>> I went ahead and set the new ovn-allow-vips-share-hairpin-backend knob >>> to "false" and pushed it to my fork to run the ovn-kubernetes CI. This >>> runs a subset of the kubernetes conformance tests (AFAICT) and some >>> specific e2e ovn-kubernetes tests. >>> >>> The results are here: >>> >>> https://github.com/dceara/ovn/runs/7230840427?check_suite_focus=true >>> >>> Focusing on the conformance failures: >>> >>> 2022-07-07T10:31:24.7228157Z [91m [1m[Fail] [0m [90m[sig-network] > Networking [0m [0mGranular Checks: Services [0m [91m [1m[It] should > function for endpoint-Service: http [0m >>> 2022-07-07T10:31:24.7228940Z [37mvendor/ > github.com/onsi/ginkgo/internal/leafnodes/runner.go:113 [0m >>> ... >>> 2022-07-07T10:31:24.7240313Z [91m [1m[Fail] [0m [90m[sig-network] > Networking [0m [0mGranular Checks: Services [0m [91m [1m[It] should > function for multiple endpoint-Services with same selector [0m >>> 2022-07-07T10:31:24.7240819Z [37mvendor/ > github.com/onsi/ginkgo/internal/leafnodes/runner.go:113 [0m >>> ... >>> >>> Checking how these tests are defined: >>> > https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L283 >>> > https://github.com/kubernetes/kubernetes/blob/2a017f94bcf8d04cbbbbdc6695bcf74273d630ed/test/e2e/network/networking.go#L236 >>> > Thanks for the test and information! Really need input from k8s folks to > understand more. > > Thanks, > Han > Like I said below, these are kubernetes conformance tests so I'll let k8s folks confirm if such failures can be ignored/worked around. >>> It seems to me that they're testing explicitly for a "pod that accesses >>> two services that might have the pod itself as a backend". >>> >>> So, if I'm not wrong, we'd become non-compliant in this case. >>> >>>>> If this turns out to be mandatory I guess we might want to also look >>>>> into alternatives like: >>>>> - getting help from the HW to offload matches like ct_tuple() >>>> >>>> I believe this is going to happen in the future. HWOL is continuously >>>> enhanced. >>>> >>> >>> That would make things simpler. >>> >>>>> - limiting the impact of "a." only to some load balancers (e.g., would >>>>> it help to use different hairpin lookup tables for such load > balancers?) >>>> >>>> I am not sure if this would work, and not sure if this is a good > approach, >>>> either. In general, I believe it is possible to solve the problem with > more >>>> complex pipelines, but we need to keep in mind it is quite easy to >>>> introduce other performance problems (either control plane or data > plane) - >>>> many of the changes lead to the current implementation were for > performance >>>> optimizations, some for control plane, some for HWOL, and some for > reducing >>>> recirculations. I'd avoid complexity unless it is really necessary. > Let's >>>> get more input for the problem, and based on that we can decide if we > want >>>> to move to a more complex solution. >>>> >>> >>> Sure, I agree we need to find the best solution. >>> >>> In my option OVN should be HW-agnostic. We did try to adjust the way >>> OVN generates openflows in order to make it more "HWOL-friendly" but >>> that shouldn't come with the cost of breaking CMS features (if that's >>> the case here). >>> >>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1931599 >>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1903651 >>>> >>>> Thanks, >>>> Han >>>> >>> >>> Thanks, >>> Dumitru >> > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
