On Tue, May 20, 2025 at 1:37 AM Han Zhou <zhou...@gmail.com> wrote:
> > > On Mon, May 19, 2025 at 8:41 AM Tim Rozet <tro...@redhat.com> wrote:
> > >
> > > Hi All,
> > > Sorry Han that you are still facing issues with this problem. It seems
> > > like the best option would be to configure the range of ephemeral ports
> > > for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
> > > service range is configured as 30000-32767, and the ephemeral port
> > > range in the kernel is set to:
> > >
> > > net.ipv4.ip_local_port_range = 32768 60999
> > >
> > > I think it would make sense to revert and add an option to make this
> > > range configurable in OVN. Then OVN-Kubernetes should pass along this
> > > config to OVN.
> >
> > As Dumitru mentioned, we can't just revert the previous patch without a
> > substitute fix in place, otherwise we would risk regressing.
>
> Thanks Tim! For the substitute fix, I assume you meant a fix in ovn-k8s,
> right?
>
> OVN already provides a configurable SNAT port range through the
> external_port_range column in the NAT table. See:
> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
>
>   external_port_range: string
>       L4 source port range
>
>       Range of ports, from which a port number will be picked that will
>       replace the source port of the to-be-NATed packet. This is
>       basically PAT (port address translation).
>
>       The value of the column is in the format port_lo-port_hi. For
>       example: external_port_range : "1-30000"
>
>       The valid range of ports is 1-65535.
>
> Does this satisfy the need you asked about?

Oh, I didn't realize this is already supported.

> If this is agreed, I think we can move forward with the steps below:
> 1. Update ovn-kubernetes to set external_port_range, if specific ranges
>    are configured. (This is only needed when the k8s-configured nodeport
>    range conflicts with the kernel ephemeral port range, and it seems
>    unnecessary for OpenShift?)

Yeah, we should provide a config knob to set the range though, in case ovnk
users do not use the default range.
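To make the split concrete, the knobs being discussed might look roughly like
this on a node. This is an illustrative sketch only: the NAT row selection
below is a placeholder, and the exact invocations should be double-checked
against ovn-nb(5) and ovn-nbctl(8) before relying on them.

```shell
# Keep the kernel ephemeral range clear of the 30000-32767 nodeport range
# (OpenShift already ships this split):
sysctl -w net.ipv4.ip_local_port_range="32768 60999"

# Constrain the source ports OVN picks when SNATing, via the NAT table's
# external_port_range column, e.g. on an existing SNAT entry (the "pick
# the first SNAT row" selection here is purely for illustration):
nat_uuid=$(ovn-nbctl --bare --columns=_uuid find NAT type=snat | head -n1)
ovn-nbctl set NAT "$nat_uuid" external_port_range="32768-60999"
```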
It looks like we already have a test case to cover this scenario as well:
https://github.com/openshift/ovn-kubernetes/pull/2120/commits/34b1a4be8cf9d4bf61d65464015c1b3cdb30adfd

I'll post a PR to do this part.

> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
>    towards VIPs" in OVN. Also revert some of the changes for
>    ct-commit-all: e.g. no need to commit DNATted connections to the SNAT
>    zone, and vice versa.

Sure, after 1 gets in, you can try to post a PR reverting the change and
let's see if it passes the above test.

> What do you think?
>
> Thanks,
> Han
>
> > Thanks,
> >
> > Tim Rozet
> > Red Hat OpenShift Networking Team
> >
> >
> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>
> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> >> > On 5/15/25 6:24 PM, Han Zhou wrote:
> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <tro...@redhat.com> wrote:
> >> >>>
> >> >>> Hi Han,
> >> >>> My understanding is that the unSNAT stage was originally being
> >> >>> skipped for LB ports, then Dumitru changed it to attempt unSNAT, and
> >> >>> this breaks HWOL for ingress connections that never need to be
> >> >>> unSNATed and are therefore not committed. In the OVNK use case of
> >> >>> pods simply connecting egress, almost all packets will be SNATed to
> >> >>> the node IP. Therefore, on the egress reply, we always enter the
> >> >>> unSNAT stage for these packets. So skipping the unSNAT stage seems
> >> >>> to be specific to LB-related traffic. While I agree splitting the
> >> >>> port range would work, I think it makes more sense to just always
> >> >>> commit in the unSNAT stage for all traffic. I get there is a
> >> >>> performance hit there, but the datapath pipeline seems more
> >> >>> consistent, and I think that outweighs the cost of committing LB
> >> >>> traffic.
> >> >>
> >> >> Hi Tim and folks,
> >> >>
> >> >
> >> > Hi Han,
> >> >
> >> >> + @Ales Musil <amu...@redhat.com>
> >> >>
> >> >> We've spent a lot of effort on the alternative solution: always
> >> >> commit to both SNAT and DNAT zones. Unfortunately, the HW offload
> >> >> issue discussed in this thread is still unresolved. I'd like to
> >> >> discuss the current situation and ask for opinions.
> >> >>
> >> >
> >> > Thanks again for working on this!
> >> >
> >> >> A while ago, Ales tried to fix it with this commit: 800fd0681579
> >> >> ("northd: Add LR option to commit all traffic."). When the
> >> >> ct-commit-all option is enabled, it commits the connection to both
> >> >> the SNAT and DNAT zones. However, in the ovn-k8s nodeport scenario
> >> >> (a very typical use case) the HW offload is still broken. In that
> >> >> scenario, a packet may come from the external network with the dest
> >> >> IP being a node's IP, which is also used as an OVN gateway router
> >> >> IP. The dest IP is first DNATed to a LB backend IP, and then the src
> >> >> IP is SNATed to another LRP's IP on this router. The src/dst IPs in
> >> >> the different stages are:
> >> >>
> >> >> Original:   EXT IP <-> GR IP1
> >> >> After DNAT: EXT IP <-> BE IP
> >> >> After SNAT: GR IP2 <-> BE IP
> >> >>
> >> >> With the ct-commit-all option, it commits after DNAT and after SNAT:
> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >>
> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone with the
> >> >> original header, which will still always be in "new" state, which
> >> >> breaks HW offload.
> >> >>
> >> >
> >> > Ah, now I understand the problem you were facing.
> >> >
> >> >> I tried to fix it by also committing to the SNAT zone before DNAT
> >> >> (at the POST_SNAT stage).
> >> >> Now there are 3 entries committed:
> >> >> SNAT zone: EXT IP <-> GR IP1
> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >>
> >> >> HW offload worked, but it breaks at least the scenario of "DNAT and
> >> >> SNAT on distributed router - E/W", which tests communication between
> >> >> two workloads using "floating IPs". In this scenario, both workloads
> >> >> are under different LSes behind the same router, and both have their
> >> >> own floating IPs (dnat_and_snat) on the public side of the LR. When
> >> >> one of them sends packets to the other using the floating IP, what's
> >> >> expected is:
> >> >>
> >> >> Original:   overlay IP1 <-> floating IP2
> >> >> After SNAT: floating IP1 <-> floating IP2
> >> >> After DNAT: floating IP1 <-> overlay IP2
> >> >>
> >> >> With my fix, the packet is first SNATed with a CT entry in the SNAT
> >> >> zone:
> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
> >> >>
> >> >> This happens in the egress pipeline of the LR, and then it enters
> >> >> the ingress pipeline again because this is like a hairpin (the dest
> >> >> IP is a dnat_and_snat IP on the same router). It will hit the UNSNAT
> >> >> stage first and traverse the CT zone; for the first packet the reply
> >> >> has not been seen yet, so the CT state is still new, so in
> >> >> POST_UNSNAT we will try to commit it (because of the issue I am
> >> >> trying to fix above). Because the connection floating IP1 <->
> >> >> floating IP2 conflicts with the above committed entry "SNAT: overlay
> >> >> IP1/floating IP1 <-> floating IP2", the CT action fails and the
> >> >> packet is dropped.
> >> >>
> >> >> There is still a way to solve this, probably by skipping the
> >> >> unnecessary UNSNAT traversal and the conflicting commit if the
> >> >> source IP is one of the snat/dnat_and_snat IPs of this same router.
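To make the failure mode concrete, here is a toy model of the conflicting
commit described above. This is plain Python, not OVN code; the tuple names
(overlay1, float1, float2) are illustrative placeholders, and a CT zone is
modeled as a set indexing both directions of each committed connection.

```python
def try_commit(zone, orig, reply):
    """Return True if the commit succeeds, False on a tuple collision."""
    if orig in zone or reply in zone:
        return False
    zone.update([orig, reply])
    return True

OVL1, FIP1, FIP2 = "overlay1", "float1", "float2"
snat_zone = set()

# Egress pipeline: the SNATed connection overlay1 -> float1 is committed
# first.  Orig tuple: overlay1 -> float2, reply tuple: float2 -> float1.
assert try_commit(snat_zone, (OVL1, FIP2), (FIP2, FIP1))

# Hairpin back into the ingress pipeline: POST_UNSNAT tries to commit the
# still-untranslated connection float1 -> float2, whose reply tuple
# float2 -> float1 collides with the entry above, so the CT action fails
# and the first packet is dropped.
assert not try_commit(snat_zone, (FIP1, FIP2), (FIP2, FIP1))
```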
> >> >>
> >> >> However, as we see, the NAT-related pipelines are already very
> >> >> complex today, and with the above changes they would become even
> >> >> more complex. I wonder if this is maintainable in the long run.
> >> >>
> >> >> In addition, I am not even sure this is the last problem with my
> >> >> patch. It is not easy to tell, because today the ct-commit-all
> >> >> feature is an option and is disabled by default. So all our test
> >> >> cases by default don't test this option, except the 4 test cases
> >> >> added by Ales's patch. I had to manually set this option to true as
> >> >> the default, and of course many test cases fail. Most of them fail
> >> >> because the expected flows don't match, but I can't tell for sure
> >> >> without updating every case with the new flows and retesting them.
> >> >> Before doing that, I just examined the ones I thought might be
> >> >> impacted by the patch, and "DNAT and SNAT on distributed router -
> >> >> E/W" is the first broken one I found. So there can be other issues
> >> >> not discovered yet.
> >> >>
> >> >> Moreover, we should also keep in mind the performance penalty of
> >> >> these extra CT commits & traversals.
> >> >>
> >> >
> >> > But basically this happens because we use the same SNAT zone for all
> >> > router ports.
> >> >
> >> > In an ideal world, wouldn't using a unique CT SNAT and CT DNAT zone
> >> > per router port fix it? E.g.:
> >> >
> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE: 12) [LR] LRP2 (SNAT ZONE:
> >> > 21, DNAT ZONE: 22) ---- host2
> >> >
> >> > That'd also imply that we commit that session (from host1 to host2 -
> >> > potentially via LB and SNAT) to 4 different zones (SNAT and DNAT for
> >> > LRP1 and SNAT and DNAT for LRP2). And we'd also have to do 4
> >> > different CT lookups for all packets traversing the router.
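The per-router-port zone idea can be sketched in the same toy-model style: if
each zone a packet traverses also gets a commit for the header it sees, no
lookup returns "new" mid-session. Zone numbers follow the example above
(LRP1: SNAT 11 / DNAT 12, LRP2: SNAT 21 / DNAT 22); the (src, dst) tuples are
placeholders, not OVN identifiers, and this is only a model of the intent,
not of the real pipeline.

```python
# One set of committed (src, dst) tuples per CT zone.
zones = {z: set() for z in (11, 12, 21, 22)}

def commit(zone, tup):
    zones[zone].add(tup)

def established(zone, tup):
    return tup in zones[zone]

# The header as each zone would see it while the first packet of a
# host1 -> host2 session (via an LB VIP, then SNAT) traverses the router:
first_packet = {
    11: ("h1", "vip"),  # LRP1 SNAT zone: original header
    12: ("h1", "vip"),  # LRP1 DNAT zone: DNATs vip -> h2
    21: ("h1", "h2"),   # LRP2 SNAT zone: SNATs h1 -> gr
    22: ("gr", "h2"),   # LRP2 DNAT zone: post-NAT header
}
for zone, tup in first_packet.items():
    commit(zone, tup)

# Every subsequent packet of the session finds a committed entry in all
# 4 zones, so no traversal ever sees "new" state again:
assert all(established(z, t) for z, t in first_packet.items())
```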
> >> >
> >> > I didn't try any of this, but I have the impression that it would
> >> > solve the HWOL issue: all subsequent packets are part of known and
> >> > committed sessions in all 4 conntrack zones.
> >> >
> >> > So what if we change the implementation of the "commit-all" feature
> >> > to add the missing ct commits? Wouldn't that work?
> >> >
> >> > It feels like (lookup performance hit aside) this is the correct way
> >> > to forward traffic, i.e., never traverse a CT zone without having a
> >> > committed entry in that zone.
> >> >
> >> >> I'd like to hear your opinions before moving forward. Is this worth
> >> >> the effort, or shall we just keep the pipeline (relatively) simple
> >> >> and avoid the original problem
> >> >> (https://issues.redhat.com/browse/FDP-291) by separating the port
> >> >> ranges of SNAT and DNAT, the first proposal at the beginning of
> >> >> this thread?
> >> >>
> >> >
> >> > Now that I understand the problem you're describing better, it might
> >> > be worth reverting the fix for FDP-291 like you said, until we find
> >> > a proper generic solution.
> >> >
> >> > However, we can't afford to do that without separating the port
> >> > ranges of SNAT and DNAT like you said [0], because we'd be
> >> > regressing (at least from an OpenShift perspective).
> >> >
> >>
> >> For reference, Tim, this is the OpenShift bug that needed the fix for
> >> FDP-291:
> >>
> >> https://issues.redhat.com/browse/OCPBUGS-25889
> >>
> >> > Because selecting the nodeport (DNAT) range is done through a k8s
> >> > API server cmdline argument, users can in theory choose whatever
> >> > range they want. I guess we'd need a command line argument in
> >> > ovn-kubernetes to allow users (operators, etc.) to inform
> >> > ovn-kubernetes what SNAT port range to use so that it doesn't
> >> > conflict with the KAPI nodeport range.
> >> >
> >> > Tim, you're the expert on this side, what are your thoughts on the
> >> > matter?
> >> >
> >> > Regards,
> >> > Dumitru
> >> >
> >> > [0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
> >> >
> >> >> Thanks,
> >> >> Han
> >> >>
> >> >>>
> >> >>> Thanks,
> >> >>> Tim Rozet
> >> >>> Red Hat OpenShift Networking Team
> >> >>>
> >> >>>
> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >> >>>>
> >> >>>> On 8/29/24 18:14, Han Zhou wrote:
> >> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >> >>>>>>
> >> >>>>>> On 3/5/24 15:56, Numan Siddique wrote:
> >> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Otherwise, in case there's also a SNAT rule that uses the VIP
> >> >>>>>>>> as an external IP, we break sessions initiated from behind the
> >> >>>>>>>> VIP.
> >> >>>>>>>>
> >> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat
> >> >>>>>>>> flows for load balancer vips in router ingress pipeline").
> >> >>>>>>>> That's OK because commit 384a7c6237da ("northd: Refactor
> >> >>>>>>>> Logical Flows for routers with DNAT/Load Balancers") addressed
> >> >>>>>>>> the original issue in a better way:
> >> >>>>>>>>
> >> >>>>>>>>     In the reply direction, the order of traversal of the
> >> >>>>>>>>     tables "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat"
> >> >>>>>>>>     adds incorrect datapath flows that check ct_state in the
> >> >>>>>>>>     wrong conntrack zone. This is illustrated below where
> >> >>>>>>>>     reply traffic enters the physical host port (6) and
> >> >>>>>>>>     traverses the DNAT zone (14), the SNAT zone (default),
> >> >>>>>>>>     back to the DNAT zone, and then on to the Logical Switch
> >> >>>>>>>>     Port zone (22). The third flow is incorrectly checking the
> >> >>>>>>>>     state from the SNAT zone instead of the DNAT zone.
> >> >>>>>>>>
> >> >>>>>>>> We also add a system test to ensure traffic initiated from
> >> >>>>>>>> behind a VIP + SNAT is not broken.
> >> >>>>>>>>
> >> >>>>>>>> Another nice side effect is that the northd I-P is slightly
> >> >>>>>>>> simplified because we don't need to track NAT external IPs
> >> >>>>>>>> anymore.
> >> >>>>>>>>
> >> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
> >> >>>>>>>> Reported-at: https://issues.redhat.com/browse/FDP-291
> >> >>>>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Thanks for the fix. It also simplified the lr-nat-stateful
> >> >>>>>>> code.
> >> >>>>>>>
> >> >>>>>>> Acked-by: Numan Siddique <num...@ovn.org>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> Thanks, Numan!
> >> >>>>>>
> >> >>>>>> Applied to main and backported to all branches down to 22.03.
> >> >>>>>>
> >> >>>>>> Regards,
> >> >>>>>> Dumitru
> >> >>>>>>
> >> >>>>>> _______________________________________________
> >> >>>>>> dev mailing list
> >> >>>>>> d...@openvswitch.org
> >> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >> >>>>>
> >> >>>>> Hi Dumitru, Numan, Tim and folks,
> >> >>>>>
> >> >>>>
> >> >>>> Hi Han,
> >> >>>>
> >> >>>>> I noticed that the HW offload of k8s nodePort traffic is broken
> >> >>>>> due to this change. The reason is that for client-to-nodePort
> >> >>>>> (LB with the VIP being the node IP) traffic, when the packet
> >> >>>>> goes through the unSNAT stage in the SNAT CT zone, since the
> >> >>>>> entry is never committed to the SNAT zone, it will have a CT
> >> >>>>> state of "new", which prevents HW offload from working for such
> >> >>>>> packets.
> >> >>>>>
> >> >>>>
> >> >>>> Sorry about that, I forgot we don't always commit in the SNAT
> >> >>>> zone.
> >> >>>>
> >> >>>>> At the moment I have to revert this change in our downstream.
> >> >>>>> For the problem that was fixed by this change [0], I think we
> >> >>>>> can avoid it by separating the port ranges of SNAT and DNAT.
> >> >>>>> For DNAT, the nodePort range in k8s is configured by the
> >> >>>>> API-server option:
> >> >>>>>
> >> >>>>> --service-node-port-range <a string in the form 'N1-N2'>  Default: 30000-32767
> >> >>>>>
> >> >>>>> For SNAT, it can be configured in the OVN NAT table's
> >> >>>>> external_port_range column, and we can choose something like
> >> >>>>> 10000-30000.
> >> >>>>>
> >> >>>>
> >> >>>> Tim, does this look OK to you? If it's acceptable to limit the
> >> >>>> SNAT port range, this workaround should be fine.
> >> >>>>
> >> >>>>> An extra benefit of this is that it reduces a CT recirc.
> >> >>>>>
> >> >>>>> Alternative solutions are:
> >> >>>>> Alternative 1: Always commit to both SNAT and DNAT zones. This
> >> >>>>> would introduce the unnecessary cost of CT entries and an extra
> >> >>>>> CT recirc.
> >> >>>>
> >> >>>> Extra recirculation aside, I would actually love it if we could
> >> >>>> use this alternative. I think it's the "most correct" option. I
> >> >>>> think it would allow us to avoid other workarounds like:
> >> >>>>
> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
> >> >>>>
> >> >>>> or
> >> >>>>
> >> >>>> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amu...@redhat.com/
> >> >>>>
> >> >>>> I do understand the worry about the extra recirculation though.
> >> >>>> In the HWOL context, does that cause a visible performance
> >> >>>> impact?
> >> >>>>
> >> >>>> We'd probably have to do more performance testing without HWOL
> >> >>>> to figure out the impact in the software datapath.
> >> >>>>
> >> >>>>> Alternative 2: Use a common zone for SNAT and DNAT. But there
> >> >>>>> are other issues reported for using the common zone [1].
> >> >>>>>
> >> >>>>> Could you let me know your thoughts on this?
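The range-separation workaround boils down to keeping the SNAT
external_port_range disjoint from the k8s nodeport (DNAT) range. A minimal
sketch of that check, using the "port_lo-port_hi" string format of the
external_port_range column (plain Python, for illustration only):

```python
def parse_range(spec):
    """Parse an inclusive 'lo-hi' port range string into (lo, hi)."""
    lo, hi = (int(p) for p in spec.split("-"))
    if not (1 <= lo <= hi <= 65535):
        raise ValueError(f"invalid port range: {spec}")
    return lo, hi

def ranges_overlap(a, b):
    """True if the two inclusive port ranges share any port."""
    (alo, ahi), (blo, bhi) = parse_range(a), parse_range(b)
    return alo <= bhi and blo <= ahi

# The k8s default nodeport range vs. a SNAT range picked below it:
assert not ranges_overlap("30000-32767", "10000-29999")
# A SNAT range reaching into the nodeport range would collide:
assert ranges_overlap("30000-32767", "25000-31000")
```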
> >> >>>>>
> >> >>>>
> >> >>>> On a related note, I know it has been discussed in different
> >> >>>> settings, but I don't think this ever moved forward: would it be
> >> >>>> possible for NVIDIA to help out with automatically testing HWOL
> >> >>>> impact for incoming patches?
> >> >>>>
> >> >>>> Maybe we could add some "simple" system-like tests that ensure
> >> >>>> that traffic is correctly offloaded in common scenarios?
> >> >>>> Alternatively, I guess we could also tag a subset of the existing
> >> >>>> system tests and just run those on actual hardware?
> >> >>>>
> >> >>>> It's quite simple (AFAIU) for external CIs to report status on
> >> >>>> each OVN patch posted on patchwork. That would at least allow us
> >> >>>> to flag this kind of breakage early (even before it gets merged).
> >> >>>>
> >> >>>> What do you think?
> >> >>>>
> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291
> >> >>>>> [1] b8c40e7593
> >> >>>>> https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Han
> >> >>>>>
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Dumitru
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev