On Tue, May 20, 2025 at 1:37 AM Han Zhou <zhou...@gmail.com> wrote:
> > > On Mon, May 19, 2025 at 8:41 AM Tim Rozet <tro...@redhat.com> wrote:
> > >
> > > Hi All,
> > > Sorry Han that you are still facing issues with this problem. It seems
> > > like the best option would be to configure the range of ephemeral ports
> > > for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
> > > service range is configured as 30000-32767, and the ephemeral port
> > > range in the kernel is set to:
> > >
> > > net.ipv4.ip_local_port_range = 32768 60999
> > >
> > > I think it would make sense to revert and add an option to make this
> > > range configurable in OVN. Then OVN-Kubernetes should pass along this
> > > config to OVN.
> >
> > As Dumitru mentioned, we can't just revert the previous patch without a
> > substitute fix in place, otherwise we would risk regressing.
>
> Thanks Tim! For the substitute fix, I assume you meant a fix in ovn-k8s,
> right?
>
> OVN already provides a configurable SNAT port range through the
> external_port_range column in the NAT table. See:
> https://man7.org/linux/man-pages/man5/ovn-nb.5.html
>
>   external_port_range: string
>       L4 source port range
>
>       Range of ports, from which a port number will be picked that will
>       replace the source port of the to-be-NATed packet. This is
>       basically PAT (port address translation).
>
>       The value of the column is in the format port_lo-port_hi. For
>       example: external_port_range : "1-30000"
>
>       The valid range of ports is 1-65535.
>
> Does this satisfy the need you asked about?

Oh, I didn't realize this is already supported.

> If this is agreed, I think we can move forward with the steps below:
> 1. Update ovn-kubernetes to set external_port_range, if specific ranges
>    are configured. (This is only needed when the k8s-configured nodeport
>    range conflicts with the kernel ephemeral port range, and it seems
>    unnecessary for OpenShift?)

Yeah, we should provide a config knob to set the range though, in case ovnk
users do not use the default range.
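To make the split concrete, the knobs being discussed might look roughly like
this on a node. This is an illustrative sketch only: the NAT row selection
below is a placeholder, and the exact invocations should be double-checked
against ovn-nb(5) and ovn-nbctl(8) before relying on them.

```shell
# Keep the kernel ephemeral range clear of the 30000-32767 nodeport range
# (OpenShift already ships this split):
sysctl -w net.ipv4.ip_local_port_range="32768 60999"

# Constrain the source ports OVN picks when SNATing, via the NAT table's
# external_port_range column, e.g. on an existing SNAT entry (the "pick
# the first SNAT row" selection here is purely for illustration):
nat_uuid=$(ovn-nbctl --bare --columns=_uuid find NAT type=snat | head -n1)
ovn-nbctl set NAT "$nat_uuid" external_port_range="32768-60999"
```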
It looks like we already have a test case to cover this scenario as well:
https://github.com/openshift/ovn-kubernetes/pull/2120/commits/34b1a4be8cf9d4bf61d65464015c1b3cdb30adfd

I'll post a PR to do this part.

> 2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
>    towards VIPs" in OVN. Also revert some of the changes for
>    ct-commit-all: e.g. no need to commit DNATted connections to the SNAT
>    zone, and vice versa.

Sure, after 1 gets in, you can try to post a PR reverting the change and
let's see if it passes the above test.

> What do you think?
>
> Thanks,
> Han
>
> > Thanks,
> >
> > Tim Rozet
> > Red Hat OpenShift Networking Team
> >
> >
> > On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>
> >> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
> >> > On 5/15/25 6:24 PM, Han Zhou wrote:
> >> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <tro...@redhat.com> wrote:
> >> >>>
> >> >>> Hi Han,
> >> >>> My understanding is that the unSNAT stage was originally being
> >> >>> skipped for LB ports, then Dumitru changed it to attempt unSNAT, and
> >> >>> this breaks HWOL for ingress connections that never need to be
> >> >>> unSNATed and are therefore not committed. In the OVNK use case of
> >> >>> pods simply connecting egress, almost all packets will be SNATed to
> >> >>> the node IP. Therefore, on the egress reply, we always enter the
> >> >>> unSNAT stage for these packets. So skipping the unSNAT stage seems
> >> >>> to be specific to LB-related traffic. While I agree splitting the
> >> >>> port range would work, I think it makes more sense to just always
> >> >>> commit in the unSNAT stage for all traffic. I get there is a
> >> >>> performance hit there, but the datapath pipeline seems more
> >> >>> consistent, and I think that outweighs the cost of committing LB
> >> >>> traffic.
> >> >>
> >> >> Hi Tim and folks,
> >> >>
> >> >
> >> > Hi Han,
> >> >
> >> >> + @Ales Musil <amu...@redhat.com>
> >> >>
> >> >> We've spent a lot of effort on the alternative solution: always
> >> >> commit to both SNAT and DNAT zones. Unfortunately, the HW offload
> >> >> issue discussed in this thread is still unresolved. I'd like to
> >> >> discuss the current situation and ask for opinions.
> >> >>
> >> >
> >> > Thanks again for working on this!
> >> >
> >> >> A while ago, Ales tried to fix it with this commit: 800fd0681579
> >> >> ("northd: Add LR option to commit all traffic."). When the
> >> >> ct-commit-all option is enabled, it commits the connection to both
> >> >> the SNAT and DNAT zones. However, in the ovn-k8s nodeport scenario
> >> >> (a very typical use case) the HW offload is still broken. In that
> >> >> scenario, a packet may come from the external network with the dest
> >> >> IP being a node's IP, which is also used as an OVN gateway router
> >> >> IP. The dest IP is first DNATed to a LB backend IP, and then the src
> >> >> IP is SNATed to another LRP's IP on this router. The src/dst IPs in
> >> >> the different stages are:
> >> >>
> >> >> Original:   EXT IP <-> GR IP1
> >> >> After DNAT: EXT IP <-> BE IP
> >> >> After SNAT: GR IP2 <-> BE IP
> >> >>
> >> >> With the ct-commit-all option, it commits after DNAT and after SNAT:
> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >>
> >> >> So, at the UNSNAT stage, the packet traverses the SNAT zone with the
> >> >> original header, which will still always be in "new" state, which
> >> >> breaks HW offload.
> >> >>
> >> >
> >> > Ah, now I understand the problem you were facing.
> >> >
> >> >> I tried to fix it by also committing to the SNAT zone before DNAT
> >> >> (at the POST_SNAT stage).
> >> >> Now there are 3 entries committed:
> >> >> SNAT zone: EXT IP <-> GR IP1
> >> >> DNAT zone: EXT IP <-> GR IP1/BE IP
> >> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
> >> >>
> >> >> HW offload worked, but it breaks at least the scenario of "DNAT and
> >> >> SNAT on distributed router - E/W", which tests communication between
> >> >> two workloads using "floating IPs". In this scenario, both workloads
> >> >> are under different LSes behind the same router, and both have their
> >> >> own floating IPs (dnat_and_snat) on the public side of the LR. When
> >> >> one of them sends packets to the other using the floating IP, what's
> >> >> expected is:
> >> >>
> >> >> Original:   overlay IP1 <-> floating IP2
> >> >> After SNAT: floating IP1 <-> floating IP2
> >> >> After DNAT: floating IP1 <-> overlay IP2
> >> >>
> >> >> With my fix, the packet is first SNATed with a CT entry in the SNAT
> >> >> zone:
> >> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
> >> >>
> >> >> This happens in the egress pipeline of the LR, and then it enters
> >> >> the ingress pipeline again because this is like a hairpin (the dest
> >> >> IP is a dnat_and_snat IP on the same router). It will hit the UNSNAT
> >> >> stage first and traverse the CT zone; for the first packet the reply
> >> >> has not been seen yet, so the CT state is still new, so in
> >> >> POST_UNSNAT we will try to commit it (because of the issue I am
> >> >> trying to fix above). Because the connection floating IP1 <->
> >> >> floating IP2 conflicts with the above committed entry "SNAT: overlay
> >> >> IP1/floating IP1 <-> floating IP2", the CT action fails and the
> >> >> packet is dropped.
> >> >>
> >> >> There is still a way to solve this, probably by skipping the
> >> >> unnecessary UNSNAT traversal and the conflicting commit if the
> >> >> source IP is one of the snat/dnat_and_snat IPs of this same router.
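To make the failure mode concrete, here is a toy model of the conflicting
commit described above. This is plain Python, not OVN code; the tuple names
(overlay1, float1, float2) are illustrative placeholders, and a CT zone is
modeled as a set indexing both directions of each committed connection.

```python
def try_commit(zone, orig, reply):
    """Return True if the commit succeeds, False on a tuple collision."""
    if orig in zone or reply in zone:
        return False
    zone.update([orig, reply])
    return True

OVL1, FIP1, FIP2 = "overlay1", "float1", "float2"
snat_zone = set()

# Egress pipeline: the SNATed connection overlay1 -> float1 is committed
# first.  Orig tuple: overlay1 -> float2, reply tuple: float2 -> float1.
assert try_commit(snat_zone, (OVL1, FIP2), (FIP2, FIP1))

# Hairpin back into the ingress pipeline: POST_UNSNAT tries to commit the
# still-untranslated connection float1 -> float2, whose reply tuple
# float2 -> float1 collides with the entry above, so the CT action fails
# and the first packet is dropped.
assert not try_commit(snat_zone, (FIP1, FIP2), (FIP2, FIP1))
```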
> >> >>
> >> >> However, as we see, the NAT-related pipelines are already very
> >> >> complex today, and with the above changes they would become even
> >> >> more complex. I wonder if this is maintainable in the long run.
> >> >>
> >> >> In addition, I am not even sure this is the last problem with my
> >> >> patch. It is not easy to tell, because today the ct-commit-all
> >> >> feature is an option and is disabled by default. So all our test
> >> >> cases by default don't test this option, except the 4 test cases
> >> >> added by Ales's patch. I had to manually set this option to true as
> >> >> the default, and of course many test cases fail. Most of them fail
> >> >> because the expected flows don't match, but I can't tell for sure
> >> >> without updating every case with the new flows and retesting them.
> >> >> Before doing that, I just examined the ones I thought might be
> >> >> impacted by the patch, and "DNAT and SNAT on distributed router -
> >> >> E/W" is the first broken one I found. So there can be other issues
> >> >> not discovered yet.
> >> >>
> >> >> Moreover, we should also keep in mind the performance penalty of
> >> >> these extra CT commits & traversals.
> >> >>
> >> >
> >> > But basically this happens because we use the same SNAT zone for all
> >> > router ports.
> >> >
> >> > In an ideal world, wouldn't using a unique CT SNAT and CT DNAT zone
> >> > per router port fix it? E.g.:
> >> >
> >> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE: 12) [LR] LRP2 (SNAT ZONE:
> >> > 21, DNAT ZONE: 22) ---- host2
> >> >
> >> > That'd also imply that we commit that session (from host1 to host2 -
> >> > potentially via LB and SNAT) to 4 different zones (SNAT and DNAT for
> >> > LRP1 and SNAT and DNAT for LRP2). And we'd also have to do 4
> >> > different CT lookups for all packets traversing the router.
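The per-router-port zone idea can be sketched in the same toy-model style: if
each zone a packet traverses also gets a commit for the header it sees, no
lookup returns "new" mid-session. Zone numbers follow the example above
(LRP1: SNAT 11 / DNAT 12, LRP2: SNAT 21 / DNAT 22); the (src, dst) tuples are
placeholders, not OVN identifiers, and this is only a model of the intent,
not of the real pipeline.

```python
# One set of committed (src, dst) tuples per CT zone.
zones = {z: set() for z in (11, 12, 21, 22)}

def commit(zone, tup):
    zones[zone].add(tup)

def established(zone, tup):
    return tup in zones[zone]

# The header as each zone would see it while the first packet of a
# host1 -> host2 session (via an LB VIP, then SNAT) traverses the router:
first_packet = {
    11: ("h1", "vip"),  # LRP1 SNAT zone: original header
    12: ("h1", "vip"),  # LRP1 DNAT zone: DNATs vip -> h2
    21: ("h1", "h2"),   # LRP2 SNAT zone: SNATs h1 -> gr
    22: ("gr", "h2"),   # LRP2 DNAT zone: post-NAT header
}
for zone, tup in first_packet.items():
    commit(zone, tup)

# Every subsequent packet of the session finds a committed entry in all
# 4 zones, so no traversal ever sees "new" state again:
assert all(established(z, t) for z, t in first_packet.items())
```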
> >> >
> >> > I didn't try any of this, but I have the impression that it would
> >> > solve the HWOL issue: all subsequent packets are part of known and
> >> > committed sessions in all 4 conntrack zones.
> >> >
> >> > So what if we change the implementation of the "commit-all" feature
> >> > to add the missing ct commits? Wouldn't that work?
> >> >
> >> > It feels like (lookup performance hit aside) this is the correct way
> >> > to forward traffic, i.e., never traverse a CT zone without having a
> >> > committed entry in that zone.
> >> >
> >> >> I'd like to hear your opinions before moving forward. Is this worth
> >> >> the effort, or shall we just keep the pipeline (relatively) simple
> >> >> and avoid the original problem
> >> >> (https://issues.redhat.com/browse/FDP-291) by separating the port
> >> >> ranges of SNAT and DNAT, the first proposal at the beginning of
> >> >> this thread?
> >> >>
> >> >
> >> > Now that I understand the problem you're describing better, it might
> >> > be worth reverting the fix for FDP-291 like you said, until we find
> >> > a proper generic solution.
> >> >
> >> > However, we can't afford to do that without separating the port
> >> > ranges of SNAT and DNAT like you said [0], because we'd be
> >> > regressing (at least from an OpenShift perspective).
> >> >
> >>
> >> For reference, Tim, this is the OpenShift bug that needed the fix for
> >> FDP-291:
> >>
> >> https://issues.redhat.com/browse/OCPBUGS-25889
> >>
> >> > Because selecting the nodeport (DNAT) range is done through a k8s
> >> > API server cmdline argument, users can in theory choose whatever
> >> > range they want. I guess we'd need a command line argument in
> >> > ovn-kubernetes to allow users (operators, etc.) to inform
> >> > ovn-kubernetes what SNAT port range to use so that it doesn't
> >> > conflict with the KAPI nodeport range.
> >> >
> >> > Tim, you're the expert on this side, what are your thoughts on the
> >> > matter?
> >> >
> >> > Regards,
> >> > Dumitru
> >> >
> >> > [0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
> >> >
> >> >> Thanks,
> >> >> Han
> >> >>
> >> >>>
> >> >>> Thanks,
> >> >>> Tim Rozet
> >> >>> Red Hat OpenShift Networking Team
> >> >>>
> >> >>>
> >> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >> >>>>
> >> >>>> On 8/29/24 18:14, Han Zhou wrote:
> >> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >> >>>>>>
> >> >>>>>> On 3/5/24 15:56, Numan Siddique wrote:
> >> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Otherwise, in case there's also a SNAT rule that uses the VIP
> >> >>>>>>>> as an external IP, we break sessions initiated from behind the
> >> >>>>>>>> VIP.
> >> >>>>>>>>
> >> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat
> >> >>>>>>>> flows for load balancer vips in router ingress pipeline").
> >> >>>>>>>> That's OK because commit 384a7c6237da ("northd: Refactor
> >> >>>>>>>> Logical Flows for routers with DNAT/Load Balancers") addressed
> >> >>>>>>>> the original issue in a better way:
> >> >>>>>>>>
> >> >>>>>>>>     In the reply direction, the order of traversal of the
> >> >>>>>>>>     tables "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat"
> >> >>>>>>>>     adds incorrect datapath flows that check ct_state in the
> >> >>>>>>>>     wrong conntrack zone. This is illustrated below where
> >> >>>>>>>>     reply traffic enters the physical host port (6) and
> >> >>>>>>>>     traverses the DNAT zone (14), the SNAT zone (default),
> >> >>>>>>>>     back to the DNAT zone, and then on to the Logical Switch
> >> >>>>>>>>     Port zone (22). The third flow is incorrectly checking the
> >> >>>>>>>>     state from the SNAT zone instead of the DNAT zone.
> >> >>>>>>>>
> >> >>>>>>>> We also add a system test to ensure traffic initiated from
> >> >>>>>>>> behind a VIP + SNAT is not broken.
> >> >>>>>>>>
> >> >>>>>>>> Another nice side effect is that the northd I-P is slightly
> >> >>>>>>>> simplified because we don't need to track NAT external IPs
> >> >>>>>>>> anymore.
> >> >>>>>>>>
> >> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer vips in router ingress pipeline")
> >> >>>>>>>> Reported-at: https://issues.redhat.com/browse/FDP-291
> >> >>>>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Thanks for the fix. It also simplified the lr-nat-stateful
> >> >>>>>>> code.
> >> >>>>>>>
> >> >>>>>>> Acked-by: Numan Siddique <num...@ovn.org>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> Thanks, Numan!
> >> >>>>>>
> >> >>>>>> Applied to main and backported to all branches down to 22.03.
> >> >>>>>>
> >> >>>>>> Regards,
> >> >>>>>> Dumitru
> >> >>>>>>
> >> >>>>>> _______________________________________________
> >> >>>>>> dev mailing list
> >> >>>>>> d...@openvswitch.org
> >> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >> >>>>>
> >> >>>>> Hi Dumitru, Numan, Tim and folks,
> >> >>>>>
> >> >>>>
> >> >>>> Hi Han,
> >> >>>>
> >> >>>>> I noticed that the HW offload of k8s nodePort traffic is broken
> >> >>>>> due to this change. The reason is that for client-to-nodePort
> >> >>>>> (LB with the VIP being the node IP) traffic, when the packet
> >> >>>>> goes through the unSNAT stage in the SNAT CT zone, since the
> >> >>>>> entry is never committed to the SNAT zone, it will have a CT
> >> >>>>> state of "new", which prevents HW offload from working for such
> >> >>>>> packets.
> >> >>>>>
> >> >>>>
> >> >>>> Sorry about that, I forgot we don't always commit in the SNAT
> >> >>>> zone.
> >> >>>>
> >> >>>>> At the moment I have to revert this change in our downstream.
> >> >>>>> For the problem that was fixed by this change [0], I think we
> >> >>>>> can avoid it by separating the port ranges of SNAT and DNAT.
> >> >>>>> For DNAT, the nodePort range in k8s is configured by the
> >> >>>>> API-server option:
> >> >>>>>
> >> >>>>> --service-node-port-range <a string in the form 'N1-N2'>  Default: 30000-32767
> >> >>>>>
> >> >>>>> For SNAT, it can be configured in the OVN NAT table's
> >> >>>>> external_port_range column, and we can choose something like
> >> >>>>> 10000-30000.
> >> >>>>>
> >> >>>>
> >> >>>> Tim, does this look OK to you? If it's acceptable to limit the
> >> >>>> SNAT port range, this workaround should be fine.
> >> >>>>
> >> >>>>> An extra benefit of this is that it reduces a CT recirc.
> >> >>>>>
> >> >>>>> Alternative solutions are:
> >> >>>>> Alternative 1: Always commit to both SNAT and DNAT zones. This
> >> >>>>> would introduce the unnecessary cost of CT entries and an extra
> >> >>>>> CT recirc.
> >> >>>>
> >> >>>> Extra recirculation aside, I would actually love it if we could
> >> >>>> use this alternative. I think it's the "most correct" option. I
> >> >>>> think it would allow us to avoid other workarounds like:
> >> >>>>
> >> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
> >> >>>>
> >> >>>> or
> >> >>>>
> >> >>>> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amu...@redhat.com/
> >> >>>>
> >> >>>> I do understand the worry about the extra recirculation though.
> >> >>>> In the HWOL context, does that cause a visible performance
> >> >>>> impact?
> >> >>>>
> >> >>>> We'd probably have to do more performance testing without HWOL
> >> >>>> to figure out the impact in the software datapath.
> >> >>>>
> >> >>>>> Alternative 2: Use a common zone for SNAT and DNAT. But there
> >> >>>>> are other issues reported for using the common zone [1].
> >> >>>>>
> >> >>>>> Could you let me know your thoughts on this?
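The range-separation workaround boils down to keeping the SNAT
external_port_range disjoint from the k8s nodeport (DNAT) range. A minimal
sketch of that check, using the "port_lo-port_hi" string format of the
external_port_range column (plain Python, for illustration only):

```python
def parse_range(spec):
    """Parse an inclusive 'lo-hi' port range string into (lo, hi)."""
    lo, hi = (int(p) for p in spec.split("-"))
    if not (1 <= lo <= hi <= 65535):
        raise ValueError(f"invalid port range: {spec}")
    return lo, hi

def ranges_overlap(a, b):
    """True if the two inclusive port ranges share any port."""
    (alo, ahi), (blo, bhi) = parse_range(a), parse_range(b)
    return alo <= bhi and blo <= ahi

# The k8s default nodeport range vs. a SNAT range picked below it:
assert not ranges_overlap("30000-32767", "10000-29999")
# A SNAT range reaching into the nodeport range would collide:
assert ranges_overlap("30000-32767", "25000-31000")
```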
> >> >>>>>
> >> >>>>
> >> >>>> On a related note, I know it has been discussed in different
> >> >>>> settings, but I don't think this ever moved forward: would it be
> >> >>>> possible for NVIDIA to help out with automatically testing HWOL
> >> >>>> impact for incoming patches?
> >> >>>>
> >> >>>> Maybe we could add some "simple" system-like tests that ensure
> >> >>>> that traffic is correctly offloaded in common scenarios?
> >> >>>> Alternatively, I guess we could also tag a subset of the existing
> >> >>>> system tests and just run those on actual hardware?
> >> >>>>
> >> >>>> It's quite simple (AFAIU) for external CIs to report status on
> >> >>>> each OVN patch posted on patchwork. That would at least allow us
> >> >>>> to flag this kind of breakage early (even before it gets merged).
> >> >>>>
> >> >>>> What do you think?
> >> >>>>
> >> >>>>> [0] https://issues.redhat.com/browse/FDP-291
> >> >>>>> [1] b8c40e7593
> >> >>>>> https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Han
> >> >>>>>
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Dumitru
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev