On Mon, May 19, 2025 at 8:41 AM Tim Rozet <tro...@redhat.com> wrote:
>
> Hi All,
> Sorry Han that you are still facing issues with this problem. It seems
> like the best option would be to configure the range of ephemeral ports
> for OVN to use when doing an SNAT. Note, in OpenShift the nodeport
> service range is configured 30000-32767, and the ephemeral port range
> in the kernel is set to:
> net.ipv4.ip_local_port_range = 32768 60999
>
> I think it would make sense to revert and add an option to make this
> range configurable in OVN. Then OVN-Kubernetes should pass along this
> config to OVN.
>
> As Dumitru mentioned, we can't just revert the previous patch without a
> substitute fix in place, otherwise we would risk regressing.

Thanks Tim! For the substitute fix, I assume you meant a fix in ovn-k8s,
right?

OVN already provides a configurable SNAT port range through the
external_port_range column in the NAT table. See:
https://man7.org/linux/man-pages/man5/ovn-nb.5.html

       external_port_range: string
              L4 source port range

              Range of ports, from which a port number will be picked
              that will replace the source port of to be NATed packet.
              This is basically PAT (port address translation).

              Value of the column is in the format, port_lo-port_hi. For
              example: external_port_range : "1-30000"

              Valid range of ports is 1-65535.

Does this satisfy the need you described?
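For example, a sketch of how the column could be set (illustrative only:
the router name "GR0" and the IPs are placeholders, and the exact
ovn-nbctl invocation should be checked against the deployed version):

```shell
# Add an SNAT rule on gateway router GR0 (placeholder names/IPs), then
# restrict its ephemeral source-port range so it cannot collide with
# the k8s nodeport range (30000-32767).
ovn-nbctl lr-nat-add GR0 snat 172.16.0.10 10.244.0.0/24

# Look up the NAT row we just added and set external_port_range on it.
nat_uuid=$(ovn-nbctl --bare --columns=_uuid find NAT \
    external_ip=172.16.0.10 type=snat)
ovn-nbctl set NAT "$nat_uuid" external_port_range="10000-29999"
```

With OpenShift's defaults (nodeport 30000-32767, kernel ephemeral
32768-60999), a range like 10000-29999 should keep SNAT source ports
clear of both.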

If this is agreed, I think we can move forward with the steps below:
1. Update ovn-kubernetes to set external_port_range, if specific ranges
are configured. (This is only needed when the k8s-configured nodeport
range conflicts with the kernel ephemeral port range, and it seems
unnecessary for OpenShift?)
2. Revert the patch "northd: Don't skip the unSNAT stage for traffic
towards VIPs" in OVN. Also revert some of the changes for ct-commit-all:
e.g. no need to commit DNATted connections to SNAT zone, and vice versa.
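To illustrate the check in step 1: deciding whether external_port_range
needs to be set at all is just an interval-overlap test between the
configured nodeport range and the kernel ephemeral range. A minimal
sketch (function names are mine for illustration, not from
ovn-kubernetes):

```python
def parse_range(s: str) -> tuple[int, int]:
    """Parse a 'lo-hi' port range string, e.g. '30000-32767'."""
    lo, hi = (int(p) for p in s.split("-"))
    return lo, hi

def ranges_conflict(nodeport: str, ephemeral: str) -> bool:
    """True if the two port ranges overlap, i.e. SNAT source ports
    picked from the ephemeral range could collide with nodeport
    DNAT ports."""
    a_lo, a_hi = parse_range(nodeport)
    b_lo, b_hi = parse_range(ephemeral)
    return a_lo <= b_hi and b_lo <= a_hi

# OpenShift defaults: ranges are adjacent but disjoint -> no conflict,
# so setting external_port_range is unnecessary there.
print(ranges_conflict("30000-32767", "32768-60999"))  # False
# A custom nodeport range spilling into the ephemeral range -> conflict.
print(ranges_conflict("30000-40000", "32768-60999"))  # True
```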

What do you think?

Thanks,
Han

>
> Thanks,
>
> Tim Rozet
> Red Hat OpenShift Networking Team
>
>
> On Fri, May 16, 2025 at 7:59 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> On 5/16/25 1:22 PM, Dumitru Ceara wrote:
>> > On 5/15/25 6:24 PM, Han Zhou wrote:
>> >> On Fri, Aug 30, 2024 at 1:04 PM Tim Rozet <tro...@redhat.com> wrote:
>> >>>
>> >>> Hi Han,
>> >>> My understanding is that the unSNAT stage was originally being
>> >>> skipped for LB ports, then Dumitru changed it to attempt unSNAT,
>> >>> and this breaks HWOL for ingress connections that never need to
>> >>> get unSNAT'ed and therefore are not committed. In the OVNK use
>> >>> case for pods simply connecting egress, almost all packets will be
>> >>> SNAT'ed to the node IP. Therefore on egress reply, we always enter
>> >>> the unSNAT stage for these packets. So this skipping of the unSNAT
>> >>> stage seems to be specific to LB related traffic. While I agree
>> >>> splitting the port range would work, I think it makes more sense
>> >>> to just always commit in the unSNAT stage for all traffic. I get
>> >>> there is a performance hit there, but the datapath pipeline seems
>> >>> more consistent and I think it outweighs the cost of committing LB
>> >>> traffic.
>> >>
>> >> Hi Tim and folks,
>> >>
>> >
>> > Hi Han,
>> >
>> >> + @Ales Musil <amu...@redhat.com>
>> >>
>> >> We've spent a lot of effort on the alternative solution: Always
>> >> commit to both SNAT and DNAT zones. Unfortunately, the HW offload
>> >> issue discussed in this thread is still unresolved. I'd like to
>> >> discuss the current situation and ask for opinions.
>> >>
>> >
>> > Thanks again for working on this!
>> >
>> >> A while ago, Ales tried to fix it with this commit: 800fd0681579
>> >> ("northd: Add LR option to commit all traffic."). When the
>> >> ct-commit-all option is enabled, it commits the connection to both
>> >> SNAT and DNAT zones. However, in the ovn-k8s nodeport scenario (a
>> >> very typical use case) the HW offload is still broken. In that
>> >> scenario, a packet may come from the external network with the dest
>> >> IP being a node's IP, which is also used as an OVN gateway router
>> >> IP. The dest IP is firstly DNATed to a LB backend IP, and then the
>> >> src IP is SNATed to another LRP's IP on this router. The src/dst
>> >> IPs in the different stages are:
>> >>
>> >> Original: EXT IP <-> GR IP1
>> >> After DNAT: EXT IP <-> BE IP
>> >> After SNAT: GR IP2 <-> BE IP
>> >>
>> >> With the ct-commit-all option, it commits after DNAT and after SNAT:
>> >> DNAT zone: EXT IP <-> GR IP1/BE IP
>> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
>> >>
>> >> So, at the UNSNAT stage, the packet traverses the SNAT zone with
>> >> the original header, which will still always be in "new" state,
>> >> which breaks HW offload.
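To make the failure mode concrete, here is a toy model of the two CT
zones (purely illustrative, not OVN code): after the two ct-commit-all
commits, a lookup with the *original* header in the SNAT zone still
misses, so the state stays "new".

```python
# Toy conntrack: each zone is a set of committed endpoint pairs
# (real conntrack keys are full 5-tuples; this is simplified).
conntrack: dict[str, set[tuple[str, str]]] = {"snat": set(), "dnat": set()}

def commit(zone: str, src: str, dst: str) -> None:
    conntrack[zone].add((src, dst))

def lookup(zone: str, src: str, dst: str) -> str:
    """CT state a packet with this header would see in this zone."""
    if (src, dst) in conntrack[zone] or (dst, src) in conntrack[zone]:
        return "established"
    return "new"

# ct-commit-all in the nodeport scenario commits after DNAT and SNAT:
commit("dnat", "EXT", "GR1")  # DNAT zone: EXT IP <-> GR IP1 / BE IP
commit("snat", "EXT", "BE")   # SNAT zone: EXT IP / GR IP2 <-> BE IP

# The next packet of the flow hits UNSNAT with the original header
# (EXT IP <-> GR IP1): no matching entry in the SNAT zone, so the
# state is still "new" -- and HW offload breaks.
print(lookup("snat", "EXT", "GR1"))  # new
print(lookup("dnat", "EXT", "GR1"))  # established
```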
>> >>
>> >
>> > Ah, now I understand the problem you were facing.
>> >
>> >> I tried to fix it by also committing to the SNAT zone before DNAT
>> >> (at the POST_SNAT stage). Now there are 3 entries committed:
>> >> SNAT zone: EXT IP <-> GR IP1
>> >> DNAT zone: EXT IP <-> GR IP1/BE IP
>> >> SNAT zone: EXT IP/GR IP2 <-> BE IP
>> >>
>> >> HW offload worked, but it breaks at least the scenario of "DNAT
>> >> and SNAT on distributed router - E/W", which tests communication
>> >> between two workloads using "floating IPs". In this scenario, both
>> >> workloads are under different LSes behind the same router, and both
>> >> have their own floating IPs (dnat_and_snat) on the public side of
>> >> the LR. When one of them sends packets to the other using the
>> >> floating IP, what's expected is:
>> >>
>> >> Original: overlay IP1 <-> floating IP2
>> >> After SNAT: floating IP1 <-> floating IP2
>> >> After DNAT: floating IP1 <-> overlay IP2
>> >>
>> >> With my fix, the packet is firstly SNATed with CT entry in SNAT zone:
>> >> SNAT: overlay IP1/floating IP1 <-> floating IP2
>> >>
>> >> This happens in the egress pipeline of the LR, and then it enters
>> >> the ingress pipeline again because this is like a hairpin (the dest
>> >> IP is a dnat_and_snat IP on the same router). It will hit the
>> >> UNSNAT stage first and traverse the CT zone; for the first packet
>> >> the reply is not seen yet, so the CT state is still new, and in
>> >> POST_UNSNAT we will try to commit it (because of the issue I am
>> >> trying to fix above). Because the connection floating IP1 <->
>> >> floating IP2 conflicts with the above committed entry "SNAT:
>> >> overlay IP1/floating IP1 <-> floating IP2", the CT action fails and
>> >> the packet is dropped.
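The drop can be sketched with a toy SNAT zone that rejects a commit
whose reply tuple collides with an existing entry (illustrative only;
real conntrack keys are full 5-tuples and the conflict check is done by
the kernel CT action, not OVN):

```python
class SnatZone:
    """Toy SNAT conntrack zone with a tuple-conflict check on commit."""

    def __init__(self) -> None:
        # committed entries: (orig_src, orig_dst) -> translated src
        self.entries: dict[tuple[str, str], str] = {}

    def commit(self, src: str, dst: str, translated_src: str) -> bool:
        """Commit src->dst, SNATed to translated_src->dst.
        Returns False (packet dropped) on a reply-tuple conflict."""
        for (s, d), t in self.entries.items():
            # The new entry's reply tuple (dst -> translated_src) must
            # not collide with a different entry's reply tuple.
            if (dst, translated_src) == (d, t) and (src, dst) != (s, d):
                return False
        self.entries[(src, dst)] = translated_src
        return True

zone = SnatZone()
# Egress pipeline: SNAT commit for the hairpinned packet:
#   overlay IP1 / floating IP1 <-> floating IP2
assert zone.commit("overlay-IP1", "floating-IP2", "floating-IP1")
# Ingress POST_UNSNAT then tries to commit floating IP1 <-> floating IP2
# (no translation). Its reply tuple collides with the entry above, so
# the CT action fails and the packet is dropped.
print(zone.commit("floating-IP1", "floating-IP2", "floating-IP1"))  # False
```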
>> >>
>> >> There is still a way to solve this, probably by skipping the
>> >> unnecessary UNSNAT traversal and the conflicting commit if the
>> >> source IP is one of the snat/dnat_and_snat IPs of this same router.
>> >>
>> >> However, as we can see, the NAT related pipelines are already very
>> >> complex today, and with the above changes they would become even
>> >> more complex. I wonder if this is maintainable in the long run.
>> >>
>> >> In addition, I am not even sure if this is the last problem with
>> >> my patch. It is not easy to tell because today the ct-commit-all
>> >> feature is an option and is disabled by default. So none of our
>> >> test cases test this option except the 4 test cases added by Ales's
>> >> patch. I had to manually set this option to true as the default,
>> >> and of course many test cases fail. Most of them fail because the
>> >> expected flows won't match, but I can't tell for sure without
>> >> updating every case with new flows and retesting. Before doing that
>> >> I just examined the ones I think might be impacted by the patch,
>> >> and "DNAT and SNAT on distributed router - E/W" is the first broken
>> >> one I found. So there can be other issues not yet discovered.
>> >>
>> >> Moreover, we should also keep in mind the performance penalty of
>> >> these extra CT commits & traversals.
>> >>
>> >
>> > But basically this happens because we use the same SNAT zone for all
>> > router ports.
>> >
>> > In an ideal world wouldn't using a unique CT SNAT and CT DNAT zone per
>> > router port fix it?  E.g.:
>> >
>> > host1 ---- LRP1 (SNAT ZONE: 11, DNAT ZONE:12) [LR] LRP2 (SNAT ZONE: 21,
>> > DNAT ZONE: 22) ---- host2
>> >
>> > That'd also imply that we commit that session (from host1 to host2 -
>> > potentially via LB and SNAT) to 4 different zones (SNAT and DNAT for
>> > LRP1 and SNAT and DNAT for LRP2).  And we'd also have to do 4 different
>> > CT lookups for all packets traversing the router.
>> >
>> > I didn't try any of this but I have the impression that this would
>> > solve the HWOL issue: all subsequent packets are part of known and
>> > committed sessions in all 4 conntrack zones.
>> >
>> > So what if we change the implementation of the "commit-all" feature to
>> > add the missing ct commits?  Wouldn't that work?
>> >
>> > It feels like (lookup performance hit aside) this is the correct way to
>> > forward traffic, i.e., never traverse a CT zone without having a
>> > committed entry in that zone.
>> >
>> >> I'd like to listen to your opinions before moving forward. Is this
>> >> worth the effort, or shall we just keep the pipeline (relatively)
>> >> simple and avoid the original problem
>> >> (https://issues.redhat.com/browse/FDP-291) by separating the port
>> >> ranges of SNAT and DNAT, the first proposal at the beginning of
>> >> this thread?
>> >>
>> >
>> > Now that I understand the problem you're describing better it might
>> > be worth reverting the fix for FDP-291 like you said until we find a
>> > proper generic solution.
>> >
>> > However, we can't afford to do that without separating the port
>> > range of SNAT and DNAT like you said [0] because we'd be regressing
>> > (at least from an OpenShift perspective).
>> >
>>
>> For reference, Tim, this is the OpenShift bug that needed the fix for
>> FDP-291:
>>
>> https://issues.redhat.com/browse/OCPBUGS-25889
>>
>> > Because selecting the nodeport (DNAT) range is done through a k8s
>> > API server cmdline argument, users can in theory choose whatever
>> > range they want.  I guess we'd need a command line argument in
>> > ovn-kubernetes to allow users (operators, etc.) to inform
>> > ovn-kubernetes what SNAT port range to use so that it doesn't
>> > conflict with the KAPI nodeport range.
>> >
>> > Tim, you're the expert on this side, what are your thoughts on the
>> > matter?
>> >
>> > Regards,
>> > Dumitru
>> >
>> > [0] https://mail.openvswitch.org/pipermail/ovs-dev/2024-August/416974.html
>> >
>> >> Thanks,
>> >> Han
>> >>
>> >>>
>> >>> Thanks,
>> >>> Tim Rozet
>> >>> Red Hat OpenShift Networking Team
>> >>>
>> >>>
>> >>> On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <dce...@redhat.com> wrote:
>> >>>>
>> >>>> On 8/29/24 18:14, Han Zhou wrote:
>> >>>>> On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <dce...@redhat.com> wrote:
>> >>>>>>
>> >>>>>> On 3/5/24 15:56, Numan Siddique wrote:
>> >>>>>>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <dce...@redhat.com> wrote:
>> >>>>>>>> Otherwise, in case there's also a SNAT rule that uses the
>> >>>>>>>> VIP as external IP, we break sessions initiated from behind
>> >>>>>>>> the VIP.
>> >>>>>>>>
>> >>>>>>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip
>> >>>>>>>> unsnat flows for load balancer vips in router ingress
>> >>>>>>>> pipeline").  That's OK because commit 384a7c6237da ("northd:
>> >>>>>>>> Refactor Logical Flows for routers with DNAT/Load Balancers")
>> >>>>>>>> addressed the original issue in a better way:
>> >>>>>>>>
>> >>>>>>>>     In the reply direction, the order of traversal of the
>> >>>>>>>>     tables "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat"
>> >>>>>>>>     adds incorrect datapath flows that check ct_state in the
>> >>>>>>>>     wrong conntrack zone.  This is illustrated below where
>> >>>>>>>>     reply traffic enters the physical host port (6) and
>> >>>>>>>>     traverses DNAT zone (14), SNAT zone (default), back to
>> >>>>>>>>     the DNAT zone and then on to Logical Switch Port zone
>> >>>>>>>>     (22).  The third flow is incorrectly checking the state
>> >>>>>>>>     from the SNAT zone instead of the DNAT zone.
>> >>>>>>>>
>> >>>>>>>> We also add a system test to ensure traffic initiated from
>> >>>>>>>> behind a VIP + SNAT is not broken.
>> >>>>>>>>
>> >>>>>>>> Another nice side effect is that the northd I-P is slightly
>> >>>>>>>> simplified because we don't need to track NAT external IPs
>> >>>>>>>> anymore.
>> >>>>>>>>
>> >>>>>>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load
>> >>>>>>>> balancer vips in router ingress pipeline")
>> >>>>>>>> Reported-at: https://issues.redhat.com/browse/FDP-291
>> >>>>>>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Thanks for the fix.  It also simplified the lr-nat-stateful code.
>> >>>>>>>
>> >>>>>>> Acked-by: Numan Siddique <num...@ovn.org>
>> >>>>>>>
>> >>>>>>
>> >>>>>> Thanks, Numan!
>> >>>>>>
>> >>>>>> Applied to main and backported to all branches down to 22.03.
>> >>>>>>
>> >>>>>> Regards,
>> >>>>>> Dumitru
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> dev mailing list
>> >>>>>> d...@openvswitch.org
>> >>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>> >>>>>
>> >>>>> Hi Dumitru, Numan, Tim and folks,
>> >>>>>
>> >>>>
>> >>>> Hi Han,
>> >>>>
>> >>>>> I noticed that the HW offload of k8s nodePort traffic is broken
>> >>>>> due to this change. The reason is that for client to nodePort
>> >>>>> (LB with VIP being the node IP) traffic, when the packet is
>> >>>>> going through the unSNAT stage in the SNAT CT zone, since the
>> >>>>> entry is never committed to the SNAT zone, it will have CT state
>> >>>>> returned as "new", which prevents HW offload from working for
>> >>>>> such packets.
>> >>>>>
>> >>>>
>> >>>> Sorry about that, I forgot we don't always commit in the SNAT zone.
>> >>>>
>> >>>>> At the moment I have to revert this change in our downstream.
>> >>>>> For the problem that was fixed by this change [0], I think we
>> >>>>> can avoid it by separating the port range of SNAT and DNAT. For
>> >>>>> DNAT, the nodePort range in k8s is configured by the API-server
>> >>>>> option:
>> >>>>>
>> >>>>> --service-node-port-range <a string in the form 'N1-N2'>
>> >>>>> Default: 30000-32767
>> >>>>>
>> >>>>> For SNAT, it can be configured in the OVN NAT table's
>> >>>>> external_port_range column, and we can choose something like
>> >>>>> 10000-30000.
>> >>>>>
>> >>>>
>> >>>> Tim, does this look OK to you?  If it's acceptable to limit the SNAT
>> >>>> port range this workaround should be fine.
>> >>>>
>> >>>>> An extra benefit of this is that it reduces a CT recirc.
>> >>>>>
>> >>>>> Alternatively solutions are:
>> >>>>> Alternative1: Always commit to both SNAT and DNAT zones. This would
>> >>>>> introduce unnecessary cost of CT entries and extra CT recirc.
>> >>>>
>> >>>> Extra recirculation aside, I would actually love it if we could
>> >>>> use this alternative.  I think it's the "most correct" option.  I
>> >>>> think it would allow us to avoid other workarounds like:
>> >>>>
>> >>>> https://github.com/ovn-org/ovn/commit/40136a2f2c8
>> >>>>
>> >>>> or
>> >>>>
>> >>>> https://patchwork.ozlabs.org/project/ovn/patch/20240827085252.458355-1-amu...@redhat.com/
>> >>>>
>> >>>> I do understand the worry about the extra recirculation though.
>> >>>> In the HWOL context does that cause visible performance impact?
>> >>>>
>> >>>> We'd probably have to do more performance testing without HWOL to
>> >>>> figure out the impact in the software datapath.
>> >>>>
>> >>>>> Alternative2: Use a common zone for SNAT and DNAT. But there are
>> >>>>> other issues reported for using the common zone [1].
>> >>>>>
>> >>>>> Could you let me know if you have other thoughts on this?
>> >>>>>
>> >>>>
>> >>>> On a related note, I know it has been discussed in different
>> >>>> settings but I don't think this ever moved forward: would it be
>> >>>> possible for NVIDIA to help out with automatically testing HWOL
>> >>>> impact for incoming patches?
>> >>>>
>> >>>> Maybe we could add some "simple" system-like tests that ensure
>> >>>> that traffic is correctly offloaded in common scenarios?
>> >>>> Alternatively, I guess we could also tag a subset of the existing
>> >>>> system tests and just run those on actual hardware?
>> >>>>
>> >>>> It's quite simple (AFAIU) for external CIs to report status on
>> >>>> each OVN patch posted on patchwork.  That would at least allow us
>> >>>> to flag this kind of breakage early (even before it gets merged).
>> >>>>
>> >>>> What do you think?
>> >>>>
>> >>>>> [0] https://issues.redhat.com/browse/FDP-291
>> >>>>> [1] b8c40e7593
>> >>>>> https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Han
>> >>>>>
>> >>>>
>> >>>> Thanks,
>> >>>> Dumitru
