Hi Han, My understanding is that the unSNAT stage was originally being skipped for LB ports, then Dumitru changed it to attempt unSNAT, and this breaks HWOL for ingress connections that never need to get unSNAT'ed and therefore not committed. In the OVNK use case for pods simply connecting egress, almost all packets will be SNAT'ed to the node IP. Therefore on egress reply, we always enter the unSNAT stage for these packets. So this skipping unSNAT stage seems to be specific to LB related traffic. While I agree splitting the port range would work, I think it makes more sense to just always commit in the unSNAT stage for all traffic. I get there is a performance hit there, but the datapath pipeline seems more consistent and I think outweighs the cost of committing LB traffic.
Thanks, Tim Rozet Red Hat OpenShift Networking Team On Fri, Aug 30, 2024 at 5:26 AM Dumitru Ceara <[email protected]> wrote: > On 8/29/24 18:14, Han Zhou wrote: > > On Wed, Mar 6, 2024 at 11:13 AM Dumitru Ceara <[email protected]> wrote: > >> > >> On 3/5/24 15:56, Numan Siddique wrote: > >>> On Mon, Feb 26, 2024 at 7:59 AM Dumitru Ceara <[email protected]> > wrote: > >>>> > >>>> Otherwise, in case there's also a SNAT rule that uses the VIP as > >>>> external IP, we break sessions initiated from behind the VIP. > >>>> > >>>> This partially reverts 832893bdbb42 ("ovn-northd: Skip unsnat flows > for > >>>> load balancer vips in router ingress pipeline"). That's OK because > >>>> commit 384a7c6237da ("northd: Refactor Logical Flows for routers with > >>>> DNAT/Load Balancers") addressed the original issue in a better way: > >>>> > >>>> In the reply direction, the order of traversal of the tables > >>>> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect > >>>> datapath flows that check ct_state in the wrong conntrack zone. > >>>> This is illustrated below where reply trafic enters the physical > host > >>>> port (6) and traverses DNAT zone (14), SNAT zone (default), back > to the > >>>> DNAT zone and then on to Logical Switch Port zone (22). The third > >>>> flow is incorrectly checking the state from the SNAT zone instead > >>>> of the DNAT zone. > >>>> > >>>> We also add a system test to ensure traffic initiated from behind a > VIP > >>>> + SNAT is not broken. > >>>> > >>>> Another nice side effect is that the northd I-P is slightly simplified > >>>> because we don't need to track NAT external IPs anymore. > >>>> > >>>> Fixes: 832893bdbb42 ("ovn-northd: Skip unsnat flows for load balancer > vips in router ingress pipeline") > >>>> Reported-at: https://issues.redhat.com/browse/FDP-291 > >>>> Signed-off-by: Dumitru Ceara <[email protected]> > >>> > >>> > >>> Thanks for the fix. It also simplified the lr-nat-stateful code. > >>> > >>> Acked-by: Numan Siddique <[email protected]> > >>> > >> > >> Thanks, Numan! > >> > >> Applied to main and backported to all branches down to 22.03. > >> > >> Regards, > >> Dumitru > >> > >> _______________________________________________ > >> dev mailing list > >> [email protected] > >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > > Hi Dumitru, Numan, Tim and folks, > > > > Hi Han, > > > I noticed that the HW offload of k8s nodePort traffic is broken due to > > this change. The reason is that for client to nodePort (LB with VIP > > being the node IP) traffic, when the packet is going through the > > unSNAT stage in the SNAT CT zone, since the entry is never committed > > to the SNAT zone, it will have CT state returned as "new", which > > prevents the HW offload to work for such packets. > > > > Sorry about that, I forgot we don't always commit in the SNAT zone. > > > At the moment I have to revert this change in our downstream. For the > > problem that was fixed by this change [0], I think we can avoid it by > > separating the port range of SNAT and DNAT. For DNAT, the nodePort > > range in k8s is configured by API-server option: > > > > --service-node-port-range <a string in the form 'N1-N2'> Default: > 30000-32767 > > > > For SNAT, it can be configured in the OVN's NAT table's > > external_port_range column, and we can choose something like > > 10000-30000. > > > > Tim, does this look OK to you? If it's acceptable to limit the SNAT > port range this workaround should be fine. > > > An extra benefit of this is that it reduces a CT recirc. > > > > Alternatively solutions are: > > Alternative1: Always commit to both SNAT and DNAT zones. This would > > introduce unnecessary cost of CT entries and extra CT recirc. > > Extra recirculation aside, I would actually love it if we could use this > alternative. I think it's the "most correct" option. I think it would > allow us to avoid other workarounds like: > > https://github.com/ovn-org/ovn/commit/40136a2f2c8 > > or > > > https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/ > > I do understand the worry about the extra recirculation though. In the > HWOL context does that cause visible performance impact? > > We'd probably have to do more performance testing without HWOL to figure > out the impact in the software datapath. > > > Alternative2: Use a common zone for SNAT and DNAT. But there are other > > issues reported for using the common zone [1] > > > > Could you let me know if other thoughts on this? > > > > On a related note, I know it has been discussed in different settings > but I don't think this ever moved forward: would it be possible for > NVIDIA to help out with automatically testing HWOL impact for incoming > patches? > > Maybe we could some "simple" system-like tests that ensure that traffic > is correctly offloaded in common scenarios? Alternatively, I guess we > could also tag a subset of the existing system tests and just run those > on actual hardware? > > It's quite simple (AFAIU) for external CIs to report status on each OVN > patch posted on patchwork. That would at least allow us to flag this > kind of breakages early (even before they get merged). > > What do you think? > > > [0] https://issues.redhat.com/browse/FDP-291 > > [1] b8c40e7593 > https://github.com/ovn-org/ovn/commit/b8c40e7593a9fa40a057268c507a912d67b99ec4 > > > > Thanks, > > Han > > > > Thanks, > Dumitru > > _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
