On Thu, 2024-02-08 at 09:30 +0100, Ales Musil wrote:
> 
> 
> On Thu, Feb 8, 2024 at 1:22 AM Mark Michelson <[email protected]>
> wrote:
> > Hi Martin
> > 
> > Thanks a bunch for your first OVN contribution. I have comments
> > below.
> > 
> > On 2/7/24 10:56, Martin Kalcok wrote:
> > > When a machine from an external network tries to access machine
> > > in the
> > > OVN's internal network with SNAT enabled, using protocol like
> > > TCP, it
> > > receives connection reset after first packets are successfully
> > > exchanged.
> > > 
> > > This setup may be a bit unusual and requires that whatever is
> > > responsible
> > > for routing in the external network, routes packets for OVN's
> > > internal
> > > network via LR's external port IP. However it is something that
> > > used to
> > > work in ML2/OVS Openstack deployments and got broken after
> > > upgrade to
> > > ML2/OVN.
> > > 
> > > To demonstrate what the traffic looked like from the point of
> > > view of the
> > > external machine, lets say we have:
> > >     * E  (IP of machine in the external network)
> > >     * I  (IP of machine in OVN's internal network)
> > >     * LR (IP of OVN's Logical Router Port connected to external
> > > network)
> > > 
> > > The traffic looks like this:
> > >     * [SYN]      E  -> I
> > >     * [SYN,ACK]  I  -> E
> > >     * [ACK]      E  -> I
> > >     * [PSH,ACK]  E  -> I
> > >     * [ACK]      LR -> E
> > > 
> > > Connection gets reset after it receives an ACK from an unexpected
> > > IP (LR).
> > > 
> > > Although I'm not completely sure why the first [SYN,ACK] reply
> > > got back
> > > without SNAT, root cause of this issue is that traffic initiated
> > > from
> > > the external network is not tracked in CT. This causes even
> > > traffic that's
> > > in the "reply" direction (i.e. it was initiated from outside the
> > > SNAT
> > > network), to be translated by Logical Router.
> > > 
> > > My proposal is to initiate CT and commit "new" connections
> > > destined
> > > to the internal network (with enabled SNAT) on Logical Router.
> > > This
> > > will enable SNAT rules to translate only traffic that originates
> > > from
> > > internal networks.
> > > 
> > > Signed-off-by: Martin Kalcok <[email protected]>
> > > ---
> > >    northd/northd.c | 70
> > > ++++++++++++++++++++++++++++++++++++++++++-------
> > >    1 file changed, 61 insertions(+), 9 deletions(-)
> > > 
> > > As this would be my first "real" contribution to the OVN, I'd
> > > love to hear
> > > some feedback on my approach while I work on performance
> > > measurements and
> > > tests, which I plan on including in final proposal.
> > > I also left some inline XXX comments marking parts that I'm not
> > > sure about
> > > and need resolving before final proposal.
> > > I tested this change in my local Openstack deployment with
> > > distributed gw
> > > chassis and it seems to solve the issue regardless of the VM's
> > > placement,
> > > whether it does or does not have a floating IP. It also doesn't
> > > seem to
> > > break outbound connectivity, nor connectivity to VM's floating
> > > IPs.
> > > Thank you for any reviews/suggestions.
> > 
> > Without getting into the details of this particular patch, I have
> > some 
> > general suggestions for contributing:
> > 
> > 1) Before sending a patch, run utilities/checkpatch.py against all 
> > patches in the series. This can help 0-day Robot not to complain
> > about 
> > coding guidelines violations.
> > 
> > 2) Run the testsuite locally before posting. If you see test
> > errors, 
> > double check if your change is what's causing them.

Sorry about the failing tests. I ran the unit tests and checked the
errors (the ones caused by me), but since I intended this as an RFC and
wasn't completely sure about my approach, I didn't want to spend too
much effort fixing/adding unit tests that might get scrapped. In
hindsight, I should've put more effort in and posted clean work for
review.

> > 
> > 3) As a sanity check, consider pushing the code to your github
> > fork, and 
> > ensure that github actions complete without error. If there are any
> > errors, double check that they're not the fault of github (e.g.
> > it's not 
> > your fault if the action fails because the machine runs out of disk
> > space while allocating pods). In this case, the OVS robot links to
> > https://github.com/ovsrobot/ovn/actions/runs/7817861729/job/21326755181
> > , where there are some compilation errors reported. I've pointed
> > out the 
> > offending lines in the diff below.
> > 
> > 4) Consider adding test cases. In this case, I suggest adding a
> > couple:
> >    * First, you can add one to ovn-northd.at that strictly tests
> > whether 
> > the generated logical flows for SNAT are as expected. It would also
> > be 
> > good to ensure that the flows are updated properly when the SNAT 
> > configuration changes on the logical router.
> >    * Second, I would add a test to system-ovn.at that simulates the
> > broken scenario. It could ensure that packets originated internally
> > are 
> > SNATted, and that replies to packets originated externally are not
> > SNATted.

Thank you, these are great starting points.
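
For the ovn-northd.at side, I'm thinking of something along these lines
(a rough skeleton only; the macro usage follows existing tests, but the
exact flow contents are placeholders until the final stages are
settled):

```
AT_SETUP([SNAT -- CT flows for externally initiated traffic])
ovn_start

check ovn-nbctl lr-add lr0
check ovn-nbctl lr-nat-add lr0 snat 172.16.1.1 192.168.0.0/24
check ovn-nbctl --wait=sb sync

# Check that the expected SNAT-related flows are present...
AT_CHECK([ovn-sbctl lflow-list lr0 | grep -c 'lr_out_snat'], [0], [ignore])

# ...and that they go away when the NAT rule is removed.
check ovn-nbctl lr-nat-del lr0 snat 192.168.0.0/24
check ovn-nbctl --wait=sb sync

AT_CLEANUP
```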

> > 
> > > 
> > > diff --git a/northd/northd.c b/northd/northd.c
> > > index 01eec64ca..b695932fd 100644
> > > --- a/northd/northd.c
> > > +++ b/northd/northd.c
> > > @@ -10772,6 +10772,12 @@ build_ecmp_route_flow(struct lflow_table
> > > *lflows, struct ovn_datapath *od,
> > >        ds_destroy(&actions);
> > >    }
> > >    
> > > +static inline bool
> > > +lrouter_use_common_zone(const struct ovn_datapath *od)
> > > +{
> > > +    return !od->is_gw_router && use_common_zone;
> > > +}
> > > +
> > >    static void
> > >    add_route(struct lflow_table *lflows, struct ovn_datapath *od,
> > >              const struct ovn_port *op, const char *lrp_addr_s,
> > > @@ -10796,6 +10802,9 @@ add_route(struct lflow_table *lflows,
> > > struct ovn_datapath *od,
> > 
> > I'm not sure that add_route() is a good fit for this new code. 
> > add_route() is called for every static route, logical router port
> > IP, 
> > load balancer VIP, and NAT external address. It doesn't really make
> > sense to be iterating through the router's NATs when adding a
> > static 
> > route, for instance. Further, this has the potential to attempt to
> > add 
> > duplicate flows since a router can (and likely will) have
> > add_route() 
> > called multiple times for its addresses.
> > 
> > If you need to add new flows related to the configured NAT rules on
> > a 
> > router, I'd suggest looking for an appropriate place in 
> > build_lrouter_nat_defrag_and_lb() or the functions it calls. We
> > iterate 
> > over the configured NAT rules there, so it's a natural fit for
> > these 
> > types of changes.

Thanks, I'll move the logic around and insert rules into more
appropriate tables. My original proposal was based mostly on injecting
rules into tables that I saw traffic hitting on the way in/out in
ovn-trace. I strongly suspected that they might not be correct.

> > 
> > >        build_route_match(op_inport, rtb_id, network_s, plen,
> > > is_src_route,
> > >                          is_ipv4, &match, &priority, ofs);
> > >    
> > > +    bool use_ct = false;
> > > +    struct ds target_net = DS_EMPTY_INITIALIZER;
> > > +    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
> > >        struct ds common_actions = DS_EMPTY_INITIALIZER;
> > >        struct ds actions = DS_EMPTY_INITIALIZER;
> > >        if (is_discard_route) {
> > > @@ -10808,16 +10817,61 @@ add_route(struct lflow_table *lflows,
> > > struct ovn_datapath *od,
> > >            } else {
> > >                ds_put_format(&common_actions, "ip%s.dst", is_ipv4
> > > ? "4" : "6");
> > >            }
> > > +
> > > +        /* To enable direct connectivity from external networks
> > > to Logical IPs
> > > +         * of VMs/Containers in the SNATed networks, we need to
> > > track
> > > +         * connections initiated from external networks.  This
> > > allows SNAT
> > > +         * rules to avoid SNATing packets in "reply" direction.
> > > */
> > > +        if (od->nbr && od->nbr->n_nat) {
> > > +            ds_put_format(&target_net, "%s/%d", network_s,
> > > plen);
> > > +            for (int i = 0; i < od->nbr->n_nat; i++) {
> > > +                const struct nbrec_nat *nat = od->nbr->nat[i];
> > > +                if (strcmp(nat->type, "snat") ||
> > > +                    strcmp(nat->logical_ip,
> > > ds_cstr(&target_net)) ||
> > 
> > We allow IPv6 addresses for NAT. strcmp isn't a good way to compare
> > IPv6 
> > addresses since the same address can have many string
> > representations.

Ack, thanks.

> > 
> > > +                    lrouter_use_common_zone(od)) {
> > > +                    continue;
> > > +                }
> > > +
> > > +                /* Initiate connection tracking for traffic
> > > destined to
> > > +                 * SNATed network. */
> > > +                use_ct = true;
> > > +
> > > +                /* XXX: Need to figure out what is appropraite
> > > priority for these rules.
> > > +                        All they require is to be after any
> > > existing specific rules
> > > +                        and before the default rule.*/
> > 
> > There aren't any hard and fast rules for picking a priority number,
> > other than what you stated in this comment. As long as it gets the
> > job 
> > done, you can pick any appropriate priority number. It's best to
> > try to 
> > leave some room for new priorities to be inserted in case new rules
> > arise, but other than that, pick whatever you want.

Ack.

> > 
> > > +
> > > +                /* Commit new connections initiated from the
> > > outside of
> > > +                 * SNATed network to CT. */
> > > +                ds_put_format(&snat_port_match,
> > > +                              "outport == %s && ct.new",
> > > +                              op->json_key);
> > > +                ovn_lflow_add_with_hint(lflows, od,
> > > S_ROUTER_IN_ARP_REQUEST, 5,
> > > +                                       
> > > ds_cstr(&snat_port_match),
> > > +                                        "ct_commit; output;",
> > > stage_hint);
> > 
> > I'm not sure I understand the choice of S_ROUTER_IN_ARP_REQUEST for
> > performing the ct_commit; This doesn't have anything to do with ARP
> > requests, so it seems like a different table would fit better here.

So this was one of my biggest struggles. If I understand connection
tracking in OVN correctly (from docs/experimenting), in order to have
access to connection state (`ct.new`/`ct.rpl`/...) or to use
`ct_commit`, it must first be initiated with `ct_next`. This initiates
connection tracking information for the flow and moves on to the next
table (stage?). This meant for me that I had to alter at least two
tables in each direction. One that would call `ct_next` and another
that would use this initiated context to perform either `ct_commit` or
match on fields like `ct.rpl`.
The reason why I put `ct_commit` here was that it was the last stage in
ingress pipeline and packet was guaranteed to hit it before being
processed by egress pipeline.
Maybe I just missed it, but I didn't find an action that would initiate
CT information (like `ct_next`) and resubmit the packet to the same
table instead of going to the next one. If there were an option like
this, I could keep the changes to a single stage.
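
To illustrate the two-stage shape I ended up with (stage names and
priorities are only placeholders from my RFC, not a final choice):

```
# 1) An earlier ingress stage sends the packet through conntrack;
#    ct.* fields are only valid in stages after this one.
table=?(lr_in_*), match=(ip4.dst == <snat logical net>), action=(ct_next;)

# 2) A later stage can then match on CT state and commit.
table=?(lr_in_arp_request), match=(outport == "lrp-ext" && ct.new),
    action=(ct_commit; output;)
```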

> > 
> > > +                ds_clear(&snat_port_match);
> > > +
> > > +                /* Initiate connection tracking for traffic from
> > > SNATed
> > > +                 * network to enable rules in S_ROUTER_OUT_SNAT
> > > disregard
> > > +                 * traffic in "reply" direction. */
> > > +                ds_put_format(&snat_port_match, "inport == %s",
> > > op->json_key);
> > > +                ovn_lflow_add_with_hint(lflows, od,
> > > S_ROUTER_OUT_POST_UNDNAT,
> > > +                                        5,
> > > ds_cstr(&snat_port_match),
> > > +                                        "ct_next;", stage_hint);
> > 
> > Your two calls to ovn_lflow_add_with_hint() are missing the final 
> > lflow_ref parameter, which is why github actions failed when
> > testing.

Ack, thanks.

> > 
> > > +                break;
> > > +            }
> > > +        }
> > > +
> > >            ds_put_format(&common_actions, "; "
> > >                          "%s = %s; "
> > >                          "eth.src = %s; "
> > >                          "outport = %s; "
> > >                          "flags.loopback = 1; "
> > > -                      "next;",
> > > +                      "%s",
> > >                          is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
> > >                          lrp_addr_s,
> > >                          op->lrp_networks.ea_s,
> > > -                      op->json_key);
> > > +                      op->json_key,
> > > +                      use_ct ? "ct_next;" : "next;");
> > 
> > I'm a bit confused by this addition. In your scenario, E sends
> > traffic 
> > to I, and I sends traffic to E. Traffic is never destined for a LRP
> > address, load balancer VIP, or NAT external address. So I don't
> > know why 
> > this flow would be matched in your scenario. Did I miss a detail
> > somewhere?

This flow would match because "I" is part of the destination network
for which this `add_route` is called.

> > 
> > Even if this is necessary for your scenario, I think this is adding
> > conntrack to situations where it shouldn't be present, such as
> > traffic 
> > destined for static routes.

I tried to limit conntrack initiation to traffic destined only to
networks with SNAT enabled. Anyway, as you pointed out earlier, these
rules should probably go elsewhere.

> > 
> > >            ds_put_format(&actions, "ip.ttl--; %s",
> > > ds_cstr(&common_actions));
> > >        }
> > >    
> > > @@ -10833,6 +10887,8 @@ add_route(struct lflow_table *lflows,
> > > struct ovn_datapath *od,
> > >                                    ds_cstr(&common_actions),\
> > >                                    stage_hint, lflow_ref);
> > >        }
> > > +    ds_destroy(&snat_port_match);
> > > +    ds_destroy(&target_net);
> > >        ds_destroy(&match);
> > >        ds_destroy(&common_actions);
> > >        ds_destroy(&actions);
> > > @@ -10933,12 +10989,6 @@ struct lrouter_nat_lb_flows_ctx {
> > >        const struct shash *meter_groups;
> > >    };
> > >    
> > > -static inline bool
> > > -lrouter_use_common_zone(const struct ovn_datapath *od)
> > > -{
> > > -    return !od->is_gw_router && use_common_zone;
> > > -}
> > > -
> > >    static void
> > >    build_distr_lrouter_nat_flows_for_lb(struct
> > > lrouter_nat_lb_flows_ctx *ctx,
> > >                                         enum
> > > lrouter_nat_lb_flow_type type,
> > > @@ -14627,7 +14677,9 @@ build_lrouter_out_snat_flow(struct
> > > lflow_table *lflows,
> > >        }
> > >        ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
> > >    
> > > -    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
> > > +    /* ct_commit is needed here otherwise replies to the SNATed
> > > traffic
> > > +     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST*/
> > > +    ds_put_format(actions, "ct_commit; ct_snat(%s", nat-
> > > >external_ip);
> > >        if (nat->external_port_range[0]) {
> > >            ds_put_format(actions, ",%s", nat-
> > > >external_port_range);
> > >        }
> > I think the reason why this patch works is because it essentially
> > ends 
> > up sending everything to conntrack. The problem is that it's likely
> > sending more than is necessary.
> > 
> > If you look in build_lrouter_out_snat_flow(), you can see that we
> > have " 
> > && (!ct.trk || !ct.rpl)" as part of the match before we attempt to
> > SNAT. 
> > Therefore, we at least are attempting not to SNAT reply traffic.

This may again be down to my incomplete understanding of the role of
`ct_next`, but in my experiments, rules with "&& (!ct.trk || !ct.rpl)"
would not actually match the expected reply traffic unless I injected
`ct_next` into the pipeline before this stage.

> > 
> > As you noted, the [SYN,ACK] sent from "I" in your scenario is not 
> > SNATted. This is because the [SYN] from "E" was committed to
> > conntrack 
> > in the S_ROUTER_OUT_POST_UNDNAT stage. Therefore, the [SYN,ACK] was
> > properly seen as a reply packet, and it was not SNATted.

Looking at the S_ROUTER_OUT_POST_UNDNAT stage, the commits happen only
for gateway routers. However, I observed this behavior in a deployment
with a distributed router. I'll keep trying to figure out why the first
reply is correct; it's still a mystery to me.

> > 
> > However, for some reason the final [ACK] from "I" was not seen as a
> > reply packet, and therefore it was SNATted. I'm curious if anything
> > about the TCP connection changed between the initial handshake and
> > the 
> > subsequent packets that would make conntrack see the traffic as 
> > "related" instead of part of the same connection. If so, then I
> > think 
> > the trick here is to ensure that ct.rel traffic is committed to 
> > conntrack as well.

I tested this with a very simple combination of `nc -l` on one end and
`telnet` on the other. I captured a pcap [0] and it doesn't show port
changes or anything funky like that.

> > 
> > Ales Musil added a new action type in OVN 22.12 called
> > "ct_commit_nat" . 
> > Its intended purpose was to ensure that in addition to the direct 
> > connection, related traffic would also be committed to conntrack. 
> > Currently, it's only used for DNAT so that ICMP replies from load 
> > balancer backends will be unDNATted.
> > 
> > It seems we need something similar here, but it needs to be used
> > when 
> > traffic hits the UNSNAT stage of the router pipeline. Maybe it
> > would be 
> > enough to extend the ct_commit_nat command to take an optional
> > "snat" or 
> > "dnat" parameter to choose the appropriate conntrack zone to commit
> > to. 
> > Then ct.rel traffic in the S_ROUTER_IN_UNSNAT stage could call 
> > ct_commit_nat(snat) instead of ct_snat to ensure that the related 
> > traffic is committed to conntrack. I'm not 100% sure if that will
> > do the 
> > trick, but if it does, it will ensure a more focused amount of
> > traffic 
> > will get committed to conntrack. It would also be a much simpler
> > and 
> > smaller patch.
> > 
> > @Ales Musil, did my suggestion above make sense? Or am I completely
> > approaching this incorrectly?
> > 
> > 
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > 
> 
> 
> Hi Martin and Mark,
> 
> I'm not quite sure either why the first handshake part doesn't have
> the
> SNAT IP because it should behave the same. Regardless of that
> I did spend some time trying to figure out what is exactly going on.
> 
> My understanding is that traffic from E to I is not going to CT at
> all,
> so any reply (I -> E) isn't seen as a reply because it's committed
> into
> CT there via the SNAT. Which is why I don't get the handshake part
> being correct when it shouldn't. This seems to be in line with the
> description from Martin.
> 
> Now for the solution as Mark mentioned the pipeline stages in this
> RFC
> are very wrong for this kind of task. What we need to achieve is to
> commit traffic that is heading towards the logical ip (cidr) of SNAT.
> 
> As Mark mentioned this should be done in S_ROUTER_IN_UNSNAT
> with the match being the same as constructed in
> build_lrouter_in_unsnat_match with exception that the ip.dst should
> be
> the logical ip instead of the external one with action being
> ct_commit,
> unfortunately ct_commit is using the DNAT zone for LR pipeline.
> Meaning we would need to extend that action to allow something like
> ct_commit(dnat), ct_commit(snat). 
> 
> However, before we get to all of that, which I'm not entirely sure if
> that
> might break anything existing, I would like to know what is the use
> case for
> this. It seems slightly odd that it is desired to "leak" the internal
> IP
> and basically leaving it unprotected when setting up FIPS or LB is a
> more
> viable option IMO.
> 
> Let me know if you have any further questions about the proposed
> solution.
> 
> Thanks,
> Ales
> -- 
> Ales Musil
> Senior Software Engineer - OVN Core
> Red Hat EMEA
> [email protected]

Thank you all for an insightful review. There's a lot for me to think
about here.
As for the actual use case, I'll double-check to make sure I have all
the details and get back to you. Right now I mostly know that this use
case exists: it used to work in an ML2/OVS OpenStack deployment and
broke after upgrading to ML2/OVN.
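
For my own notes, if I read the proposal right, the new lflow would
look roughly like this (syntax purely illustrative; the
`ct_commit(snat)` variant does not exist yet and the addresses are made
up):

```
table=?(lr_in_unsnat), priority=?,
  match=(ip && ip4.dst == 192.168.1.0/24),  # logical_ip of the SNAT, not external_ip
  action=(ct_commit(snat);)                 # proposed zone-selecting variant
```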

[0] https://pastebin.com/gmbNP1Bg