On Thu, 2024-02-08 at 09:30 +0100, Ales Musil wrote:
> On Thu, Feb 8, 2024 at 1:22 AM Mark Michelson <[email protected]> wrote:
> > Hi Martin
> >
> > Thanks a bunch for your first OVN contribution. I have comments below.
> >
> > On 2/7/24 10:56, Martin Kalcok wrote:
> > > When a machine from an external network tries to access a machine in
> > > the OVN's internal network with SNAT enabled, using a protocol like
> > > TCP, it receives a connection reset after the first packets are
> > > successfully exchanged.
> > >
> > > This setup may be a bit unusual and requires that whatever is
> > > responsible for routing in the external network routes packets for
> > > OVN's internal network via the LR's external port IP. However, it is
> > > something that used to work in ML2/OVS Openstack deployments and got
> > > broken after the upgrade to ML2/OVN.
> > >
> > > To demonstrate what the traffic looks like from the point of view of
> > > the external machine, let's say we have:
> > > * E (IP of the machine in the external network)
> > > * I (IP of the machine in OVN's internal network)
> > > * LR (IP of OVN's Logical Router Port connected to the external network)
> > >
> > > The traffic looks like this:
> > > * [SYN] E -> I
> > > * [SYN,ACK] I -> E
> > > * [ACK] E -> I
> > > * [PSH,ACK] E -> I
> > > * [ACK] LR -> E
> > >
> > > The connection gets reset after "E" receives an ACK from an unexpected
> > > IP (LR).
> > >
> > > Although I'm not completely sure why the first [SYN,ACK] reply got back
> > > without SNAT, the root cause of this issue is that traffic initiated
> > > from the external network is not tracked in CT. This causes even
> > > traffic that's in the "reply" direction (i.e. it was initiated from
> > > outside the SNAT network) to be translated by the Logical Router.
> > >
> > > My proposal is to initiate CT and commit "new" connections destined
> > > to the internal network (with enabled SNAT) on the Logical Router.
> > > This will enable SNAT rules to translate only traffic that originates
> > > from internal networks.
> > >
> > > Signed-off-by: Martin Kalcok <[email protected]>
> > > ---
> > >  northd/northd.c | 70 ++++++++++++++++++++++++++++++++++++++++++-------
> > >  1 file changed, 61 insertions(+), 9 deletions(-)
> > >
> > > As this would be my first "real" contribution to OVN, I'd love to hear
> > > some feedback on my approach while I work on performance measurements
> > > and tests, which I plan on including in the final proposal.
> > > I also left some inline XXX comments marking parts that I'm not sure
> > > about and that need resolving before the final proposal.
> > > I tested this change in my local Openstack deployment with a
> > > distributed gw chassis and it seems to solve the issue regardless of
> > > the VM's placement, whether it does or does not have a floating IP. It
> > > also doesn't seem to break outbound connectivity, nor connectivity to
> > > VMs' floating IPs.
> > > Thank you for any reviews/suggestions.
> >
> > Without getting into the details of this particular patch, I have some
> > general suggestions for contributing:
> >
> > 1) Before sending a patch, run utilities/checkpatch.py against all
> > patches in the series. This can help 0-day Robot not to complain about
> > coding guidelines violations.
> >
> > 2) Run the testsuite locally before posting. If you see test errors,
> > double check if your change is what's causing them.
Sorry about the failing tests. I ran the unit tests and checked the errors
(the ones that were caused by me), but given that I intended this as an RFC
and I wasn't completely sure about my approach, I didn't want to spend too
much effort on fixing/adding unit tests that might get scrapped. In
hindsight, I should've put more effort in and posted clean work for review.

> >
> > 3) As a sanity check, consider pushing the code to your github fork, and
> > ensure that github actions complete without error. If there are any
> > errors, double check that they're not the fault of github (e.g. it's not
> > your fault if the action fails because the machine runs out of disk
> > space while allocating pods). In this case, the OVS robot links to
> > https://github.com/ovsrobot/ovn/actions/runs/7817861729/job/21326755181
> > , where there are some compilation errors reported. I've pointed out the
> > offending lines in the diff below.
> >
> > 4) Consider adding test cases. In this case, I suggest adding a couple:
> > * First, you can add one to ovn-northd.at that strictly tests whether
> > the generated logical flows for SNAT are as expected. It would also be
> > good to ensure that the flows are updated properly when the SNAT
> > configuration changes on the logical router.
> > * Second, I would add a test to system-ovn.at that simulates the
> > broken scenario. It could ensure that packets originated internally are
> > SNATted, and that replies to packets originated externally are not
> > SNATted.

Thank you, these are great starting points.
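For my own notes, here is a skeleton of what the system-ovn.at case might
look like. This is hand-written from memory of the existing test macros;
the router/port names and addresses are placeholders and the whole thing is
untested, so treat it as a sketch only:

```
AT_SETUP([ovn -- SNAT replies to externally initiated connections])
ovn_start

dnl Router with one external and one SNATed internal network
dnl (placeholder names and IPs).
check ovn-nbctl lr-add R1
check ovn-nbctl lrp-add R1 rp-ext 00:00:01:01:02:03 172.16.1.1/24
check ovn-nbctl lrp-add R1 rp-int 00:00:01:01:02:04 192.168.1.1/24
check ovn-nbctl lr-nat-add R1 snat 172.16.1.1 192.168.1.0/24

dnl TODO: set up namespaces for E (external) and I (internal), then verify
dnl that a TCP session initiated from E to I never carries the router's
dnl SNAT IP in replies, while traffic initiated from I is SNATted.

AT_CLEANUP
```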
> >
> > > diff --git a/northd/northd.c b/northd/northd.c
> > > index 01eec64ca..b695932fd 100644
> > > --- a/northd/northd.c
> > > +++ b/northd/northd.c
> > > @@ -10772,6 +10772,12 @@ build_ecmp_route_flow(struct lflow_table *lflows, struct ovn_datapath *od,
> > >      ds_destroy(&actions);
> > >  }
> > >
> > > +static inline bool
> > > +lrouter_use_common_zone(const struct ovn_datapath *od)
> > > +{
> > > +    return !od->is_gw_router && use_common_zone;
> > > +}
> > > +
> > >  static void
> > >  add_route(struct lflow_table *lflows, struct ovn_datapath *od,
> > >            const struct ovn_port *op, const char *lrp_addr_s,
> > > @@ -10796,6 +10802,9 @@ add_route(struct lflow_table *lflows, struct ovn_datapath *od,
> >
> > I'm not sure that add_route() is a good fit for this new code.
> > add_route() is called for every static route, logical router port IP,
> > load balancer VIP, and NAT external address. It doesn't really make
> > sense to be iterating through the router's NATs when adding a static
> > route, for instance. Further, this has the potential to attempt to add
> > duplicate flows since a router can (and likely will) have add_route()
> > called multiple times for its addresses.
> >
> > If you need to add new flows related to the configured NAT rules on a
> > router, I'd suggest looking for an appropriate place in
> > build_lrouter_nat_defrag_and_lb() or the functions it calls. We iterate
> > over the configured NAT rules there, so it's a natural fit for these
> > types of changes.

Thanks, I'll move the logic around and insert the rules into more
appropriate tables. My original proposal was based mostly on injecting
rules into tables that I saw traffic hitting on the way in/out in
ovn-trace. I strongly suspected that they might not be correct.
> >
> > >      build_route_match(op_inport, rtb_id, network_s, plen, is_src_route,
> > >                        is_ipv4, &match, &priority, ofs);
> > >
> > > +    bool use_ct = false;
> > > +    struct ds target_net = DS_EMPTY_INITIALIZER;
> > > +    struct ds snat_port_match = DS_EMPTY_INITIALIZER;
> > >      struct ds common_actions = DS_EMPTY_INITIALIZER;
> > >      struct ds actions = DS_EMPTY_INITIALIZER;
> > >      if (is_discard_route) {
> > > @@ -10808,16 +10817,61 @@ add_route(struct lflow_table *lflows, struct ovn_datapath *od,
> > >          } else {
> > >              ds_put_format(&common_actions, "ip%s.dst", is_ipv4 ? "4" : "6");
> > >          }
> > > +
> > > +        /* To enable direct connectivity from external networks to Logical IPs
> > > +         * of VMs/Containers in the SNATed networks, we need to track
> > > +         * connections initiated from external networks. This allows SNAT
> > > +         * rules to avoid SNATing packets in "reply" direction. */
> > > +        if (od->nbr && od->nbr->n_nat) {
> > > +            ds_put_format(&target_net, "%s/%d", network_s, plen);
> > > +            for (int i = 0; i < od->nbr->n_nat; i++) {
> > > +                const struct nbrec_nat *nat = od->nbr->nat[i];
> > > +                if (strcmp(nat->type, "snat") ||
> > > +                    strcmp(nat->logical_ip, ds_cstr(&target_net)) ||
> >
> > We allow IPv6 addresses for NAT. strcmp isn't a good way to compare IPv6
> > addresses since the same address can have many string representations.

Ack, thanks.

> >
> > > +                    lrouter_use_common_zone(od)) {
> > > +                    continue;
> > > +                }
> > > +
> > > +                /* Initiate connection tracking for traffic destined to
> > > +                 * SNATed network. */
> > > +                use_ct = true;
> > > +
> > > +                /* XXX: Need to figure out what is the appropriate priority
> > > +                   for these rules. All they require is to be after any
> > > +                   existing specific rules and before the default rule. */
> >
> > There aren't any hard and fast rules for picking a priority number,
> > other than what you stated in this comment.
> > As long as it gets the job done, you can pick any appropriate priority
> > number. It's best to try to leave some room for new priorities to be
> > inserted in case new rules arise, but other than that, pick whatever
> > you want.

Ack.

> >
> > > +
> > > +                /* Commit new connections initiated from the outside of
> > > +                 * SNATed network to CT. */
> > > +                ds_put_format(&snat_port_match,
> > > +                              "outport == %s && ct.new", op->json_key);
> > > +                ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_REQUEST, 5,
> > > +                                        ds_cstr(&snat_port_match),
> > > +                                        "ct_commit; output;", stage_hint);
> >
> > I'm not sure I understand the choice of S_ROUTER_IN_ARP_REQUEST for
> > performing the ct_commit; This doesn't have anything to do with ARP
> > requests, so it seems like a different table would fit better here.

So this was one of my biggest struggles. If I understand connection
tracking in OVN correctly (from docs/experimenting), in order to have
access to connection state (`ct.new`/`ct.rpl`/...) or to use `ct_commit`,
it must first be initiated with `ct_next`. This initiates connection
tracking information for the flow and moves on to the next table (stage?).
This meant that I had to alter at least two tables in each direction: one
that would call `ct_next` and another that would use this initiated context
to perform either `ct_commit` or match on fields like `ct.rpl`. The reason
why I put `ct_commit` here was that it is the last stage in the ingress
pipeline and the packet is guaranteed to hit it before being processed by
the egress pipeline. Maybe I just missed it, but I didn't find an action
that would initiate CT information (like `ct_next`) and resubmit the packet
to the same table instead of going to the next one. If there was an option
like this, I could keep the changes to a single stage.
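To make the two-stage arrangement above concrete, the flows the RFC
produces would look roughly like this in `ovn-sbctl lflow-list` style
output. This is a hand-written sketch: table numbers, priorities, the
192.168.1.0/24 subnet and the "lr-int" port name are all made up for
illustration:

```
  table=??(lr_in_ip_routing ), priority=74, match=(ip4.dst == 192.168.1.0/24),
    action=(ip.ttl--; ...; outport = "lr-int"; flags.loopback = 1; ct_next;)
  table=??(lr_in_arp_request), priority=5,  match=(outport == "lr-int" && ct.new),
    action=(ct_commit; output;)
```

The first flow sends routed packets through conntrack instead of plain
`next;`, and the second one commits new externally initiated connections
at the end of the ingress pipeline.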
> >
> > > +                ds_clear(&snat_port_match);
> > > +
> > > +                /* Initiate connection tracking for traffic from SNATed
> > > +                 * network to let rules in S_ROUTER_OUT_SNAT disregard
> > > +                 * traffic in "reply" direction. */
> > > +                ds_put_format(&snat_port_match, "inport == %s", op->json_key);
> > > +                ovn_lflow_add_with_hint(lflows, od, S_ROUTER_OUT_POST_UNDNAT,
> > > +                                        5, ds_cstr(&snat_port_match),
> > > +                                        "ct_next;", stage_hint);
> >
> > Your two calls to ovn_lflow_add_with_hint() are missing the final
> > lflow_ref parameter, which is why github actions failed when testing.

Ack, thanks.

> >
> > > +                break;
> > > +            }
> > > +        }
> > > +
> > >          ds_put_format(&common_actions, "; "
> > >                        "%s = %s; "
> > >                        "eth.src = %s; "
> > >                        "outport = %s; "
> > >                        "flags.loopback = 1; "
> > > -                      "next;",
> > > +                      "%s",
> > >                        is_ipv4 ? REG_SRC_IPV4 : REG_SRC_IPV6,
> > >                        lrp_addr_s,
> > >                        op->lrp_networks.ea_s,
> > > -                      op->json_key);
> > > +                      op->json_key,
> > > +                      use_ct ? "ct_next;" : "next;");
> >
> > I'm a bit confused by this addition. In your scenario, E sends traffic
> > to I, and I sends traffic to E. Traffic is never destined for a LRP
> > address, load balancer VIP, or NAT external address. So I don't know why
> > this flow would be matched in your scenario. Did I miss a detail
> > somewhere?

This flow would match because "I" is part of the destination network for
which this `add_route` is called.

> > Even if this is necessary for your scenario, I think this is adding
> > conntrack to situations where it shouldn't be present, such as traffic
> > destined for static routes.

I tried to limit conntrack initiation only to traffic destined to networks
with SNAT enabled. Anyway, as you pointed out earlier, these rules should
probably go elsewhere.
> >
> > >          ds_put_format(&actions, "ip.ttl--; %s", ds_cstr(&common_actions));
> > >      }
> > >
> > > @@ -10833,6 +10887,8 @@ add_route(struct lflow_table *lflows, struct ovn_datapath *od,
> > >                    ds_cstr(&common_actions),\
> > >                    stage_hint, lflow_ref);
> > >      }
> > > +    ds_destroy(&snat_port_match);
> > > +    ds_destroy(&target_net);
> > >      ds_destroy(&match);
> > >      ds_destroy(&common_actions);
> > >      ds_destroy(&actions);
> > > @@ -10933,12 +10989,6 @@ struct lrouter_nat_lb_flows_ctx {
> > >      const struct shash *meter_groups;
> > >  };
> > >
> > > -static inline bool
> > > -lrouter_use_common_zone(const struct ovn_datapath *od)
> > > -{
> > > -    return !od->is_gw_router && use_common_zone;
> > > -}
> > > -
> > >  static void
> > >  build_distr_lrouter_nat_flows_for_lb(struct lrouter_nat_lb_flows_ctx *ctx,
> > >                                       enum lrouter_nat_lb_flow_type type,
> > > @@ -14627,7 +14677,9 @@ build_lrouter_out_snat_flow(struct lflow_table *lflows,
> > >      }
> > >      ds_put_cstr(match, " && (!ct.trk || !ct.rpl)");
> > >
> > > -    ds_put_format(actions, "ct_snat(%s", nat->external_ip);
> > > +    /* ct_commit is needed here otherwise replies to the SNATed traffic
> > > +     * look like "new" traffic in S_ROUTER_IN_ARP_REQUEST */
> > > +    ds_put_format(actions, "ct_commit; ct_snat(%s", nat->external_ip);
> > >      if (nat->external_port_range[0]) {
> > >          ds_put_format(actions, ",%s", nat->external_port_range);
> > >      }
> >
> > I think the reason why this patch works is because it essentially ends
> > up sending everything to conntrack. The problem is that it's likely
> > sending more than is necessary.
> >
> > If you look in build_lrouter_out_snat_flow(), you can see that we have
> > " && (!ct.trk || !ct.rpl)" as part of the match before we attempt to
> > SNAT. Therefore, we at least are attempting not to SNAT reply traffic.
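To illustrate the behavior I'm seeing, here is a toy model of how the
"(!ct.trk || !ct.rpl)" match behaves with and without a prior commit. This
is plain Python, not OVN code, and every name in it is made up; it only
models conntrack as a set of committed 5-tuples (reduced to src/dst here):

```python
# Toy model: why "(!ct.trk || !ct.rpl)" still SNATs replies when the
# inbound direction was never committed to conntrack. Illustrative only.

class Conntrack:
    """Minimal conntrack: remembers committed (src, dst) pairs."""
    def __init__(self):
        self.committed = set()

    def commit(self, pkt):
        self.committed.add((pkt["src"], pkt["dst"]))

    def lookup(self, pkt):
        """Return (ct_trk, ct_rpl) the way a ct_next lookup would."""
        if (pkt["src"], pkt["dst"]) in self.committed:
            return True, False          # tracked, original direction
        if (pkt["dst"], pkt["src"]) in self.committed:
            return True, True           # tracked, reply direction
        return False, False             # untracked

def should_snat(ct_trk, ct_rpl):
    # Mirrors the lflow match "... && (!ct.trk || !ct.rpl)".
    return (not ct_trk) or (not ct_rpl)

ct = Conntrack()
syn = {"src": "E", "dst": "I"}
ack_reply = {"src": "I", "dst": "E"}

# Case 1: inbound traffic never hits conntrack (the reported bug).
# The reply from I is untracked, so the SNAT match still fires.
assert should_snat(*ct.lookup(ack_reply)) is True

# Case 2: the inbound SYN is committed (the behavior being discussed).
# Now the reply is recognized as ct.rpl and SNAT is skipped.
ct.commit(syn)
assert should_snat(*ct.lookup(ack_reply)) is False
```

So the match itself is fine; the problem is that without some `ct_next` and
commit in the inbound direction, the lookup never returns the reply state.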
This may again be my incomplete understanding of the role of `ct_next`, but
from my experiments, these rules with "&& (!ct.trk || !ct.rpl)" would not
actually match the expected reply traffic unless I injected `ct_next` into
the pipeline before this stage.

> >
> > As you noted, the [SYN,ACK] sent from "I" in your scenario is not
> > SNATted. This is because the [SYN] from "E" was committed to conntrack
> > in the S_ROUTER_OUT_POST_UNDNAT stage. Therefore, the [SYN,ACK] was
> > properly seen as a reply packet, and it was not SNATted.

Looking at the S_ROUTER_OUT_POST_UNDNAT stage, the commits happen only for
gateway routers. However, I observed this behavior in a deployment with a
distributed router. I'll keep trying to figure out why the first reply is
correct; it's still a mystery to me.

> >
> > However, for some reason the final [ACK] from "I" was not seen as a
> > reply packet, and therefore it was SNATted. I'm curious if anything
> > about the TCP connection changed between the initial handshake and the
> > subsequent packets that would make conntrack see the traffic as
> > "related" instead of part of the same connection. If so, then I think
> > the trick here is to ensure that ct.rel traffic is committed to
> > conntrack as well.

I tested this with a very simple combination of `nc -l` on one end and
`telnet` on the other. I captured a pcap [0] and it doesn't show port
changes or anything funky like that.

> >
> > Ales Musil added a new action type in OVN 22.12 called "ct_commit_nat".
> > Its intended purpose was to ensure that in addition to the direct
> > connection, related traffic would also be committed to conntrack.
> > Currently, it's only used for DNAT so that ICMP replies from load
> > balancer backends will be unDNATted.
> >
> > It seems we need something similar here, but it needs to be used when
> > traffic hits the UNSNAT stage of the router pipeline.
> > Maybe it would be enough to extend the ct_commit_nat command to take an
> > optional "snat" or "dnat" parameter to choose the appropriate conntrack
> > zone to commit to. Then ct.rel traffic in the S_ROUTER_IN_UNSNAT stage
> > could call ct_commit_nat(snat) instead of ct_snat to ensure that the
> > related traffic is committed to conntrack. I'm not 100% sure if that
> > will do the trick, but if it does, it will ensure a more focused amount
> > of traffic will get committed to conntrack. It would also be a much
> > simpler and smaller patch.
> >
> > @Ales Musil, did my suggestion above make sense? Or am I completely
> > approaching this incorrectly?
> >
> > _______________________________________________
> > dev mailing list
> > [email protected]
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
> Hi Martin and Mark,
>
> I'm not quite sure either why the first part of the handshake doesn't
> have the SNAT IP, because it should behave the same. Regardless of that,
> I did spend some time trying to figure out what exactly is going on.
>
> My understanding is that traffic from E to I is not going to CT at all,
> so any reply (I -> E) isn't seen as a reply, because it's committed into
> CT there via the SNAT. That's why I don't get the handshake part being
> correct when it shouldn't be, which seems to be in line with the
> description from Martin.
>
> Now for the solution: as Mark mentioned, the pipeline stages in this RFC
> are very wrong for this kind of task. What we need to achieve is to
> commit traffic that is heading towards the logical ip (cidr) of the SNAT.
>
> As Mark mentioned, this should be done in S_ROUTER_IN_UNSNAT with the
> match being the same as the one constructed in
> build_lrouter_in_unsnat_match, with the exception that the ip.dst should
> be the logical ip instead of the external one, and with the action being
> ct_commit; unfortunately, ct_commit uses the DNAT zone for the LR
> pipeline.
> Meaning we would need to extend that action to allow something like
> ct_commit(dnat), ct_commit(snat).
>
> However, before we get to all of that (and I'm not entirely sure it
> wouldn't break anything existing), I would like to know what the use
> case for this is. It seems slightly odd that it is desired to "leak" the
> internal IP, basically leaving it unprotected, when setting up FIPs or
> an LB is a more viable option IMO.
>
> Let me know if you have any further questions about the proposed
> solution.
>
> Thanks,
> Ales
> --
> Ales Musil
>
> Senior Software Engineer - OVN Core
>
> Red Hat EMEA
>
> [email protected]

Thank you all for an insightful review. There's a lot for me to think about
here. As for the actual use case, I'll double check to make sure I have all
the details and get back to you. Right now I mostly know that this is a use
case that exists; it used to work previously in ML2/OVS Openstack
deployments and broke after upgrading to ML2/OVN.

[0] https://pastebin.com/gmbNP1Bg
