> From: Numan Siddique <[email protected]> > > The stateless_icmp_helper flow added previously matched ICMPv4 > Destination Unreachable errors (type 3), so it fixed Path MTU discovery > and the other unreachable cases, but left the remaining ICMPv4 error > types un-NATed. Traceroute, which relies on Time Exceeded (type 11) > replies from intermediate hops, still hit the ct.inv drop in the > downstream logical switch: the error quotes the VM's post-SNAT datagram, > so the embedded inner source is the external IP and conntrack cannot > correlate it with the tracked outgoing probe. > > The inner-source un-NAT performed by icmp4.inner_ip4.src is correct for > every ICMPv4 error that quotes the original datagram. Broaden the > helper's match from "icmp4.type == 3" to "icmp4.type == {3, 11, 12}", > adding Time Exceeded (type 11, traceroute) and Parameter Problem > (type 12) to the Destination Unreachable errors already handled. > Redirect (type 5) is intentionally excluded: it has special NAT > semantics and is not an error that needs flow correlation. > > The pinctrl-side action is already ICMP-type-agnostic, so no controller > change is needed; this is purely a northd match change. > > Assisted-by: Claude Opus 4.8, Claude Code > Signed-off-by: Numan Siddique <[email protected]>
Acked-by: Lorenzo Bianconi <[email protected]> > --- > Documentation/ref/ovn-logical-flows.7.rst | 30 +++++++++++---------- > NEWS | 8 +++--- > northd/northd.c | 20 ++++++++------ > ovn-nb.xml | 32 +++++++++++------------ > tests/ovn.at | 23 ++++++++++++++++ > 5 files changed, 72 insertions(+), 41 deletions(-) > > diff --git a/Documentation/ref/ovn-logical-flows.7.rst > b/Documentation/ref/ovn-logical-flows.7.rst > index 56e7b3ef00..121791423a 100644 > --- a/Documentation/ref/ovn-logical-flows.7.rst > +++ b/Documentation/ref/ovn-logical-flows.7.rst > @@ -2778,16 +2778,17 @@ flows do not get programmed for load balancers with > IPv6 *VIPs*. > For an IPv4 stateless ``dnat_and_snat`` rule that has > ``options:stateless_icmp_helper`` set to ``true`` (the default), an > additional flow at priority *P + 1* is added that matches ``ip && ip4.dst > - == A && icmp4 && icmp4.type == 3`` with the action > + == A && icmp4 && icmp4.type == {3, 11, 12}`` with the action > ``ip4.dst = B; icmp4.inner_ip4.src = B; next;``, where *P* is the priority > of > the flow above. This rewrites the outer destination and un-NATs the source > - embedded in the inbound ICMPv4 Destination Unreachable error payload (from > - the external IP *A* back to the logical IP *B*) - every type-3 code quotes > - the original datagram, including ``Fragmentation Needed`` (code 4) - so > that > + embedded in the inbound ICMPv4 error payload (from the external IP *A* back > + to the logical IP *B*) for every error type that quotes the original > + datagram - Destination Unreachable (type 3, which includes ``Fragmentation > + Needed``), Time Exceeded (type 11) and Parameter Problem (type 12) - so > that > conntrack in the downstream logical switch can correlate the error with the > - tracked outgoing flow and Path MTU discovery (RFC 1191) works end-to-end > - across stateless NAT. See ``options:stateless_icmp_helper`` in the ``NAT`` > - table of the > + tracked outgoing flow. This makes Path MTU discovery (RFC 1191) and > + traceroute work end-to-end across stateless NAT. See > + ``options:stateless_icmp_helper`` in the ``NAT`` table of the > ``OVN_Northbound`` database (``ovn-nb`` (5)). The priority is *P + 1* so > that > the ``exempted_ext_ips`` bypass flow (at *P + 2*) still wins for traffic > excluded from NAT, and non-ICMP traffic falls through to the regular > @@ -2836,14 +2837,15 @@ the egress pipeline. > For an IPv4 stateless ``dnat_and_snat`` rule that has > ``options:stateless_icmp_helper`` set to ``true`` (the default), an > additional priority-101 flow is added that matches ``ip && ip4.dst == B && > - inport == GW && icmp4 && icmp4.type == 3`` with the action > + inport == GW && icmp4 && icmp4.type == {3, 11, 12}`` with the action > ``ip4.dst = B; icmp4.inner_ip4.src = B; next;``. This rewrites the outer > - destination and un-NATs the source embedded in the inbound ICMPv4 > - Destination Unreachable error payload (back to the logical IP *B*) - every > - type-3 code quotes the original datagram, including ``Fragmentation > Needed`` > - (code 4) - so that conntrack in the downstream logical switch can correlate > - the error with the tracked outgoing flow and Path MTU discovery (RFC 1191) > - works end-to-end across stateless NAT. See > + destination and un-NATs the source embedded in the inbound ICMPv4 error > + payload (back to the logical IP *B*) for every error type that quotes the > + original datagram - Destination Unreachable (type 3, which includes > + ``Fragmentation Needed``), Time Exceeded (type 11) and Parameter Problem > + (type 12) - so that conntrack in the downstream logical switch can > correlate > + the error with the tracked outgoing flow. This makes Path MTU discovery > + (RFC 1191) and traceroute work end-to-end across stateless NAT. See > ``options:stateless_icmp_helper`` in the ``NAT`` table of the > ``OVN_Northbound`` database (``ovn-nb`` (5)). > > diff --git a/NEWS b/NEWS > index 18374dc71b..aeb1cc5c3f 100644 > --- a/NEWS > +++ b/NEWS > @@ -37,9 +37,11 @@ Post v26.03.0 > IPv4 address embedded in an ICMPv4 error's inner packet. ovn-northd > uses it for stateless "dnat_and_snat" rules, controlled by the new > "options:stateless_icmp_helper" NAT option (default true), so that > - inbound ICMPv4 Destination Unreachable errors (type 3, including the > - "fragmentation needed" message) generated by an external router are > - un-NATed correctly and Path MTU discovery works through stateless NAT. > + inbound ICMPv4 errors that quote the original datagram - Destination > + Unreachable (type 3, including "fragmentation needed"), Time Exceeded > + (type 11) and Parameter Problem (type 12) - generated by an external > + router are un-NATed correctly. This makes Path MTU discovery and > + traceroute work through stateless NAT. > > OVN v26.03.0 - xxx xx xxxx > -------------------------- > diff --git a/northd/northd.c b/northd/northd.c > index 3bb9cafaac..e0820dbfb4 100644 > --- a/northd/northd.c > +++ b/northd/northd.c > @@ -17669,13 +17669,16 @@ build_lrouter_in_dnat_flow(struct lflow_table > *lflows, > * the inner ip4.src back to the logical IP, that lookup fails and the > * error is dropped as ct.inv. > * > - * Emit a higher-priority flow that matches the same external IP plus any > - * ICMPv4 Destination Unreachable error (type 3) and rewrites the outer > - * ip4.dst and the embedded inner ip4.src to the logical IP. Every > type-3 > - * code quotes the original datagram (RFC 792), so the inner-source > un-NAT > - * is correct for all of them; this covers Fragmentation Needed (code 4, > - * for PMTUD per RFC 1191) as well as host/port unreachable and the rest, > - * so PMTUD works end-to-end through stateless NAT. */ > + * Emit a higher-priority flow that matches the same external IP plus > + * any ICMPv4 error that quotes the original datagram - Destination > + * Unreachable (type 3, which includes code 4 Fragmentation Needed for > + * PMTUD), Time Exceeded (type 11, used by traceroute) and Parameter > + * Problem (type 12) - and rewrites the outer ip4.dst and the embedded > + * inner ip4.src to the logical IP. The inner-source un-NAT is correct > + * for every such error, so PMTUD, traceroute and the other ICMP errors > + * all work end-to-end through stateless NAT. Redirect (type 5) is > + * intentionally excluded: it has special NAT semantics and is not an > + * error that needs flow correlation. */ > if (stateless && !is_v6 && > smap_get_bool(&nat_entry->nb->options, "stateless_icmp_helper", > true)) { > @@ -17684,7 +17687,8 @@ build_lrouter_in_dnat_flow(struct lflow_table *lflows, > meter_groups); > size_t match_len = match->length; > > - ds_put_cstr(match, " && icmp4 && icmp4.type == 3"); > + ds_put_cstr(match, > + " && icmp4 && icmp4.type == {3, 11, 12}"); > > ds_clear(actions); > ds_put_format(actions, > diff --git a/ovn-nb.xml b/ovn-nb.xml > index 2fc8543868..33a6dc6765 100644 > --- a/ovn-nb.xml > +++ b/ovn-nb.xml > @@ -5470,28 +5470,28 @@ or > A stateless DNAT rule rewrites only the outer IPv4 destination of > inbound packets. For an inbound ICMPv4 error (for example a > <code>Fragmentation Needed</code> message generated for Path MTU > - discovery, RFC 1191), the original packet embedded in the ICMP > - payload still carries the external, post-NAT IP as its source. > - When the error reaches the downstream logical switch, conntrack > - cannot correlate the embedded tuple with the tracked outgoing > - flow, the packet is marked <code>ct.inv</code> and dropped by the > - ACL stage, and PMTU discovery breaks. > + discovery, RFC 1191, or a <code>Time Exceeded</code> message > + generated for a traceroute probe), the original packet embedded in > + the ICMP payload still carries the external, post-NAT IP as its > + source. When the error reaches the downstream logical switch, > + conntrack cannot correlate the embedded tuple with the tracked > + outgoing flow, the packet is marked <code>ct.inv</code> and dropped > + by the ACL stage, and PMTU discovery and traceroute break. > </p> > > <p> > When this option is <code>true</code>, <code>ovn-northd</code> > emits an additional, higher-priority logical flow in the router > - ingress DNAT stage that matches ICMPv4 Destination Unreachable > - (type 3) errors destined to the external IP. Every type-3 code > - quotes the original datagram, so this covers > - <code>Fragmentation Needed</code> (code 4) for PMTU discovery as > - well as host/port unreachable and the rest. It rewrites the outer > - IPv4 destination to the logical IP and, using the > - <code>icmp4.inner_ip4.src</code> action, un-NATs the embedded > + ingress DNAT stage that matches the ICMPv4 errors which quote the > + original datagram - Destination Unreachable (type 3, which includes > + <code>Fragmentation Needed</code>), Time Exceeded (type 11) and > + Parameter Problem (type 12) - destined to the external IP. It > + rewrites the outer IPv4 destination to the logical IP and, using > + the <code>icmp4.inner_ip4.src</code> action, un-NATs the embedded > inner IPv4 source from the external IP back to the logical IP, so > - that conntrack can correlate the error and PMTU discovery works > - end-to-end. Set it to <code>false</code> to suppress this flow for > - an individual NAT entry. > + that conntrack can correlate the error. This makes PMTU discovery > + and traceroute work end-to-end. Set it to <code>false</code> to > + suppress this flow for an individual NAT entry. > </p> > </column> > > diff --git a/tests/ovn.at b/tests/ovn.at > index e43579cdd7..d67ca2ead5 100644 > --- a/tests/ovn.at > +++ b/tests/ovn.at > @@ -22816,6 +22816,29 @@ as hv1 reset_pcap_file hv1-vif1 hv1/vif1 > > check as hv1 ovs-appctl netdev-dummy/receive hv1-vif2 $packet > > +# ICMP Time Exceeded error from a transit router to the NAT external IP. > +# The embedded original packet is the VM's outbound probe after stateless > +# SNAT, so its inner source is the external IP and its inner destination is > +# the traceroute target (50.0.0.100). > +packet=$(fmt_pkt "Ether(dst='$rtr_ext_mac', src='$ext1_mac')/ \ > + IP(src='$ext1_ip', dst='$nat_ext_ip', ttl=64)/ \ > + ICMP(type=11, code=0)/ \ > + IP(src='$nat_ext_ip', dst='50.0.0.100', ttl=1, proto=17)/ \ > + bytes.fromhex('$inner_l4')") > + > +# Expected packet delivered to vm1: outer destination DNATed to 10.0.0.3, > +# outer TTL decremented, L2 rewritten, and (via pinctrl) the inner IPv4 > +# source un-NATed to 10.0.0.3. The inner destination (50.0.0.100) is left > +# unchanged. > +expected=$(fmt_pkt "Ether(dst='$vm1_mac', src='$rtr_int_mac')/ \ > + IP(src='$ext1_ip', dst='$vm1_ip', ttl=63)/ \ > + ICMP(type=11, code=0)/ \ > + IP(src='$vm1_ip', dst='50.0.0.100', ttl=1, proto=17)/ \ > + bytes.fromhex('$inner_l4')") > +echo $expected >> vif1.expected > + > +check as hv1 ovs-appctl netdev-dummy/receive hv1-vif2 $packet > + > OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [vif1.expected]) > > OVN_CLEANUP([hv1]) > -- > 2.54.0 > > _______________________________________________ > dev mailing list > [email protected] > https://mail.openvswitch.org/mailman/listinfo/ovs-dev >
_______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
