From: Numan Siddique <[email protected]>
The stateless_icmp_helper flow added previously matched ICMPv4
Destination Unreachable errors (type 3), so it fixed Path MTU discovery
and the other unreachable cases, but left the remaining ICMPv4 error
types un-NATed. Traceroute, which relies on Time Exceeded (type 11)
replies from intermediate hops, still hit the ct.inv drop in the
downstream logical switch: the error quotes the VM's post-SNAT datagram,
so the embedded inner source is the external IP and conntrack cannot
correlate it with the tracked outgoing probe.
The inner-source un-NAT performed by icmp4.inner_ip4.src is correct for
every ICMPv4 error that quotes the original datagram. Broaden the
helper's match from "icmp4.type == 3" to "icmp4.type == {3, 11, 12}",
adding Time Exceeded (type 11, traceroute) and Parameter Problem
(type 12) to the Destination Unreachable errors already handled.
Redirect (type 5) is intentionally excluded: it has special NAT
semantics and is not an error that needs flow correlation.
The pinctrl-side action is already ICMP-type-agnostic, so no controller
change is needed; this is purely a northd match change.
Assisted-by: Claude Opus 4.8, Claude Code
Signed-off-by: Numan Siddique <[email protected]>
---
Documentation/ref/ovn-logical-flows.7.rst | 30 +++++++++++----------
NEWS | 8 +++---
northd/northd.c | 20 ++++++++------
ovn-nb.xml | 32 +++++++++++------------
tests/ovn.at | 23 ++++++++++++++++
5 files changed, 72 insertions(+), 41 deletions(-)
diff --git a/Documentation/ref/ovn-logical-flows.7.rst
b/Documentation/ref/ovn-logical-flows.7.rst
index 56e7b3ef00..121791423a 100644
--- a/Documentation/ref/ovn-logical-flows.7.rst
+++ b/Documentation/ref/ovn-logical-flows.7.rst
@@ -2778,16 +2778,17 @@ flows do not get programmed for load balancers with
IPv6 *VIPs*.
For an IPv4 stateless ``dnat_and_snat`` rule that has
``options:stateless_icmp_helper`` set to ``true`` (the default), an
additional flow at priority *P + 1* is added that matches ``ip && ip4.dst
- == A && icmp4 && icmp4.type == 3`` with the action
+ == A && icmp4 && icmp4.type == {3, 11, 12}`` with the action
``ip4.dst = B; icmp4.inner_ip4.src = B; next;``, where *P* is the priority of
the flow above. This rewrites the outer destination and un-NATs the source
- embedded in the inbound ICMPv4 Destination Unreachable error payload (from
- the external IP *A* back to the logical IP *B*) - every type-3 code quotes
- the original datagram, including ``Fragmentation Needed`` (code 4) - so that
+ embedded in the inbound ICMPv4 error payload (from the external IP *A* back
+ to the logical IP *B*) for every error type that quotes the original
+ datagram - Destination Unreachable (type 3, which includes ``Fragmentation
+ Needed``), Time Exceeded (type 11) and Parameter Problem (type 12) - so that
conntrack in the downstream logical switch can correlate the error with the
- tracked outgoing flow and Path MTU discovery (RFC 1191) works end-to-end
- across stateless NAT. See ``options:stateless_icmp_helper`` in the ``NAT``
- table of the
+ tracked outgoing flow. This makes Path MTU discovery (RFC 1191) and
+ traceroute work end-to-end across stateless NAT. See
+ ``options:stateless_icmp_helper`` in the ``NAT`` table of the
``OVN_Northbound`` database (``ovn-nb`` (5)). The priority is *P + 1* so that
the ``exempted_ext_ips`` bypass flow (at *P + 2*) still wins for traffic
excluded from NAT, and non-ICMP traffic falls through to the regular
@@ -2836,14 +2837,15 @@ the egress pipeline.
For an IPv4 stateless ``dnat_and_snat`` rule that has
``options:stateless_icmp_helper`` set to ``true`` (the default), an
additional priority-101 flow is added that matches ``ip && ip4.dst == B &&
- inport == GW && icmp4 && icmp4.type == 3`` with the action
+ inport == GW && icmp4 && icmp4.type == {3, 11, 12}`` with the action
``ip4.dst = B; icmp4.inner_ip4.src = B; next;``. This rewrites the outer
- destination and un-NATs the source embedded in the inbound ICMPv4
- Destination Unreachable error payload (back to the logical IP *B*) - every
- type-3 code quotes the original datagram, including ``Fragmentation Needed``
- (code 4) - so that conntrack in the downstream logical switch can correlate
- the error with the tracked outgoing flow and Path MTU discovery (RFC 1191)
- works end-to-end across stateless NAT. See
+ destination and un-NATs the source embedded in the inbound ICMPv4 error
+ payload (back to the logical IP *B*) for every error type that quotes the
+ original datagram - Destination Unreachable (type 3, which includes
+ ``Fragmentation Needed``), Time Exceeded (type 11) and Parameter Problem
+ (type 12) - so that conntrack in the downstream logical switch can correlate
+ the error with the tracked outgoing flow. This makes Path MTU discovery
+ (RFC 1191) and traceroute work end-to-end across stateless NAT. See
``options:stateless_icmp_helper`` in the ``NAT`` table of the
``OVN_Northbound`` database (``ovn-nb`` (5)).
diff --git a/NEWS b/NEWS
index 18374dc71b..aeb1cc5c3f 100644
--- a/NEWS
+++ b/NEWS
@@ -37,9 +37,11 @@ Post v26.03.0
IPv4 address embedded in an ICMPv4 error's inner packet. ovn-northd
uses it for stateless "dnat_and_snat" rules, controlled by the new
"options:stateless_icmp_helper" NAT option (default true), so that
- inbound ICMPv4 Destination Unreachable errors (type 3, including the
- "fragmentation needed" message) generated by an external router are
- un-NATed correctly and Path MTU discovery works through stateless NAT.
+ inbound ICMPv4 errors that quote the original datagram - Destination
+ Unreachable (type 3, including "fragmentation needed"), Time Exceeded
+ (type 11) and Parameter Problem (type 12) - generated by an external
+ router are un-NATed correctly. This makes Path MTU discovery and
+ traceroute work through stateless NAT.
OVN v26.03.0 - xxx xx xxxx
--------------------------
diff --git a/northd/northd.c b/northd/northd.c
index 3bb9cafaac..e0820dbfb4 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -17669,13 +17669,16 @@ build_lrouter_in_dnat_flow(struct lflow_table *lflows,
* the inner ip4.src back to the logical IP, that lookup fails and the
* error is dropped as ct.inv.
*
- * Emit a higher-priority flow that matches the same external IP plus any
- * ICMPv4 Destination Unreachable error (type 3) and rewrites the outer
- * ip4.dst and the embedded inner ip4.src to the logical IP. Every type-3
- * code quotes the original datagram (RFC 792), so the inner-source un-NAT
- * is correct for all of them; this covers Fragmentation Needed (code 4,
- * for PMTUD per RFC 1191) as well as host/port unreachable and the rest,
- * so PMTUD works end-to-end through stateless NAT. */
+ * Emit a higher-priority flow that matches the same external IP plus
+ * any ICMPv4 error that quotes the original datagram - Destination
+ * Unreachable (type 3, which includes code 4 Fragmentation Needed for
+ * PMTUD), Time Exceeded (type 11, used by traceroute) and Parameter
+ * Problem (type 12) - and rewrites the outer ip4.dst and the embedded
+ * inner ip4.src to the logical IP. The inner-source un-NAT is correct
+ * for every such error, so PMTUD, traceroute and the other ICMP errors
+ * all work end-to-end through stateless NAT. Redirect (type 5) is
+ * intentionally excluded: it has special NAT semantics and is not an
+ * error that needs flow correlation. */
if (stateless && !is_v6 &&
smap_get_bool(&nat_entry->nb->options, "stateless_icmp_helper",
true)) {
@@ -17684,7 +17687,8 @@ build_lrouter_in_dnat_flow(struct lflow_table *lflows,
meter_groups);
size_t match_len = match->length;
- ds_put_cstr(match, " && icmp4 && icmp4.type == 3");
+ ds_put_cstr(match,
+ " && icmp4 && icmp4.type == {3, 11, 12}");
ds_clear(actions);
ds_put_format(actions,
diff --git a/ovn-nb.xml b/ovn-nb.xml
index 2fc8543868..33a6dc6765 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -5470,28 +5470,28 @@ or
A stateless DNAT rule rewrites only the outer IPv4 destination of
inbound packets. For an inbound ICMPv4 error (for example a
<code>Fragmentation Needed</code> message generated for Path MTU
- discovery, RFC 1191), the original packet embedded in the ICMP
- payload still carries the external, post-NAT IP as its source.
- When the error reaches the downstream logical switch, conntrack
- cannot correlate the embedded tuple with the tracked outgoing
- flow, the packet is marked <code>ct.inv</code> and dropped by the
- ACL stage, and PMTU discovery breaks.
+ discovery, RFC 1191, or a <code>Time Exceeded</code> message
+ generated for a traceroute probe), the original packet embedded in
+ the ICMP payload still carries the external, post-NAT IP as its
+ source. When the error reaches the downstream logical switch,
+ conntrack cannot correlate the embedded tuple with the tracked
+ outgoing flow, the packet is marked <code>ct.inv</code> and dropped
+ by the ACL stage, and PMTU discovery and traceroute break.
</p>
<p>
When this option is <code>true</code>, <code>ovn-northd</code>
emits an additional, higher-priority logical flow in the router
- ingress DNAT stage that matches ICMPv4 Destination Unreachable
- (type 3) errors destined to the external IP. Every type-3 code
- quotes the original datagram, so this covers
- <code>Fragmentation Needed</code> (code 4) for PMTU discovery as
- well as host/port unreachable and the rest. It rewrites the outer
- IPv4 destination to the logical IP and, using the
- <code>icmp4.inner_ip4.src</code> action, un-NATs the embedded
+ ingress DNAT stage that matches the ICMPv4 errors which quote the
+ original datagram - Destination Unreachable (type 3, which includes
+ <code>Fragmentation Needed</code>), Time Exceeded (type 11) and
+ Parameter Problem (type 12) - destined to the external IP. It
+ rewrites the outer IPv4 destination to the logical IP and, using
+ the <code>icmp4.inner_ip4.src</code> action, un-NATs the embedded
inner IPv4 source from the external IP back to the logical IP, so
- that conntrack can correlate the error and PMTU discovery works
- end-to-end. Set it to <code>false</code> to suppress this flow for
- an individual NAT entry.
+ that conntrack can correlate the error. This makes PMTU discovery
+ and traceroute work end-to-end. Set it to <code>false</code> to
+ suppress this flow for an individual NAT entry.
</p>
</column>
diff --git a/tests/ovn.at b/tests/ovn.at
index e43579cdd7..d67ca2ead5 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -22816,6 +22816,29 @@ as hv1 reset_pcap_file hv1-vif1 hv1/vif1
check as hv1 ovs-appctl netdev-dummy/receive hv1-vif2 $packet
+# ICMP Time Exceeded error from a transit router to the NAT external IP.
+# The embedded original packet is the VM's outbound probe after stateless
+# SNAT, so its inner source is the external IP and its inner destination is
+# the traceroute target (50.0.0.100).
+packet=$(fmt_pkt "Ether(dst='$rtr_ext_mac', src='$ext1_mac')/ \
+ IP(src='$ext1_ip', dst='$nat_ext_ip', ttl=64)/ \
+ ICMP(type=11, code=0)/ \
+ IP(src='$nat_ext_ip', dst='50.0.0.100', ttl=1, proto=17)/ \
+ bytes.fromhex('$inner_l4')")
+
+# Expected packet delivered to vm1: outer destination DNATed to 10.0.0.3,
+# outer TTL decremented, L2 rewritten, and (via pinctrl) the inner IPv4
+# source un-NATed to 10.0.0.3. The inner destination (50.0.0.100) is left
+# unchanged.
+expected=$(fmt_pkt "Ether(dst='$vm1_mac', src='$rtr_int_mac')/ \
+ IP(src='$ext1_ip', dst='$vm1_ip', ttl=63)/ \
+ ICMP(type=11, code=0)/ \
+ IP(src='$vm1_ip', dst='50.0.0.100', ttl=1, proto=17)/ \
+ bytes.fromhex('$inner_l4')")
+echo $expected >> vif1.expected
+
+check as hv1 ovs-appctl netdev-dummy/receive hv1-vif2 $packet
+
OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [vif1.expected])
OVN_CLEANUP([hv1])
--
2.54.0
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev