From: Numan Siddique <[email protected]>

The stateless_icmp_helper flow added previously matched ICMPv4
Destination Unreachable errors (type 3), so it fixed Path MTU discovery
and the other unreachable cases, but left the remaining ICMPv4 error
types un-NATed.  Traceroute, which relies on Time Exceeded (type 11)
replies from intermediate hops, still hit the ct.inv drop in the
downstream logical switch: the error quotes the VM's post-SNAT datagram,
so the embedded inner source is the external IP and conntrack cannot
correlate it with the tracked outgoing probe.

The inner-source un-NAT performed by icmp4.inner_ip4.src is correct for
every ICMPv4 error that quotes the original datagram.  Broaden the
helper's match from "icmp4.type == 3" to "icmp4.type == {3, 11, 12}",
adding Time Exceeded (type 11, traceroute) and Parameter Problem
(type 12) to the Destination Unreachable errors already handled.
Redirect (type 5) is intentionally excluded: it has special NAT
semantics and is not an error that needs flow correlation.

The pinctrl-side action is already ICMP-type-agnostic, so no controller
change is needed; this is purely a northd match change.

Assisted-by: Claude Opus 4.8, Claude Code
Signed-off-by: Numan Siddique <[email protected]>
---
 Documentation/ref/ovn-logical-flows.7.rst | 30 +++++++++++----------
 NEWS                                      |  8 +++---
 northd/northd.c                           | 20 ++++++++------
 ovn-nb.xml                                | 32 +++++++++++------------
 tests/ovn.at                              | 23 ++++++++++++++++
 5 files changed, 72 insertions(+), 41 deletions(-)

diff --git a/Documentation/ref/ovn-logical-flows.7.rst 
b/Documentation/ref/ovn-logical-flows.7.rst
index 56e7b3ef00..121791423a 100644
--- a/Documentation/ref/ovn-logical-flows.7.rst
+++ b/Documentation/ref/ovn-logical-flows.7.rst
@@ -2778,16 +2778,17 @@ flows do not get programmed for load balancers with 
IPv6 *VIPs*.
   For an IPv4 stateless ``dnat_and_snat`` rule that has
   ``options:stateless_icmp_helper`` set to ``true`` (the default), an
   additional flow at priority *P + 1* is added that matches ``ip && ip4.dst
-  == A && icmp4 && icmp4.type == 3`` with the action
+  == A && icmp4 && icmp4.type == {3, 11, 12}`` with the action
   ``ip4.dst = B; icmp4.inner_ip4.src = B; next;``, where *P* is the priority of
   the flow above. This rewrites the outer destination and un-NATs the source
-  embedded in the inbound ICMPv4 Destination Unreachable error payload (from
-  the external IP *A* back to the logical IP *B*) - every type-3 code quotes
-  the original datagram, including ``Fragmentation Needed`` (code 4) - so that
+  embedded in the inbound ICMPv4 error payload (from the external IP *A* back
+  to the logical IP *B*) for every error type that quotes the original
+  datagram - Destination Unreachable (type 3, which includes ``Fragmentation
+  Needed``), Time Exceeded (type 11) and Parameter Problem (type 12) - so that
   conntrack in the downstream logical switch can correlate the error with the
-  tracked outgoing flow and Path MTU discovery (RFC 1191) works end-to-end
-  across stateless NAT. See ``options:stateless_icmp_helper`` in the ``NAT``
-  table of the
+  tracked outgoing flow. This makes Path MTU discovery (RFC 1191) and
+  traceroute work end-to-end across stateless NAT. See
+  ``options:stateless_icmp_helper`` in the ``NAT`` table of the
   ``OVN_Northbound`` database (``ovn-nb`` (5)). The priority is *P + 1* so that
   the ``exempted_ext_ips`` bypass flow (at *P + 2*) still wins for traffic
   excluded from NAT, and non-ICMP traffic falls through to the regular
@@ -2836,14 +2837,15 @@ the egress pipeline.
   For an IPv4 stateless ``dnat_and_snat`` rule that has
   ``options:stateless_icmp_helper`` set to ``true`` (the default), an
   additional priority-101 flow is added that matches ``ip && ip4.dst == B &&
-  inport == GW && icmp4 && icmp4.type == 3`` with the action
+  inport == GW && icmp4 && icmp4.type == {3, 11, 12}`` with the action
   ``ip4.dst = B; icmp4.inner_ip4.src = B; next;``. This rewrites the outer
-  destination and un-NATs the source embedded in the inbound ICMPv4
-  Destination Unreachable error payload (back to the logical IP *B*) - every
-  type-3 code quotes the original datagram, including ``Fragmentation Needed``
-  (code 4) - so that conntrack in the downstream logical switch can correlate
-  the error with the tracked outgoing flow and Path MTU discovery (RFC 1191)
-  works end-to-end across stateless NAT. See
+  destination and un-NATs the source embedded in the inbound ICMPv4 error
+  payload (back to the logical IP *B*) for every error type that quotes the
+  original datagram - Destination Unreachable (type 3, which includes
+  ``Fragmentation Needed``), Time Exceeded (type 11) and Parameter Problem
+  (type 12) - so that conntrack in the downstream logical switch can correlate
+  the error with the tracked outgoing flow. This makes Path MTU discovery
+  (RFC 1191) and traceroute work end-to-end across stateless NAT. See
   ``options:stateless_icmp_helper`` in the ``NAT`` table of the
   ``OVN_Northbound`` database (``ovn-nb`` (5)).
 
diff --git a/NEWS b/NEWS
index 18374dc71b..aeb1cc5c3f 100644
--- a/NEWS
+++ b/NEWS
@@ -37,9 +37,11 @@ Post v26.03.0
      IPv4 address embedded in an ICMPv4 error's inner packet.  ovn-northd
      uses it for stateless "dnat_and_snat" rules, controlled by the new
      "options:stateless_icmp_helper" NAT option (default true), so that
-     inbound ICMPv4 Destination Unreachable errors (type 3, including the
-     "fragmentation needed" message) generated by an external router are
-     un-NATed correctly and Path MTU discovery works through stateless NAT.
+     inbound ICMPv4 errors that quote the original datagram - Destination
+     Unreachable (type 3, including "fragmentation needed"), Time Exceeded
+     (type 11) and Parameter Problem (type 12) - generated by an external
+     router are un-NATed correctly.  This makes Path MTU discovery and
+     traceroute work through stateless NAT.
 
 OVN v26.03.0 - xxx xx xxxx
 --------------------------
diff --git a/northd/northd.c b/northd/northd.c
index 3bb9cafaac..e0820dbfb4 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -17669,13 +17669,16 @@ build_lrouter_in_dnat_flow(struct lflow_table *lflows,
      * the inner ip4.src back to the logical IP, that lookup fails and the
      * error is dropped as ct.inv.
      *
-     * Emit a higher-priority flow that matches the same external IP plus any
-     * ICMPv4 Destination Unreachable error (type 3) and rewrites the outer
-     * ip4.dst and the embedded inner ip4.src to the logical IP.  Every type-3
-     * code quotes the original datagram (RFC 792), so the inner-source un-NAT
-     * is correct for all of them; this covers Fragmentation Needed (code 4,
-     * for PMTUD per RFC 1191) as well as host/port unreachable and the rest,
-     * so PMTUD works end-to-end through stateless NAT. */
+     * Emit a higher-priority flow that matches the same external IP plus
+     * any ICMPv4 error that quotes the original datagram - Destination
+     * Unreachable (type 3, which includes code 4 Fragmentation Needed for
+     * PMTUD), Time Exceeded (type 11, used by traceroute) and Parameter
+     * Problem (type 12) - and rewrites the outer ip4.dst and the embedded
+     * inner ip4.src to the logical IP.  The inner-source un-NAT is correct
+     * for every such error, so PMTUD, traceroute and the other ICMP errors
+     * all work end-to-end through stateless NAT.  Redirect (type 5) is
+     * intentionally excluded: it has special NAT semantics and is not an
+     * error that needs flow correlation. */
     if (stateless && !is_v6 &&
         smap_get_bool(&nat_entry->nb->options, "stateless_icmp_helper",
                       true)) {
@@ -17684,7 +17687,8 @@ build_lrouter_in_dnat_flow(struct lflow_table *lflows,
                                                  meter_groups);
         size_t match_len = match->length;
 
-        ds_put_cstr(match, " && icmp4 && icmp4.type == 3");
+        ds_put_cstr(match,
+                    " && icmp4 && icmp4.type == {3, 11, 12}");
 
         ds_clear(actions);
         ds_put_format(actions,
diff --git a/ovn-nb.xml b/ovn-nb.xml
index 2fc8543868..33a6dc6765 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -5470,28 +5470,28 @@ or
         A stateless DNAT rule rewrites only the outer IPv4 destination of
         inbound packets. For an inbound ICMPv4 error (for example a
         <code>Fragmentation Needed</code> message generated for Path MTU
-        discovery, RFC 1191), the original packet embedded in the ICMP
-        payload still carries the external, post-NAT IP as its source.
-        When the error reaches the downstream logical switch, conntrack
-        cannot correlate the embedded tuple with the tracked outgoing
-        flow, the packet is marked <code>ct.inv</code> and dropped by the
-        ACL stage, and PMTU discovery breaks.
+        discovery, RFC 1191, or a <code>Time Exceeded</code> message
+        generated for a traceroute probe), the original packet embedded in
+        the ICMP payload still carries the external, post-NAT IP as its
+        source. When the error reaches the downstream logical switch,
+        conntrack cannot correlate the embedded tuple with the tracked
+        outgoing flow, the packet is marked <code>ct.inv</code> and dropped
+        by the ACL stage, and PMTU discovery and traceroute break.
       </p>
 
       <p>
         When this option is <code>true</code>, <code>ovn-northd</code>
         emits an additional, higher-priority logical flow in the router
-        ingress DNAT stage that matches ICMPv4 Destination Unreachable
-        (type 3) errors destined to the external IP. Every type-3 code
-        quotes the original datagram, so this covers
-        <code>Fragmentation Needed</code> (code 4) for PMTU discovery as
-        well as host/port unreachable and the rest. It rewrites the outer
-        IPv4 destination to the logical IP and, using the
-        <code>icmp4.inner_ip4.src</code> action, un-NATs the embedded
+        ingress DNAT stage that matches the ICMPv4 errors which quote the
+        original datagram - Destination Unreachable (type 3, which includes
+        <code>Fragmentation Needed</code>), Time Exceeded (type 11) and
+        Parameter Problem (type 12) - destined to the external IP. It
+        rewrites the outer IPv4 destination to the logical IP and, using
+        the <code>icmp4.inner_ip4.src</code> action, un-NATs the embedded
         inner IPv4 source from the external IP back to the logical IP, so
-        that conntrack can correlate the error and PMTU discovery works
-        end-to-end. Set it to <code>false</code> to suppress this flow for
-        an individual NAT entry.
+        that conntrack can correlate the error. This makes PMTU discovery
+        and traceroute work end-to-end. Set it to <code>false</code> to
+        suppress this flow for an individual NAT entry.
       </p>
     </column>
 
diff --git a/tests/ovn.at b/tests/ovn.at
index e43579cdd7..d67ca2ead5 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -22816,6 +22816,29 @@ as hv1 reset_pcap_file hv1-vif1 hv1/vif1
 
 check as hv1 ovs-appctl netdev-dummy/receive hv1-vif2 $packet
 
+# ICMP Time Exceeded error from a transit router to the NAT external IP.
+# The embedded original packet is the VM's outbound probe after stateless
+# SNAT, so its inner source is the external IP and its inner destination is
+# the traceroute target (50.0.0.100).
+packet=$(fmt_pkt "Ether(dst='$rtr_ext_mac', src='$ext1_mac')/ \
+                  IP(src='$ext1_ip', dst='$nat_ext_ip', ttl=64)/ \
+                  ICMP(type=11, code=0)/ \
+                  IP(src='$nat_ext_ip', dst='50.0.0.100', ttl=1, proto=17)/ \
+                  bytes.fromhex('$inner_l4')")
+
+# Expected packet delivered to vm1: outer destination DNATed to 10.0.0.3,
+# outer TTL decremented, L2 rewritten, and (via pinctrl) the inner IPv4
+# source un-NATed to 10.0.0.3.  The inner destination (50.0.0.100) is left
+# unchanged.
+expected=$(fmt_pkt "Ether(dst='$vm1_mac', src='$rtr_int_mac')/ \
+                    IP(src='$ext1_ip', dst='$vm1_ip', ttl=63)/ \
+                    ICMP(type=11, code=0)/ \
+                    IP(src='$vm1_ip', dst='50.0.0.100', ttl=1, proto=17)/ \
+                    bytes.fromhex('$inner_l4')")
+echo $expected >> vif1.expected
+
+check as hv1 ovs-appctl netdev-dummy/receive hv1-vif2 $packet
+
 OVN_CHECK_PACKETS([hv1/vif1-tx.pcap], [vif1.expected])
 
 OVN_CLEANUP([hv1])
-- 
2.54.0

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to