From: Numan Siddique <[email protected]>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When a logical router uses stateless NAT (a dnat_and_snat rule with
options:stateless=true, i.e. a "stateless floating IP"), the NAT is a
pure header rewrite with no connection tracking: outbound packets get
their source rewritten to the external IP, and inbound packets get their
destination rewritten back to the logical IP.
This works for ordinary traffic, but it breaks inbound ICMPv4 errors.
Such an error embeds a copy of the original datagram that triggered it.
For traffic the workload sent out, that embedded datagram has already
been SNATed, so its inner source is the external (post-NAT) IP. When the
error reaches the workload's logical switch, conntrack tries to correlate
the embedded tuple with the tracked outgoing flow; the inner source is
the external IP rather than the logical IP, so the lookup fails, the
packet is marked ct.inv, and the ACL stage drops it. The workload never
sees the error. This breaks Path MTU discovery (RFC 1191, "fragmentation
needed" errors) and traceroute ("time exceeded" errors), among others.
With stateful NAT this "just works", because conntrack rewrites the
embedded header of related ICMP errors automatically. Stateless NAT has
no such machinery, so the inner header must be fixed up explicitly.
This series adds that fix-up:
- A new OVN action field, icmp4.inner_ip4.src, rewrites the source
address of the IPv4 header embedded in an ICMPv4 error. Because it
mutates bytes inside the ICMP payload, it is implemented as a
controller (pinctrl) action. It fixes the inner IPv4 header checksum
incrementally; no outer ICMP checksum update is needed, because the
inner-source and inner-IP-checksum deltas cancel within the ICMP
payload (see patch 2 for the details).
- ovn-northd emits this action, for each IPv4 stateless dnat_and_snat
rule, in a higher-priority router-ingress DNAT flow that matches the
inbound ICMPv4 errors which quote the original datagram - Destination
Unreachable (type 3, all codes, including code 4 Fragmentation
Needed), Time Exceeded (type 11) and Parameter Problem (type 12) - and
un-NATs the embedded inner source back to the logical IP. Every one
of those codes/types quotes the datagram (RFC 792), so the rewrite is
correct for all of them. It is gated by a per-NAT option,
options:stateless_icmp_helper, which defaults to true, and the flow is
emitted with the router's icmp4-error CoPP meter so the controller
punt is rate-limited.
Why this is needed
------------------
OVN already has the gateway_mtu feature, where ovn-controller itself
generates the ICMP "fragmentation needed" error when a packet exceeds the
configured logical router port MTU. In that path the error originates
inside OVN with correct addressing, so no inner rewrite is required.
This series targets deployments where the CMS does NOT use gateway_mtu
and the ICMP error is instead generated by an external router on the path
- for example a provider-edge (PE) router beyond the OVN gateway. That
error arrives at OVN from the outside, destined to the stateless NAT
external IP, with its embedded inner source still carrying the external
IP. Only OVN can un-NAT that inner source back to the logical IP, which
is exactly what icmp4.inner_ip4.src does.
Only IPv4 is wired up; IPv6 stateless NAT is not currently supported in
OVN.
Testing
-------
- ovn-northd flow-generation tests covering presence/absence of the
helper flow (options:stateless_icmp_helper), the resulting flow
counts, and that the flow carries the icmp4-error CoPP meter.
- ovn.at end-to-end tests that inject an ICMPv4 Fragmentation Needed
(PMTUD) error and an ICMPv4 Time Exceeded (traceroute) error through a
stateless dnat_and_snat router and verify the embedded inner source is
un-NATed (and the inner destination is left untouched).
- A fake-multinode (tests/multinode.at) test reproducing the full
PMTUD scenario: an external namespace acting as a router emits the
frag-needed error, and the workload learns the reduced PMTU only
because OVN rewrites the inner source. Verified on a live
ovn-fake-multinode deployment.
v1 -> v2
--------
- pinctrl (patch 2): per Ilya's review, fix the inner IPv4 header
checksum incrementally instead of recomputing it (so a bad inner
checksum is not laundered), drop the now-unnecessary outer ICMP
checksum update (the inner-source and inner-csum deltas cancel),
restructure the bounds checks to use lengths so no out-of-bounds
pointer is ever formed, drop the redundant offset copies after
dp_packet_clone(), and skip later IPv4 fragments (which carry no L4
header). Documented that the inner L4 checksum is left stale on
purpose.
- northd (patch 3): match all ICMPv4 Destination Unreachable codes
(icmp4.type == 3) rather than only code 4 - every type-3 code quotes
the original datagram, so restricting to code 4 was unnecessary and
left host/port unreachable broken. Also emit the helper flow with the
router's icmp4-error CoPP meter to rate-limit the punt, with a test
asserting the meter is applied.
- northd (patch 4, new): per Xie Liu's suggestion, also match Time
Exceeded (type 11) and Parameter Problem (type 12), so traceroute and
the other quote-bearing ICMPv4 errors work through stateless NAT too,
not just the unreachable cases. Adds a Time Exceeded end-to-end test.
Numan Siddique (4):
ovn-fields: Add icmp4.inner_ip4.src to rewrite ICMP inner source.
pinctrl: Implement put_icmp4_inner_ip4_src action.
northd: Emit inner-IP rewrite flow for stateless DNAT.
northd: Extend stateless ICMP helper to more ICMPv4 error types.
Documentation/ref/ovn-logical-flows.7.rst | 34 ++++++
NEWS | 9 ++
controller/pinctrl.c | 104 ++++++++++++++++
include/ovn/actions.h | 8 +-
include/ovn/logical-fields.h | 9 ++
lib/actions.c | 38 +++++-
lib/logical-fields.c | 11 +-
northd/northd.c | 52 +++++++-
ovn-nb.xml | 38 ++++++
ovn-sb.xml | 27 ++++-
tests/multinode.at | 73 ++++++++++++
tests/ovn-northd.at | 35 +++++-
tests/ovn.at | 139 +++++++++++++++++++++-
utilities/ovn-trace.c | 6 +
14 files changed, 571 insertions(+), 12 deletions(-)
--
2.54.0
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev