From: Numan Siddique <[email protected]>

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When a logical router uses stateless NAT (a dnat_and_snat rule with
options:stateless=true, i.e. a "stateless floating IP"), the NAT is a
pure header rewrite with no connection tracking: outbound packets get
their source rewritten to the external IP, and inbound packets get their
destination rewritten back to the logical IP.

This works for ordinary traffic, but it breaks inbound ICMPv4 errors.
Such an error embeds a copy of the original datagram that triggered it.
For traffic the workload sent out, that embedded datagram has already
been SNATed, so its inner source is the external (post-NAT) IP.  When the
error reaches the workload's logical switch, conntrack tries to correlate
the embedded tuple with the tracked outgoing flow; the inner source is
the external IP rather than the logical IP, so the lookup fails, the
packet is marked ct.inv, and the ACL stage drops it.  The workload never
sees the error.  This breaks Path MTU discovery (RFC 1191, "fragmentation
needed" errors) and traceroute ("time exceeded" errors), among others.

With stateful NAT this "just works", because conntrack rewrites the
embedded header of related ICMP errors automatically.  Stateless NAT has
no such machinery, so the inner header must be fixed up explicitly.

This series adds that fix-up:

  - A new OVN action field, icmp4.inner_ip4.src, rewrites the source
    address of the IPv4 header embedded in an ICMPv4 error.  Because it
    mutates bytes inside the ICMP payload, it is implemented as a
    controller (pinctrl) action.  It fixes the inner IPv4 header checksum
    incrementally; no outer ICMP checksum update is needed, because the
    inner-source and inner-IP-checksum deltas cancel within the ICMP
    payload (see patch 2 for the details).

  - ovn-northd emits this action, for each IPv4 stateless dnat_and_snat
    rule, in a higher-priority router-ingress DNAT flow that matches the
    inbound ICMPv4 errors which quote the original datagram - Destination
    Unreachable (type 3, all codes, including code 4 Fragmentation
    Needed), Time Exceeded (type 11) and Parameter Problem (type 12) - and
    un-NATs the embedded inner source back to the logical IP.  Every one
    of those codes/types quotes the datagram (RFC 792), so the rewrite is
    correct for all of them.  It is gated by a per-NAT option,
    options:stateless_icmp_helper, which defaults to true, and the flow is
    emitted with the router's icmp4-error CoPP meter so the controller
    punt is rate-limited.

Why this is needed
------------------

OVN already has the gateway_mtu feature, where ovn-controller itself
generates the ICMP "fragmentation needed" error when a packet exceeds the
configured logical router port MTU.  In that path the error originates
inside OVN with correct addressing, so no inner rewrite is required.

This series targets deployments where the CMS does NOT use gateway_mtu
and the ICMP error is instead generated by an external router on the path
- for example a provider-edge (PE) router beyond the OVN gateway.  That
error arrives at OVN from the outside, destined to the stateless NAT
external IP, with its embedded inner source still carrying the external
IP.  Only OVN can un-NAT that inner source back to the logical IP, which
is exactly what icmp4.inner_ip4.src does.

Only IPv4 is wired up; IPv6 stateless NAT is not currently supported in
OVN.

Testing
-------

  - ovn-northd flow-generation tests covering presence/absence of the
    helper flow (options:stateless_icmp_helper), the resulting flow
    counts, and that the flow carries the icmp4-error CoPP meter.
  - ovn.at end-to-end tests that inject an ICMPv4 Fragmentation Needed
    (PMTUD) error and an ICMPv4 Time Exceeded (traceroute) error through a
    stateless dnat_and_snat router and verify the embedded inner source is
    un-NATed (and the inner destination is left untouched).
  - A fake-multinode (tests/multinode.at) test reproducing the full
    PMTUD scenario: an external namespace acting as a router emits the
    frag-needed error, and the workload learns the reduced PMTU only
    because OVN rewrites the inner source.  Verified on a live
    ovn-fake-multinode deployment.

v1 -> v2
--------

  - pinctrl (patch 2): per Ilya's review, fix the inner IPv4 header
    checksum incrementally instead of recomputing it (so a bad inner
    checksum is not laundered), drop the now-unnecessary outer ICMP
    checksum update (the inner-source and inner-csum deltas cancel),
    restructure the bounds checks to use lengths so no out-of-bounds
    pointer is ever formed, drop the redundant offset copies after
    dp_packet_clone(), and skip later IPv4 fragments (which carry no L4
    header).  Documented that the inner L4 checksum is left stale on
    purpose.
  - northd (patch 3): match all ICMPv4 Destination Unreachable codes
    (icmp4.type == 3) rather than only code 4 - every type-3 code quotes
    the original datagram, so restricting to code 4 was unnecessary and
    left host/port unreachable broken.  Also emit the helper flow with the
    router's icmp4-error CoPP meter to rate-limit the punt, with a test
    asserting the meter is applied.
  - northd (patch 4, new): per Xie Liu's suggestion, also match Time
    Exceeded (type 11) and Parameter Problem (type 12), so traceroute and
    the other quote-bearing ICMPv4 errors work through stateless NAT too,
    not just the unreachable cases.  Adds a Time Exceeded end-to-end test.


Numan Siddique (4):
  ovn-fields: Add icmp4.inner_ip4.src to rewrite ICMP inner source.
  pinctrl: Implement put_icmp4_inner_ip4_src action.
  northd: Emit inner-IP rewrite flow for stateless DNAT.
  northd: Extend stateless ICMP helper to more ICMPv4 error types.

 Documentation/ref/ovn-logical-flows.7.rst |  34 ++++++
 NEWS                                      |   9 ++
 controller/pinctrl.c                      | 104 ++++++++++++++++
 include/ovn/actions.h                     |   8 +-
 include/ovn/logical-fields.h              |   9 ++
 lib/actions.c                             |  38 +++++-
 lib/logical-fields.c                      |  11 +-
 northd/northd.c                           |  52 +++++++-
 ovn-nb.xml                                |  38 ++++++
 ovn-sb.xml                                |  27 ++++-
 tests/multinode.at                        |  73 ++++++++++++
 tests/ovn-northd.at                       |  35 +++++-
 tests/ovn.at                              | 139 +++++++++++++++++++++-
 utilities/ovn-trace.c                     |   6 +
 14 files changed, 571 insertions(+), 12 deletions(-)

-- 
2.54.0

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to