On Tue,  8 Jul 2025 13:34:02 +0200
Ilya Maximets <i.maxim...@ovn.org> wrote:

> When a packet enters the kernel datapath and there is no flow to handle
> it, the packet goes to userspace through a MISS upcall.  With the
> per-CPU upcall dispatch mechanism, we use the current CPU id to select
> the Netlink PID on which to send this packet.  This allows us to send
> packets from the same traffic flow through the same handler.
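> 
> As a rough illustration, per-CPU dispatch boils down to indexing an
> array of per-CPU Netlink PIDs by the id of the CPU currently handling
> the packet (a simplified sketch with illustrative names, not the
> actual kernel code):
> 
>     /* Simplified sketch of per-CPU upcall dispatch (kernel context,
>      * so smp_processor_id() comes from <linux/smp.h>). */
>     static u32 select_upcall_pid(const u32 *pids, u32 n_pids)
>     {
>         /* Same CPU -> same PID -> same userspace handler thread. */
>         return pids[smp_processor_id() % n_pids];
>     }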
> 
> The handler will process the packet, install the required flow into the
> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
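> 
> Conceptually, one iteration of a handler thread looks like this
> (a sketch; dpif_recv() is the real OVS API, while install_flow()
> and reinject_packet() are hypothetical helpers standing in for the
> flow-put and execute steps):
> 
>     /* Sketch: one iteration of a userspace handler thread. */
>     static void handler_iteration(struct dpif *dpif, uint32_t handler_id)
>     {
>         struct dpif_upcall upcall;
>         struct ofpbuf buf;
> 
>         if (!dpif_recv(dpif, handler_id, &upcall, &buf)) {
>             install_flow(dpif, &upcall);     /* Flow goes in first...    */
>             reinject_packet(dpif, &upcall);  /* ...then the packet, via
>                                               * OVS_PACKET_CMD_EXECUTE.  */
>         }
>     }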
> 
> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
> recirculation action that will pass the (likely modified) packet
> through the flow lookup again.  And if the flow is not found, the
> packet will be sent to userspace again through another MISS upcall.
> 
> However, the handler thread in userspace is likely running on a
> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
> in the syscall context of that thread.  So, when the time comes to
> send the packet through another upcall, the per-CPU dispatch will
> choose a different Netlink PID, and this packet will end up processed
> by a different handler thread on a different CPU.
> 
> The process continues as long as there are new recirculations; each
> time, the packet goes to a different handler thread before it is sent
> out of the OVS datapath to the destination port.  In real setups the
> number of recirculations can go up to 4 or 5, sometimes more.
> 
> There is always a chance of packet reordering while processing upcalls,
> because userspace will first install the flow and then re-inject the
> original packet.  So, there is a race window in which the flow is
> already installed and a second packet can match it inside the kernel
> and be forwarded to the destination before the first packet is
> re-injected.  But the fact that packets go through multiple upcalls
> handled by different userspace threads makes the reordering noticeably
> more likely, because we not only have a race between the kernel and a
> userspace handler (which is hard to avoid), but also between multiple
> userspace handlers.
> 
> For example, let's assume that 10 packets get enqueued through a MISS
> upcall for handler-1.  It starts processing them, installs the flow
> into the kernel and starts re-injecting the packets, from where they
> go through another MISS to handler-2.  Handler-2 installs the flow
> into the kernel and starts re-injecting the packets.  While handler-1
> is still re-injecting the last of the 10 packets, those hit the flow
> installed by handler-2 and are forwarded without going to handler-2,
> even though handler-2 is still re-injecting the first of these 10
> packets.  Given multiple recirculations and misses, these 10 packets
> may end up completely mixed up on the output from the datapath.
> 
> Let's provide the original upcall PID via the new Netlink attribute
> OVS_PACKET_ATTR_UPCALL_PID.  This way, an upcall triggered during the
> execution will go to the same handler.  Packets will be enqueued to
> the same socket and re-injected in the same order.  This doesn't
> eliminate the reordering described above, since we still have a race
> between the kernel and the handler thread, but it does eliminate the
> races between multiple handlers.
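> 
> On the kernel side this means preferring the PID supplied with the
> command over the dispatch-based choice, roughly as follows (an
> illustrative sketch with simplified names; see the net-next commit
> linked below for the real code):
> 
>     /* Sketch: upcall PID selection while executing a packet. */
>     static u32 choose_upcall_pid(const struct datapath *dp, u32 execute_pid)
>     {
>         if (execute_pid)            /* OVS_PACKET_ATTR_UPCALL_PID given. */
>             return execute_pid;     /* Stick with the original handler.  */
>         return ovs_dp_get_upcall_portid(dp, smp_processor_id());
>     }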
> 
> The openvswitch kernel module ignores unknown attributes in
> OVS_PACKET_CMD_EXECUTE requests, so it's safe to provide the new
> attribute even on older kernels.
> 
> Reported-at: https://issues.redhat.com/browse/FDP-1479
> Link: https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maxim...@ovn.org/
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org> 
> ---
> 
> Version 1:
>   * No changes since RFC.  The kernel change got merged into net-next:
>       https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maxim...@ovn.org/
>     Normally, we would wait for it to be in Linus' tree, but that will
>     not happen before the branching.


Acked-by: Flavio Leitner <f...@sysclose.org>

> 
>  include/linux/openvswitch.h   | 6 ++++++
>  lib/dpif-netlink.c            | 7 +++++++
>  lib/dpif.h                    | 3 +++
>  ofproto/ofproto-dpif-upcall.c | 5 +++++
>  4 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
> index e09ed2b2f..e69fed4b7 100644
> --- a/include/linux/openvswitch.h
> +++ b/include/linux/openvswitch.h
> @@ -202,6 +202,11 @@ enum ovs_packet_cmd {
>   * @OVS_PACKET_ATTR_LEN: Packet size before truncation.
>   * %OVS_PACKET_ATTR_USERSPACE action specify the Maximum received fragment
>   * size.
> + * @OVS_PACKET_ATTR_UPCALL_PID: Netlink PID to use for upcalls while
> + * processing %OVS_PACKET_CMD_EXECUTE.  Takes precedence over all other ways
> + * to determine the Netlink PID including %OVS_USERSPACE_ATTR_PID,
> + * %OVS_DP_ATTR_UPCALL_PID, %OVS_DP_ATTR_PER_CPU_PIDS and the
> + * %OVS_VPORT_ATTR_UPCALL_PID.
>   *
>   * These attributes follow the &struct ovs_header within the Generic Netlink
>   * payload for %OVS_PACKET_* commands.
> @@ -221,6 +226,7 @@ enum ovs_packet_attr {
>       OVS_PACKET_ATTR_MRU,        /* Maximum received IP fragment size. */
>       OVS_PACKET_ATTR_LEN,        /* Packet size before truncation. */
>       OVS_PACKET_ATTR_HASH,       /* Packet hash. */
> +     OVS_PACKET_ATTR_UPCALL_PID, /* u32 Netlink PID. */
>       __OVS_PACKET_ATTR_MAX
>  };
>  
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index f8850181d..7587c9c3e 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -2021,6 +2021,10 @@ dpif_netlink_encode_execute(int dp_ifindex, const struct dpif_execute *d_exec,
>      if (d_exec->hash) {
>          nl_msg_put_u64(buf, OVS_PACKET_ATTR_HASH, d_exec->hash);
>      }
> +
> +    if (d_exec->upcall_pid) {
> +        nl_msg_put_u32(buf, OVS_PACKET_ATTR_UPCALL_PID, d_exec->upcall_pid);
> +    }
>  }
>  
>  /* Executes, against 'dpif', up to the first 'n_ops' operations in 'ops'.
> @@ -2987,6 +2991,7 @@ dpif_netlink_recv_windows(struct dpif_netlink *dpif, uint32_t handler_id,
>  
>              error = parse_odp_packet(buf, upcall, &dp_ifindex);
>              if (!error && dp_ifindex == dpif->dp_ifindex) {
> +                upcall->pid = 0;
>                  return 0;
>              } else if (error) {
>                  return error;
> @@ -3037,6 +3042,7 @@ dpif_netlink_recv_cpu_dispatch(struct dpif_netlink *dpif, uint32_t handler_id,
>  
>          error = parse_odp_packet(buf, upcall, &dp_ifindex);
>          if (!error && dp_ifindex == dpif->dp_ifindex) {
> +            upcall->pid = nl_sock_pid(handler->sock);
>              return 0;
>          } else if (error) {
>              return error;
> @@ -3113,6 +3119,7 @@ dpif_netlink_recv_vport_dispatch(struct dpif_netlink *dpif,
>  
>              error = parse_odp_packet(buf, upcall, &dp_ifindex);
>              if (!error && dp_ifindex == dpif->dp_ifindex) {
> +                upcall->pid = nl_sock_pid(ch->sock);
>                  return 0;
>              } else if (error) {
>                  return error;
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 6bef7d5b3..0b685d8df 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -732,6 +732,8 @@ struct dpif_execute {
>      unsigned int mtu;               /* Maximum transmission unit to fragment.
>                                         0 if not a fragmented packet */
>      uint64_t hash;                  /* Packet flow hash. 0 if not specified. */
> +    uint64_t upcall_pid;            /* Netlink PID to use for upcalls.
> +                                     * 0 if not specified. */
>      const struct flow *flow;         /* Flow extracted from 'packet'. */
>  
>      /* Input, but possibly modified as a side effect of execution. */
> @@ -833,6 +835,7 @@ struct dpif_upcall {
>      struct nlattr *mru;         /* Maximum receive unit. */
>      struct nlattr *hash;        /* Packet hash. */
>      struct nlattr *cutlen;      /* Number of bytes shrink from the end. */
> +    uint32_t pid;               /* Socket PID the upcall was received from. */
>  
>      /* DPIF_UC_ACTION only. */
>      struct nlattr *userdata;    /* Argument to OVS_ACTION_ATTR_USERSPACE. */
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index 53e59580d..9dfa52d82 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -230,6 +230,8 @@ struct upcall {
>      uint16_t mru;                  /* If !0, Maximum receive unit of
>                                        fragmented IP packet */
>      uint64_t hash;
> +    uint32_t pid;                  /* Socket PID this upcall was received from,
> +                                    * or zero. */
>  
>      enum upcall_type type;         /* Type of the upcall. */
>      const struct nlattr *actions;  /* Flow actions in DPIF_UC_ACTION Upcalls. */
> @@ -932,6 +934,7 @@ recv_upcalls(struct handler *handler)
>          upcall->key_len = dupcall->key_len;
>          upcall->ufid = &dupcall->ufid;
>          upcall->hash = hash;
> +        upcall->pid = dupcall->pid;
>  
>          upcall->out_tun_key = dupcall->out_tun_key;
>          upcall->actions = dupcall->actions;
> @@ -1271,6 +1274,7 @@ upcall_receive(struct upcall *upcall, const struct dpif_backer *backer,
>      upcall->key = NULL;
>      upcall->key_len = 0;
>      upcall->mru = mru;
> +    upcall->pid = 0;
>  
>      upcall->out_tun_key = NULL;
>      upcall->actions = NULL;
> @@ -1730,6 +1734,7 @@ handle_upcalls(struct udpif *udpif, struct upcall *upcalls,
>              op->dop.execute.probe = false;
>              op->dop.execute.mtu = upcall->mru;
>              op->dop.execute.hash = upcall->hash;
> +            op->dop.execute.upcall_pid = upcall->pid;
>          }
>      }
>  
