When a packet enters the kernel datapath and there is no flow to handle it, the packet goes to userspace through a MISS upcall.  With the per-CPU upcall dispatch mechanism, we're using the current CPU id to select the Netlink PID to which to send this packet.  This allows us to send packets from the same traffic flow through the same handler.
The handler will process the packet, install the required flow into the kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.  While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a recirculation action that passes the (likely modified) packet through the flow lookup again.  If the flow is not found, the packet is sent to userspace again through another MISS upcall.  However, the handler thread in userspace is likely running on a different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled in the syscall context of that thread.  So, when the time comes to send the packet through another upcall, the per-CPU dispatch will choose a different Netlink PID, and the packet will end up processed by a different handler thread on a different CPU.  This repeats as long as there are new recirculations: each time, the packet goes to a different handler thread before it finally leaves the OVS datapath through the destination port.  In real setups the number of recirculations can go up to 4 or 5, sometimes more.

There is always a chance to re-order packets while processing upcalls, because userspace first installs the flow and then re-injects the original packet.  So, there is a race window in which the flow is already installed and a second packet can match it inside the kernel and be forwarded to the destination before the first packet is re-injected.  But the fact that packets go through multiple upcalls handled by different userspace threads makes reordering noticeably more likely, because we not only have a race between the kernel and a userspace handler (which is hard to avoid), but also between multiple userspace handlers.

For example, assume that 10 packets got enqueued through a MISS upcall for handler-1.  It will start processing them, install the flow into the kernel and start re-injecting the packets, from where they will go through another MISS to handler-2.
Handler-2 will install the flow into the kernel and start re-injecting the packets.  Meanwhile, handler-1 continues re-injecting the last of the 10 packets; they will hit the flow installed by handler-2 and be forwarded without going to handler-2 at all, while handler-2 is still re-injecting the first of these 10 packets.  Given multiple recirculations and misses, these 10 packets may end up completely mixed up on the output from the datapath.

Let's provide the original upcall PID via the new netlink attribute OVS_PACKET_ATTR_UPCALL_PID.  This way, upcalls triggered during the execution will go to the same handler.  Packets will be enqueued to the same socket and re-injected in the same order.  This doesn't eliminate the re-ordering described above, since we still have a race between the kernel and the handler thread, but it does eliminate races between multiple handlers.  The openvswitch kernel module ignores unknown attributes for OVS_PACKET_CMD_EXECUTE, so it's safe to provide it even on older kernels.

Reported-at: https://issues.redhat.com/browse/FDP-1479
Link: https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maxim...@ovn.org/
Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
---
Version 1:
  * No changes since RFC.  The kernel change got merged into net-next:
    https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maxim...@ovn.org/
    Normally, we would wait for it to be in Linus' tree, but that will
    not happen before the branching.

 include/linux/openvswitch.h   | 6 ++++++
 lib/dpif-netlink.c            | 7 +++++++
 lib/dpif.h                    | 3 +++
 ofproto/ofproto-dpif-upcall.c | 5 +++++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index e09ed2b2f..e69fed4b7 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -202,6 +202,11 @@ enum ovs_packet_cmd {
  * @OVS_PACKET_ATTR_LEN: Packet size before truncation.
  * %OVS_PACKET_ATTR_USERSPACE action specify the Maximum received fragment
  * size.
+ * @OVS_PACKET_ATTR_UPCALL_PID: Netlink PID to use for upcalls while
+ * processing %OVS_PACKET_CMD_EXECUTE.  Takes precedence over all other ways
+ * to determine the Netlink PID including %OVS_USERSPACE_ATTR_PID,
+ * %OVS_DP_ATTR_UPCALL_PID, %OVS_DP_ATTR_PER_CPU_PIDS and the
+ * %OVS_VPORT_ATTR_UPCALL_PID.
  *
  * These attributes follow the &struct ovs_header within the Generic Netlink
  * payload for %OVS_PACKET_* commands.
@@ -221,6 +226,7 @@ enum ovs_packet_attr {
     OVS_PACKET_ATTR_MRU,         /* Maximum received IP fragment size. */
     OVS_PACKET_ATTR_LEN,         /* Packet size before truncation. */
     OVS_PACKET_ATTR_HASH,        /* Packet hash. */
+    OVS_PACKET_ATTR_UPCALL_PID,  /* u32 Netlink PID. */
     __OVS_PACKET_ATTR_MAX
 };

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index f8850181d..7587c9c3e 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2021,6 +2021,10 @@ dpif_netlink_encode_execute(int dp_ifindex, const struct dpif_execute *d_exec,
     if (d_exec->hash) {
         nl_msg_put_u64(buf, OVS_PACKET_ATTR_HASH, d_exec->hash);
     }
+
+    if (d_exec->upcall_pid) {
+        nl_msg_put_u32(buf, OVS_PACKET_ATTR_UPCALL_PID, d_exec->upcall_pid);
+    }
 }

 /* Executes, against 'dpif', up to the first 'n_ops' operations in 'ops'.
@@ -2987,6 +2991,7 @@ dpif_netlink_recv_windows(struct dpif_netlink *dpif, uint32_t handler_id,
         error = parse_odp_packet(buf, upcall, &dp_ifindex);
         if (!error && dp_ifindex == dpif->dp_ifindex) {
+            upcall->pid = 0;
             return 0;
         } else if (error) {
             return error;
@@ -3037,6 +3042,7 @@ dpif_netlink_recv_cpu_dispatch(struct dpif_netlink *dpif, uint32_t handler_id,
         error = parse_odp_packet(buf, upcall, &dp_ifindex);
         if (!error && dp_ifindex == dpif->dp_ifindex) {
+            upcall->pid = nl_sock_pid(handler->sock);
             return 0;
         } else if (error) {
             return error;
@@ -3113,6 +3119,7 @@ dpif_netlink_recv_vport_dispatch(struct dpif_netlink *dpif,
         error = parse_odp_packet(buf, upcall, &dp_ifindex);
         if (!error && dp_ifindex == dpif->dp_ifindex) {
+            upcall->pid = nl_sock_pid(ch->sock);
             return 0;
         } else if (error) {
             return error;
diff --git a/lib/dpif.h b/lib/dpif.h
index 6bef7d5b3..0b685d8df 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -732,6 +732,8 @@ struct dpif_execute {
     unsigned int mtu;               /* Maximum transmission unit to fragment.
                                      * 0 if not a fragmented packet */
     uint64_t hash;                  /* Packet flow hash. 0 if not specified. */
+    uint64_t upcall_pid;            /* Netlink PID to use for upcalls.
+                                     * 0 if not specified. */
     const struct flow *flow;        /* Flow extracted from 'packet'. */

     /* Input, but possibly modified as a side effect of execution. */
@@ -833,6 +835,7 @@ struct dpif_upcall {
     struct nlattr *mru;         /* Maximum receive unit. */
     struct nlattr *hash;        /* Packet hash. */
     struct nlattr *cutlen;      /* Number of bytes shrink from the end. */
+    uint32_t pid;               /* Socket PID the upcall was received from. */

     /* DPIF_UC_ACTION only. */
     struct nlattr *userdata;    /* Argument to OVS_ACTION_ATTR_USERSPACE.
                                 */
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index 53e59580d..9dfa52d82 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -230,6 +230,8 @@ struct upcall {
     uint16_t mru;                  /* If !0, Maximum receive unit of
                                     * fragmented IP packet */
     uint64_t hash;
+    uint32_t pid;                  /* Socket PID this upcall was received from,
+                                    * or zero. */

     enum upcall_type type;         /* Type of the upcall. */
     const struct nlattr *actions;  /* Flow actions in DPIF_UC_ACTION Upcalls. */
@@ -932,6 +934,7 @@ recv_upcalls(struct handler *handler)
         upcall->key_len = dupcall->key_len;
         upcall->ufid = &dupcall->ufid;
         upcall->hash = hash;
+        upcall->pid = dupcall->pid;

         upcall->out_tun_key = dupcall->out_tun_key;
         upcall->actions = dupcall->actions;
@@ -1271,6 +1274,7 @@ upcall_receive(struct upcall *upcall, const struct dpif_backer *backer,
     upcall->key = NULL;
     upcall->key_len = 0;
     upcall->mru = mru;
+    upcall->pid = 0;

     upcall->out_tun_key = NULL;
     upcall->actions = NULL;
@@ -1730,6 +1734,7 @@ handle_upcalls(struct udpif *udpif, struct upcall *upcalls,
             op->dop.execute.probe = false;
             op->dop.execute.mtu = upcall->mru;
             op->dop.execute.hash = upcall->hash;
+            op->dop.execute.upcall_pid = upcall->pid;
         }
     }

-- 
2.49.0