On Tue, 8 Jul 2025 13:34:02 +0200 Ilya Maximets <i.maxim...@ovn.org> wrote:
> When a packet enters the kernel datapath and there is no flow to
> handle it, the packet goes to userspace through a MISS upcall.  With
> the per-CPU upcall dispatch mechanism, we're using the current CPU id
> to select the Netlink PID on which to send this packet.  This allows
> us to send packets from the same traffic flow through the same
> handler.
>
> The handler will process the packet, install the required flow into
> the kernel and re-inject the original packet via
> OVS_PACKET_CMD_EXECUTE.
>
> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
> recirculation action that will pass the (likely modified) packet
> through the flow lookup again.  And if the flow is not found, the
> packet will be sent to userspace again through another MISS upcall.
>
> However, the handler thread in userspace is likely running on a
> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
> in the syscall context of that thread.  So, when the time comes to
> send the packet through another upcall, the per-CPU dispatch will
> choose a different Netlink PID, and this packet will end up processed
> by a different handler thread on a different CPU.
>
> The process continues as long as there are new recirculations; each
> time, the packet goes to a different handler thread before it is sent
> out of the OVS datapath to the destination port.  In real setups the
> number of recirculations can go up to 4 or 5, sometimes more.
>
> There is always a chance of re-ordering packets while processing
> upcalls, because userspace will first install the flow and then
> re-inject the original packet.  So, there is a race window when the
> flow is already installed and a second packet can match it inside the
> kernel and be forwarded to the destination before the first packet is
> re-injected.
> But the fact that packets are going through multiple upcalls handled
> by different userspace threads makes the re-ordering noticeably more
> likely, because we not only have a race between the kernel and a
> userspace handler (which is hard to avoid), but also between multiple
> userspace handlers.
>
> For example, let's assume that 10 packets got enqueued through a MISS
> upcall for handler-1.  Handler-1 will start processing them, install
> the flow into the kernel and start re-injecting the packets, from
> where they will go through another MISS to handler-2.  Handler-2 will
> install the flow into the kernel and start re-injecting the packets.
> While handler-1 is still re-injecting the last of the 10 packets,
> those packets will hit the flow installed by handler-2 and be
> forwarded without going to handler-2, while handler-2 is still
> re-injecting the first of these 10 packets.  Given multiple
> recirculations and misses, these 10 packets may end up completely
> mixed up on the output from the datapath.
>
> Let's provide the original upcall PID via the new netlink attribute
> OVS_PACKET_ATTR_UPCALL_PID.  This way, an upcall triggered during the
> execution will go to the same handler.  Packets will be enqueued to
> the same socket and re-injected in the same order.  This doesn't
> eliminate re-ordering entirely, as stated above, since we still have
> a race between the kernel and the handler thread, but it allows us to
> eliminate races between multiple handlers.
>
> The openvswitch kernel module ignores unknown attributes for
> OVS_PACKET_CMD_EXECUTE, so it's safe to provide it even on older
> kernels.
>
> Reported-at: https://issues.redhat.com/browse/FDP-1479
> Link: https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maxim...@ovn.org/
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>
> Version 1:
>   * No changes since RFC.
> The kernel change got merged into net-next:
>   https://lore.kernel.org/netdev/20250702155043.2331772-1-i.maxim...@ovn.org/
> Normally, we would wait for it to be in Linus' tree, but it will
> not happen before the branching.

Acked-by: Flavio Leitner <f...@sysclose.org>

>  include/linux/openvswitch.h   | 6 ++++++
>  lib/dpif-netlink.c            | 7 +++++++
>  lib/dpif.h                    | 3 +++
>  ofproto/ofproto-dpif-upcall.c | 5 +++++
>  4 files changed, 21 insertions(+)
>
> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
> index e09ed2b2f..e69fed4b7 100644
> --- a/include/linux/openvswitch.h
> +++ b/include/linux/openvswitch.h
> @@ -202,6 +202,11 @@ enum ovs_packet_cmd {
>   * @OVS_PACKET_ATTR_LEN: Packet size before truncation.
>   * %OVS_PACKET_ATTR_USERSPACE action specify the Maximum received fragment
>   * size.
> + * @OVS_PACKET_ATTR_UPCALL_PID: Netlink PID to use for upcalls while
> + * processing %OVS_PACKET_CMD_EXECUTE.  Takes precedence over all other ways
> + * to determine the Netlink PID including %OVS_USERSPACE_ATTR_PID,
> + * %OVS_DP_ATTR_UPCALL_PID, %OVS_DP_ATTR_PER_CPU_PIDS and the
> + * %OVS_VPORT_ATTR_UPCALL_PID.
>   *
>   * These attributes follow the &struct ovs_header within the Generic Netlink
>   * payload for %OVS_PACKET_* commands.
> @@ -221,6 +226,7 @@ enum ovs_packet_attr {
>      OVS_PACKET_ATTR_MRU,         /* Maximum received IP fragment size. */
>      OVS_PACKET_ATTR_LEN,         /* Packet size before truncation. */
>      OVS_PACKET_ATTR_HASH,        /* Packet hash. */
> +    OVS_PACKET_ATTR_UPCALL_PID,  /* u32 Netlink PID. */
>      __OVS_PACKET_ATTR_MAX
>  };
>
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index f8850181d..7587c9c3e 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -2021,6 +2021,10 @@ dpif_netlink_encode_execute(int dp_ifindex, const struct dpif_execute *d_exec,
>      if (d_exec->hash) {
>          nl_msg_put_u64(buf, OVS_PACKET_ATTR_HASH, d_exec->hash);
>      }
> +
> +    if (d_exec->upcall_pid) {
> +        nl_msg_put_u32(buf, OVS_PACKET_ATTR_UPCALL_PID, d_exec->upcall_pid);
> +    }
>  }
>
>  /* Executes, against 'dpif', up to the first 'n_ops' operations in 'ops'.
> @@ -2987,6 +2991,7 @@ dpif_netlink_recv_windows(struct dpif_netlink *dpif, uint32_t handler_id,
>          error = parse_odp_packet(buf, upcall, &dp_ifindex);
>          if (!error && dp_ifindex == dpif->dp_ifindex) {
> +            upcall->pid = 0;
>              return 0;
>          } else if (error) {
>              return error;
> @@ -3037,6 +3042,7 @@ dpif_netlink_recv_cpu_dispatch(struct dpif_netlink *dpif, uint32_t handler_id,
>          error = parse_odp_packet(buf, upcall, &dp_ifindex);
>          if (!error && dp_ifindex == dpif->dp_ifindex) {
> +            upcall->pid = nl_sock_pid(handler->sock);
>              return 0;
>          } else if (error) {
>              return error;
> @@ -3113,6 +3119,7 @@ dpif_netlink_recv_vport_dispatch(struct dpif_netlink *dpif,
>          error = parse_odp_packet(buf, upcall, &dp_ifindex);
>          if (!error && dp_ifindex == dpif->dp_ifindex) {
> +            upcall->pid = nl_sock_pid(ch->sock);
>              return 0;
>          } else if (error) {
>              return error;
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 6bef7d5b3..0b685d8df 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -732,6 +732,8 @@ struct dpif_execute {
>      unsigned int mtu;               /* Maximum transmission unit to fragment.
>                                       * 0 if not a fragmented packet */
>      uint64_t hash;                  /* Packet flow hash.  0 if not specified. */
> +    uint64_t upcall_pid;            /* Netlink PID to use for upcalls.
> +                                     * 0 if not specified. */
>      const struct flow *flow;        /* Flow extracted from 'packet'. */
>
>      /* Input, but possibly modified as a side effect of execution. */
> @@ -833,6 +835,7 @@ struct dpif_upcall {
>      struct nlattr *mru;          /* Maximum receive unit. */
>      struct nlattr *hash;         /* Packet hash. */
>      struct nlattr *cutlen;       /* Number of bytes shrink from the end. */
> +    uint32_t pid;                /* Socket PID the upcall was received from. */
>
>      /* DPIF_UC_ACTION only. */
>      struct nlattr *userdata;     /* Argument to OVS_ACTION_ATTR_USERSPACE. */
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index 53e59580d..9dfa52d82 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -230,6 +230,8 @@ struct upcall {
>      uint16_t mru;                  /* If !0, Maximum receive unit of
>                                      * fragmented IP packet */
>      uint64_t hash;
> +    uint32_t pid;                  /* Socket PID this upcall was received from,
> +                                    * or zero. */
>
>      enum upcall_type type;         /* Type of the upcall. */
>      const struct nlattr *actions;  /* Flow actions in DPIF_UC_ACTION Upcalls. */
> @@ -932,6 +934,7 @@ recv_upcalls(struct handler *handler)
>          upcall->key_len = dupcall->key_len;
>          upcall->ufid = &dupcall->ufid;
>          upcall->hash = hash;
> +        upcall->pid = dupcall->pid;
>
>          upcall->out_tun_key = dupcall->out_tun_key;
>          upcall->actions = dupcall->actions;
> @@ -1271,6 +1274,7 @@ upcall_receive(struct upcall *upcall, const struct dpif_backer *backer,
>          upcall->key = NULL;
>          upcall->key_len = 0;
>          upcall->mru = mru;
> +        upcall->pid = 0;
>
>          upcall->out_tun_key = NULL;
>          upcall->actions = NULL;
> @@ -1730,6 +1734,7 @@ handle_upcalls(struct udpif *udpif, struct upcall *upcalls,
>          op->dop.execute.probe = false;
>          op->dop.execute.mtu = upcall->mru;
>          op->dop.execute.hash = upcall->hash;
> +        op->dop.execute.upcall_pid = upcall->pid;
>      }
>  }

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev