On 10/22/25 3:42 PM, Frode Nordahl wrote:
> On 10/22/25 09:57, Frode Nordahl wrote:
>> On 10/20/25 11:59, Ilya Maximets wrote:
>>> On 10/18/25 11:22 AM, Frode Nordahl wrote:
>>>> On 7/8/25 13:34, Ilya Maximets wrote:
>>>>> When a packet enters the kernel datapath and there is no flow to
>>>>> handle it, the packet goes to userspace through a MISS upcall.  With
>>>>> the per-CPU upcall dispatch mechanism, we're using the current CPU id
>>>>> to select the Netlink PID on which to send this packet.  This allows
>>>>> us to send packets from the same traffic flow through the same handler.
>>>>>
>>>>> The handler will process the packet, install the required flow into
>>>>> the kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
>>>>>
>>>>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
>>>>> recirculation action that will pass the (likely modified) packet
>>>>> through the flow lookup again.  And if the flow is not found, the
>>>>> packet will be sent to userspace again through another MISS upcall.
>>>>>
>>>>> However, the handler thread in userspace is likely running on a
>>>>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
>>>>> in the syscall context of that thread.  So, when the time comes to
>>>>> send the packet through another upcall, the per-CPU dispatch will
>>>>> choose a different Netlink PID, and this packet will end up processed
>>>>> by a different handler thread on a different CPU.
>>>>>
>>>>> The process continues as long as there are new recirculations; each
>>>>> time the packet goes to a different handler thread before it is sent
>>>>> out of the OVS datapath to the destination port.  In real setups the
>>>>> number of recirculations can go up to 4 or 5, sometimes more.
>>>>>
>>>>> There is always a chance to re-order packets while processing upcalls,
>>>>> because userspace will first install the flow and then re-inject the
>>>>> original packet.  So, there is a race window when the flow is already
>>>>> installed and the second packet can match it inside the kernel and be
>>>>> forwarded to the destination before the first packet is re-injected.
>>>>> But the fact that packets are going through multiple upcalls handled
>>>>> by different userspace threads makes the reordering noticeably more
>>>>> likely, because we not only have a race between the kernel and a
>>>>> userspace handler (which is hard to avoid), but also between multiple
>>>>> userspace handlers.
>>>>>
>>>>> For example, let's assume that 10 packets got enqueued through a MISS
>>>>> upcall for handler-1.  It will start processing them, install the flow
>>>>> into the kernel and start re-injecting the packets, from where they
>>>>> will go through another MISS to handler-2.  Handler-2 will install
>>>>> the flow into the kernel and start re-injecting the packets.  Meanwhile,
>>>>> as handler-1 continues to re-inject the last of the 10 packets, they
>>>>> will hit the flow installed by handler-2 and be forwarded without going
>>>>> to handler-2, while handler-2 is still re-injecting the first of these
>>>>> 10 packets.  Given multiple recirculations and misses, these 10 packets
>>>>> may end up completely mixed up on the output from the datapath.
>>>>>
>>>>> Let's provide the original upcall PID via the new netlink attribute
>>>>> OVS_PACKET_ATTR_UPCALL_PID.  This way the upcall triggered during the
>>>>> execution will go to the same handler.  Packets will be enqueued to
>>>>> the same socket and re-injected in the same order.
>>>>> This doesn't eliminate re-ordering as stated above, since we still
>>>>> have a race between the kernel and the handler thread, but it allows
>>>>> us to eliminate races between multiple handlers.
>>>>>
>>>>> The openvswitch kernel module ignores unknown attributes for
>>>>> OVS_PACKET_CMD_EXECUTE, so it's safe to provide it even on older
>>>>> kernels.
>>>>>
>>>>> Reported-at: https://issues.redhat.com/browse/FDP-1479
>>>>> Link: https://lore.kernel.org/netdev/[email protected]/
>>>>> Signed-off-by: Ilya Maximets <[email protected]>
>>>>> ---
>>>>
>>>> Hello, Ilya,
>>>>
>>>> There appears to be a regression caused either by the kernel or the
>>>> userspace change on this subject.
>>>>
>>>> While investigating test failures in the OVN system test suite [0]
>>>> starting with the 6.17 kernel, which includes "net: openvswitch: allow
>>>> providing upcall pid for the 'execute' command", I found that reverting
>>>> either the kernel commit or this OVS commit makes the following OVN
>>>> system tests pass again:
>>>>
>>>> "DNAT and SNAT on distributed router - N/S - IPv6"
>>>> "Traffic to router port via LLA"
>>>>
>>>> (Note that there are other tests discussed in that bug that break on a
>>>> different issue; these two are the ones I used in the latest validation
>>>> steps before sending this e-mail.)
>>>>
>>>> I have yet to dive into the details of why this may be the case and I'll
>>>> likely poke some more at this, but I wanted to hear your immediate
>>>> thoughts / insights on what could possibly be the issue.
>>>>
>>>> 0: https://launchpad.net/bugs/2127061
>>>>
>>>
>>> Hi, Frode.  I didn't try the tests, but if packet re-ordering is in
>>> place, I wonder if the root cause for these tests to fail is somewhat
>>> similar to the one described in OVS commit:
>>>   e0ee785b1a67 ("tests: system-traffic: Fix flaky floating IP test.")
>>>
>>> Providing the original upcall pid can make connection processing faster
>>> and cause the last ACK from a 3WHS to arrive very late and trigger an
>>> RST reply.  This reordering of the last ACK has been in OVS forever, or
>>> at least for a very long time, but it may be more visible with the rest
>>> of the traffic no longer reordered.
>>>
>>> A retis capture is needed to confirm.
>>
>> Thanks for your pointer, retis was indeed enlightening.
>>
>> To me there appears to be something nefarious going on, as the packets
>> appear to be dropped because the kernel ends up attempting to transmit
>> upcalls to a Netlink PID on which Open vSwitch is not listening [1].
>>
>> I'll try to track down where this apparently incorrect PID might be
>> coming from.
>>
>> 1: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/2127061/comments/3
>>
>
> OK, I think I found something.  The struct dpif_execute upcall_pid field
> may be used uninitialized in dpif_execute_helper_cb().
>
> This diff appears to fix this for me:
>
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 070fc0131..a062a4179 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1240,6 +1240,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
>      execute.probe = false;
>      execute.mtu = 0;
>      execute.hash = 0;
> +    execute.upcall_pid = 0;
>      aux->error = dpif_execute(aux->dpif, &execute);
>      log_execute_message(aux->dpif, &this_module, &execute,
>                          true, aux->error);
>
> I'll do some further checks and propose a formal patch along these lines
> if you have no objections or other suggestions.
>
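For reference, the initialization pattern at issue can be sketched in
isolation.  The snippet below is a minimal stand-alone sketch, not the real
OVS code: 'struct execute_request' and the build_request_*() helpers are
illustrative stand-ins for struct dpif_execute and dpif_execute_helper_cb();
only the field names mirror the real struct.  It contrasts the explicit
per-field assignment that the diff above completes with a designated
initializer variant that zeroes every member left unnamed, so a field added
to the struct later can never be read as stack garbage.

/*
 * Illustrative sketch only; names are stand-ins for the dpif code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct execute_request {          /* stand-in for struct dpif_execute */
    bool probe;
    int mtu;
    uint32_t hash;
    uint32_t upcall_pid;          /* Netlink PID for upcalls hit during execute. */
};

/* Pattern used by the diff above: every field is assigned explicitly,
 * so nothing is left holding whatever was on the stack. */
static void
build_request_explicit(struct execute_request *req, uint32_t upcall_pid)
{
    req->probe = false;
    req->mtu = 0;
    req->hash = 0;
    req->upcall_pid = upcall_pid;     /* the assignment the fix adds */
}

/* Alternative pattern: a designated initializer zeroes every member that
 * is not named, so forgetting to update this function when a new field is
 * added cannot produce an uninitialized read. */
static struct execute_request
build_request_zeroed(uint32_t upcall_pid)
{
    struct execute_request req = {
        .upcall_pid = upcall_pid,
    };
    return req;
}

int
main(void)
{
    struct execute_request a;
    build_request_explicit(&a, 0);

    struct execute_request b = build_request_zeroed(0);

    printf("explicit: upcall_pid=%u, zeroed: upcall_pid=%u\n",
           (unsigned) a.upcall_pid, (unsigned) b.upcall_pid);
    return 0;
}

With upcall_pid left uninitialized, whatever value happened to be on the
stack gets handed to the kernel as the Netlink PID to use for upcalls hit
while executing the packet, which is consistent with the retis observation
of upcalls targeted at a PID nobody is listening on.  Whether the actual
patch keeps the explicit-assignment style of dpif_execute_helper_cb() or
switches to zero-initialization is a style choice for the formal submission.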
Hmm.  Good catch!  That definitely looks like a bug.

I wonder why asan/ubsan doesn't yell at us in our tests...

Best regards, Ilya Maximets.
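On the asan/ubsan question: neither tool instruments reads of uninitialized
but validly addressable stack memory, so this class of bug is invisible to
them.  MemorySanitizer (clang -fsanitize=memory) or Valgrind's memcheck is
what reports it.  Below is a minimal reproducer with illustrative names
unrelated to the OVS sources; the uninitialized read is deliberate.

/* Build with:  clang -fsanitize=memory -g repro.c  (or run under
 * "valgrind ./a.out").  ASan and UBSan stay silent on this program. */
#include <stdint.h>
#include <stdio.h>

struct request {
    int mtu;
    uint32_t upcall_pid;   /* intentionally never assigned below */
};

int
main(void)
{
    struct request req;
    req.mtu = 0;           /* only one field is initialized */

    /* MSan reports "use-of-uninitialized-value" here; Valgrind reports
     * "Conditional jump or move depends on uninitialised value(s)". */
    if (req.upcall_pid != 0) {
        printf("non-zero pid\n");
    }
    return 0;
}

In the dpif case the uninitialized value is not branched on in userspace but
copied into the netlink request, so Valgrind would most likely flag it as
uninitialised bytes in a syscall parameter rather than at the point of use.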
