On 10/22/25 3:42 PM, Frode Nordahl wrote:
> On 10/22/25 09:57, Frode Nordahl wrote:
>> On 10/20/25 11:59, Ilya Maximets wrote:
>>> On 10/18/25 11:22 AM, Frode Nordahl wrote:
>>>> On 7/8/25 13:34, Ilya Maximets wrote:
>>>>> When a packet enters the kernel datapath and there is no flow to handle
>>>>> it, the packet goes to userspace through a MISS upcall.  With the per-CPU
>>>>> upcall dispatch mechanism, we're using the current CPU id to select the
>>>>> Netlink PID on which to send this packet.  This allows us to send
>>>>> packets from the same traffic flow through the same handler.
>>>>>
>>>>> The handler will process the packet, install the required flow into the
>>>>> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
>>>>>
>>>>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
>>>>> recirculation action that will pass the (likely modified) packet
>>>>> through the flow lookup again.  And if the flow is not found, the
>>>>> packet will be sent to userspace again through another MISS upcall.
>>>>>
>>>>> However, the handler thread in userspace is likely running on a
>>>>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
>>>>> in the syscall context of that thread.  So, when the time comes to
>>>>> send the packet through another upcall, the per-CPU dispatch will
>>>>> choose a different Netlink PID, and this packet will end up processed
>>>>> by a different handler thread on a different CPU.
>>>>>
>>>>> The process continues as long as there are new recirculations: each
>>>>> time the packet goes to a different handler thread before it is sent
>>>>> out of the OVS datapath to the destination port.  In real setups the
>>>>> number of recirculations can go up to 4 or 5, sometimes more.
>>>>>
>>>>> There is always a chance of re-ordering packets while processing upcalls,
>>>>> because userspace will first install the flow and then re-inject the
>>>>> original packet.  So, there is a race window in which the flow is already
>>>>> installed and a second packet can match it inside the kernel and be
>>>>> forwarded to the destination before the first packet is re-injected.
>>>>> But the fact that packets are going through multiple upcalls handled
>>>>> by different userspace threads makes the reordering noticeably more
>>>>> likely, because we not only have a race between the kernel and a
>>>>> userspace handler (which is hard to avoid), but also between multiple
>>>>> userspace handlers.
>>>>>
>>>>> For example, let's assume that 10 packets got enqueued through a MISS
>>>>> upcall for handler-1.  It will start processing them, install the flow
>>>>> into the kernel and start re-injecting the packets, from where they
>>>>> will go through another MISS to handler-2.  Handler-2 will install the
>>>>> flow into the kernel and start re-injecting the packets.  While
>>>>> handler-1 continues to re-inject the last of the 10 packets, they will
>>>>> hit the flow installed by handler-2 and be forwarded without going to
>>>>> handler-2, while handler-2 is still re-injecting the first of these 10
>>>>> packets.  Given multiple recirculations and misses, these 10 packets
>>>>> may end up completely mixed up on the output from the datapath.
>>>>>
>>>>> Let's provide the original upcall PID via the new netlink attribute
>>>>> OVS_PACKET_ATTR_UPCALL_PID.  This way the upcall triggered during the
>>>>> execution will go to the same handler.  Packets will be enqueued to
>>>>> the same socket and re-injected in the same order.  This doesn't
>>>>> eliminate re-ordering as stated above, since we still have a race
>>>>> between the kernel and the handler thread, but it allows us to eliminate
>>>>> races between multiple handlers.
>>>>>
>>>>> The openvswitch kernel module ignores unknown attributes for
>>>>> OVS_PACKET_CMD_EXECUTE, so it's safe to provide this attribute even on
>>>>> older kernels.
>>>>>
>>>>> Reported-at: https://issues.redhat.com/browse/FDP-1479
>>>>> Link: https://lore.kernel.org/netdev/[email protected]/
>>>>> Signed-off-by: Ilya Maximets <[email protected]>
>>>>> ---
>>>>
>>>> Hello, Ilya,
>>>>
>>>> There appears to be a regression caused by either the kernel or the
>>>> userspace change on this subject.
>>>>
>>>> While investigating test failures in the OVN system test suite [0] 
>>>> starting with the 6.17 kernel, which includes "net: openvswitch: allow 
>>>> providing upcall pid for the 'execute' command", I found that reverting 
>>>> either the kernel commit or this OVS commit makes the following OVN system
>>>> tests pass again:
>>>>
>>>> "DNAT and SNAT on distributed router - N/S - IPv6"
>>>> "Traffic to router port via LLA"
>>>>
>>>> (Note that there are other tests discussed in that bug that break on a
>>>> different issue; these two are the ones I used in the latest validation
>>>> steps before sending this e-mail.)
>>>>
>>>> I have yet to dive into the details of why this may be the case, and I'll
>>>> likely poke at this some more, but I wanted to hear your immediate thoughts
>>>> / insights on what could possibly be the issue.
>>>>
>>>> 0: https://launchpad.net/bugs/2127061
>>>>
>>>
>>> Hi, Frode.  I didn't try the tests, but if packet re-ordering is in place,
>>> I wonder if the root cause for these tests to fail is somewhat similar
>>> to the one described in OVS commit:
>>>    e0ee785b1a67 ("tests: system-traffic: Fix flaky floating IP test.")
>>>
>>> Providing the original upcall pid can make connection processing faster and
>>> cause the last ACK of a 3WHS to arrive very late and trigger an RST reply.
>>> This reordering of the last ACK has been in OVS always, or at least for a
>>> very long time, but it may be more visible now that the rest of the traffic
>>> is no longer reordered.
>>>
>>> A retis capture is needed to confirm.
>>
>> Thanks for your pointer, retis was indeed enlightening.
>>
>> To me there appears to be something nefarious going on as the packets appear 
>> to be dropped because the kernel ends up attempting to transmit upcalls to a 
>> Netlink PID on which Open vSwitch is not listening [1].
>>
>> I'll try to track down where what appears to be an incorrect PID might be
>> coming from.
>>
>> 1: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/2127061/comments/3
>>
> 
> OK, I think I found something.  The upcall_pid field of struct dpif_execute
> may be used uninitialized in dpif_execute_helper_cb().
> 
> This diff appears to fix this for me:
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 070fc0131..a062a4179 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1240,6 +1240,7 @@ dpif_execute_helper_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>           execute.probe = false;
>           execute.mtu = 0;
>           execute.hash = 0;
> +        execute.upcall_pid = 0;
>           aux->error = dpif_execute(aux->dpif, &execute);
>           log_execute_message(aux->dpif, &this_module, &execute,
>                               true, aux->error);
> 
> I'll do some further checks and propose a formal patch along these lines if 
> you have no objections or other suggestions.
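> 
> An alternative that would also cover any fields added to struct dpif_execute
> in the future would be to zero the whole structure once, before the existing
> per-field assignments, instead of adding one line per field.  Roughly (an
> untested sketch, with the unrelated field assignments elided):
> 
>         struct dpif_execute execute;
> 
>         memset(&execute, 0, sizeof execute);
>         ...
>         /* probe, mtu, hash and upcall_pid no longer need explicit
>          * zeroing here. */
>         aux->error = dpif_execute(aux->dpif, &execute);
> 
> The one-line initialization above is the smaller change and matches the
> existing style of the function, so that is what I'd propose unless you
> prefer the memset variant.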
> 

Hmm.  Good catch!  That definitely looks like a bug.
I wonder why asan/ubsan doesn't yell at us in our tests...
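
That's probably expected, actually: ASan only checks that an access lands in
valid memory and UBSan only checks for undefined operations; neither of them
tracks whether a stack slot was written before it is read.  Catching this
kind of uninitialized read is MSan / valgrind memcheck territory.  A minimal
standalone illustration of the pattern (not OVS code, just a sketch):

    #include <stdio.h>

    struct example {
        int probe;
        unsigned int upcall_pid;    /* never written below */
    };

    int main(void)
    {
        struct example e;

        e.probe = 0;                    /* only one field is initialized */
        printf("%u\n", e.upcall_pid);   /* runs clean under
                                         * -fsanitize=address,undefined;
                                         * -fsanitize=memory or valgrind
                                         * would flag the read */
        return 0;
    }

So if the tests only run with ASan/UBSan, nothing will yell about this one.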

Best regards, Ilya Maximets.
