Jason Baron via dev <ovs-dev@openvswitch.org> writes:

> On 12/10/19 5:20 PM, David Ahern wrote:
>> On 12/10/19 3:09 PM, Jason Baron wrote:
>>> Hi David,
>>>
>>> The idea is that we try and queue new work to 'idle' threads in an
>>> attempt to distribute a workload. Thus, once we find an 'idle' thread we
>>> stop waking up other threads. While we are searching the wakeup list for
>>> idle threads, we do queue an epoll event to the non-idle threads;
>>> this doesn't mean they are woken up. It just means that when they go
>>> to epoll_wait() to harvest events from the kernel, if the event is
>>> still available it will be reported. If the condition for the event
>>> is no longer true (because another thread consumed it), then the
>>> event wouldn't be visible. So it's a way of load balancing a workload
>>> while also reducing the number of wakeups. It's 'exclusive' in the
>>> sense that it will stop after it finds the first idle thread.
>>>
>>> We certainly can employ other wakeup strategies - there was interest
>>> (and patches) for a strict 'round robin' but that has not been merged
>>> upstream.
>>>
>>> I would like to better understand the current use case. It sounds
>>> like each thread has an epoll file descriptor. And each epoll file
>>> descriptor is attached to the same netlink socket. But when that
>>> netlink socket gets a packet it causes all the threads to wake up?
>>> Are you sure there is just 1 netlink socket that all epoll file
>>> descriptors are attached to?
>>>
>> 
>> Thanks for the response.
>> 
>> This is the code in question:
>> 
>> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492
>> 
>> Yes, prior to finding the above code reference I had traced it to a
>> single socket with all handler threads (71 threads on this 96 cpu box)
>> on the wait queue.
>> 
>> The ovs kernel module is punting a packet to userspace. It generates a
>> netlink message and invokes netlink_unicast. This is the stack trace:
>> 
>>         ffffffffad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
>>         ffffffffad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
>>         ffffffffad257275 pollwake+0x75 ([kernel.kallsyms])
>>         ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>         ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>         ffffffffad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
>>         ffffffffad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
>>         ffffffffad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
>>         ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
>>         ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
>>         ffffffffad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
>>         ffffffffad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
>>         ffffffffad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
>>         ffffffffc07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
>>         ffffffffc07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])
>> 
>> 
>> A probe on sock_def_readable shows it is a single socket whose wait
>> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
>> for each thread). In some cases I see the awakened threads immediately
>> running on the target_cpu, while the queue walk continues to wake up all
>> threads. Only the first one is going to handle the packet so the rest of
>> the wakeups are just noise.
>> 
>> On this system in just a 1-second interval I see this sequence play out
>> 400+ times.
>> 
>
>
> That stack trace is interesting. I *think* it means you are doing
> select() or poll() on the epoll file descriptors? I say that because I
> see 'pollwake()' in the stack. Notice that pollwake() is only in
> fs/select.c.

During ofproto upcall processing, if there aren't any upcalls to
receive, a dpif_recv_wait() is called, which will call into
dpif_netlink_recv_wait() which will set up the next poll_block() call.

This eventually uses the poll() interface (through the time_poll() call).

Maybe we can confirm the thundering herd by making a quick hack that
shunts out the poll block (included).  Just a quick 'n dirty hack to
test with.

> Thanks,
>
> -jason

NOTE: this change will trigger warnings about unused parameters.  But it
      will prevent us from registering on the poll_loop for the thread.
      Keep in mind there will be high CPU utilization since
      poll_block() will never sleep (and this is a hack... so there's
      that).

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index d1f9b81db..804373c3a 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -2709,7 +2709,7 @@ dpif_netlink_recv_wait__(struct dpif_netlink *dpif, uint32_t handler_id)
     if (dpif->handlers && handler_id < dpif->n_handlers) {
         struct dpif_handler *handler = &dpif->handlers[handler_id];
 
-        poll_fd_wait(handler->epoll_fd, POLLIN);
+        poll_immediate_wake(); /* now we never sleep or poll the handler */
     }
 #endif
 }
