Jason Baron via dev <ovs-dev@openvswitch.org> writes: > On 12/10/19 5:20 PM, David Ahern wrote: >> On 12/10/19 3:09 PM, Jason Baron wrote: >>> Hi David, >>> >>> The idea is that we try and queue new work to 'idle' threads in an >>> attempt to distribute a workload. Thus, once we find an 'idle' thread we >>> stop waking up other threads. While we are searching the wakeup list for >>> idle threads, we do queue an epoll event to the non-idle threads, this >>> doesn't mean they are woken up. It just means that when they go to >>> epoll_wait() to harvest events from the kernel, if the event is still >>> available it will be reported. If the condition for the event is no >>> longer true (because another thread consumed it), they the event >>> wouldn't be visible. So its a way of load balancing a workload while >>> also reducing the number of wakeups. Its 'exclusive' in the sense that >>> it will stop after it finds the first idle thread. >>> >>> We certainly can employ other wakeup strategies - there was interest >>> (and patches) for a strict 'round robin' but that has not been merged >>> upstream. >>> >>> I would like to better understand the current usecase. It sounds like >>> each thread as an epoll file descriptor. And each epoll file descriptor >>> is attached the name netlink socket. But when that netlink socket gets a >>> packet it causes all the threads to wakeup? Are you sure there is just 1 >>> netlink socket that all epoll file desciptors are are attached to? >>> >> >> Thanks for the response. >> >> This is the code in question: >> >> https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492 >> >> Yes, prior to finding the above code reference I had traced it to a >> single socket with all handler threads (71 threads on this 96 cpu box) >> on the wait queue. >> >> The ovs kernel module is punting a packet to userspace. It generates a >> netlink message and invokes netlink_unicast. This the stack trace: >> >> ffffffffad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms]) >> ffffffffad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms]) >> ffffffffad257275 pollwake+0x75 ([kernel.kallsyms]) >> ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms]) >> ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms]) >> ffffffffad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms]) >> ffffffffad28a4bc ep_call_nested.constprop.18+0xbc >> ([kernel.kallsyms]) >> ffffffffad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms]) >> ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms]) >> ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms]) >> ffffffffad794af9 sock_def_readable+0x39 ([kernel.kallsyms]) >> ffffffffad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms]) >> ffffffffad7eb11a netlink_unicast+0x20a ([kernel.kallsyms]) >> ffffffffc07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms]) >> ffffffffc07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms]) >> >> >> A probe on sock_def_readable shows it is a single socket that the wait >> queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once >> for each thread). In some cases I see the awakened threads immediately >> running on the target_cpu, while the queue walk continues to wake up all >> threads. Only the first one is going to handle the packet so the rest of >> the wakeups are just noise. >> >> On this system in just a 1-second interval I see this sequence play out >> 400+ times. >> > > > That stack trace is interesting. I *think* it means you are doing > select() or poll() on the epoll file descriptors? I say that because I > see 'pollwake()' in the stack. Notice that pollwake() is only in > fs/select.c.
During ofproto upcall processing, if there aren't any upcalls to receive, a dpif_recv_wait() is called, which will call into dpif_netlink_recv_wait() which will set up the next poll_block() call. This eventually uses the poll() interface (through time_poll() call) . Maybe we can confirm the thundering herd by just making a quick hack to shunt out the poll block (included). Just a quick 'n dirty hack to test with. > Thanks, > > -jason NOTE: this change will trigger warnings about unused parameters. But it will prevent us from registering on the poll_loop for the thread. Keep in mind there will be high cpu utilization since the poll_block will never sleep (and this is a hack... so there's that) diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c index d1f9b81db..804373c3a 100644 --- a/lib/dpif-netlink.c +++ b/lib/dpif-netlink.c @@ -2709,7 +2709,7 @@ dpif_netlink_recv_wait__(struct dpif_netlink *dpif, uint32_t handler_id) if (dpif->handlers && handler_id < dpif->n_handlers) { struct dpif_handler *handler = &dpif->handlers[handler_id]; - poll_fd_wait(handler->epoll_fd, POLLIN); + poll_immediate_wake(); /* now we never sleep or poll the handler */ } #endif } _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev