On Tue, Dec 10, 2019 at 10:00 PM David Ahern <dsah...@gmail.com> wrote:
>
> [ adding Jason as author of the patch that added the epoll exclusive flag ]
>
> On 12/10/19 12:37 PM, Matteo Croce wrote:
> > On Tue, Dec 10, 2019 at 8:13 PM David Ahern <dsah...@gmail.com> wrote:
> >>
> >> Hi Matteo:
> >>
> >> On a hypervisor running a 4.14.91 kernel and OVS 2.11 I am seeing a
> >> thundering herd wake-up problem. Every packet punted to userspace wakes
> >> up every one of the handler threads. On a box with 96 cpus, there are
> >> 71 handler threads, which means 71 process wakeups for every packet
> >> punted.
> >>
> >> This is really easy to see: just watch sched:sched_wakeup tracepoints.
> >> With a few extra probes:
> >>
> >> perf probe sock_def_readable sk=%di
> >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
> >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx wake_flags=%cx key=%r8
> >>
> >> you can see there is a single netlink socket and its wait queue contains
> >> an entry for every handler thread.
> >>
> >> This does not happen with the 2.7.3 version. Roaming through the
> >> commits, it appears that the change in behavior comes from this commit:
> >>
> >> commit 69c51582ff786a68fc325c1c50624715482bc460
> >> Author: Matteo Croce <mcr...@redhat.com>
> >> Date:   Tue Sep 25 10:51:05 2018 +0200
> >>
> >>     dpif-netlink: don't allocate per thread netlink sockets
> >>
> >> Is this a known problem?
> >>
> >> David
> >>
> >
> > Hi David,
> >
> > Before my patch, vswitchd created N x M sockets, N being the number of
> > ports and M the number of active cores, because every thread opened a
> > netlink socket per port.
> >
> > With my patch, a pool of N sockets is created, one per port, and all
> > the threads poll the same list with the EPOLLEXCLUSIVE flag.
> > As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
> > of the waiting threads.
> >
> > I'm not aware of this problem, but it goes against the intended
> > behaviour of EPOLLEXCLUSIVE.
> > The flag has existed since Linux 4.5; can you check that it's passed
> > correctly to epoll()?
> >
>
> This is the commit that added the EXCLUSIVE flag:
>
> commit df0108c5da561c66c333bb46bfe3c1fc65905898
> Author: Jason Baron <jba...@akamai.com>
> Date:   Wed Jan 20 14:59:24 2016 -0800
>
>     epoll: add EPOLLEXCLUSIVE flag
>
> The commit message acknowledges that multiple threads can still be awakened:
>
> "The implementation walks the list of exclusive waiters, and queues an
> event to each epfd, until it finds the first waiter that has threads
> blocked on it via epoll_wait(). The idea is to search for threads which
> are idle and ready to process the wakeup events. Thus, we queue an
> event to at least 1 epfd, but may still potentially queue an event to
> all epfds that are attached to the shared fd source."
>
> To me that means all idle handler threads are going to be awakened on
> each upcall message even though only 1 is needed to handle the message.
>
> Jason: what was the rationale behind an exclusive flag that can still
> wake up more than 1 waiter? In the case of OVS and vswitchd I am seeing
> all N handler threads awakened on every single event, which is a
> horrible scaling property.
>
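The pooled arrangement Matteo describes above can be pictured roughly as
follows. This is only an illustrative sketch, not the actual dpif-netlink
code: register_ports, port_fds and n_ports are made-up names, and it assumes
Linux >= 4.5 and a libc that defines EPOLLEXCLUSIVE. Each handler thread
creates its own epoll instance and adds the same shared per-port socket fds
with EPOLLIN | EPOLLEXCLUSIVE.

/* Illustrative sketch only -- not the actual dpif-netlink code.
 * Each handler thread calls this to build its own epoll instance and
 * register the shared per-port socket fds with EPOLLEXCLUSIVE, so a
 * readable socket is meant to wake only a subset of the threads.
 * Assumes Linux >= 4.5 and a libc that defines EPOLLEXCLUSIVE.
 */
#include <sys/epoll.h>
#include <unistd.h>

/* port_fds/n_ports are hypothetical: one netlink socket per port,
 * shared by all handler threads. Returns the thread's epoll fd, or -1. */
static int
register_ports(const int *port_fds, int n_ports)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        return -1;
    }

    for (int i = 0; i < n_ports; i++) {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLEXCLUSIVE,
            .data.fd = port_fds[i],
        };
        /* EPOLLEXCLUSIVE is only valid with EPOLL_CTL_ADD. */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, port_fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;   /* the handler thread then blocks in epoll_wait() on it */
}

The pre-patch layout (one private socket per thread per port) is what made
the fd count explode to N x M and led to the -EMFILE problems mentioned
further down.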
Actually, I didn't look at that commit message, but I read the epoll_ctl
manpage, which says:

"When a wakeup event occurs and multiple epoll file descriptors are
attached to the same target file using EPOLLEXCLUSIVE, one or more of
the epoll file descriptors will receive an event with epoll_wait(2).
The default in this scenario (when EPOLLEXCLUSIVE is not set) is for
all epoll file descriptors to receive an event. EPOLLEXCLUSIVE is thus
useful for avoiding thundering herd problems in certain scenarios."

I'd expect "one or more" to probably be greater than 1, but still much
lower than all.

Before this patch (which unfortunately is needed to avoid -EMFILE errors
with many ports), how many sockets are awakened when an ARP is received?

Regards,

--
Matteo Croce
per aspera ad upstream
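One way to check how many waiters the kernel actually wakes per event,
independent of OVS, is a small standalone test along the lines below. It is
only a rough sketch under stated assumptions: Linux >= 4.5, a libc that
defines EPOLLEXCLUSIVE, and an eventfd standing in for the per-port netlink
socket; each thread owns its own epoll fd, as the handler threads do. Threads
that the kernel does not wake simply time out, so the printed count
approximates the "one or more" from the manpage.

/* Rough sketch: count how many of NTHREADS waiters are woken for a
 * single event on a shared fd registered with EPOLLEXCLUSIVE.
 * Assumptions: Linux >= 4.5, libc defining EPOLLEXCLUSIVE.
 * Build: gcc -O2 -pthread epollexcl_demo.c -o epollexcl_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NTHREADS 8

static int shared_fd;        /* stands in for a per-port netlink socket */
static atomic_int woken;     /* threads that actually saw the event */

static void *waiter(void *arg)
{
    (void)arg;

    /* Each thread owns its own epoll instance, like an OVS handler thread. */
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return NULL;
    }

    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLEXCLUSIVE,
        .data.fd = shared_fd,
    };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, shared_fd, &ev) < 0) {
        perror("epoll_ctl");
        close(epfd);
        return NULL;
    }

    /* Threads the kernel chooses not to wake will just time out (2 s). */
    struct epoll_event out;
    if (epoll_wait(epfd, &out, 1, 2000) > 0) {
        atomic_fetch_add(&woken, 1);
    }
    close(epfd);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];

    shared_fd = eventfd(0, 0);
    if (shared_fd < 0) {
        perror("eventfd");
        return 1;
    }

    for (int i = 0; i < NTHREADS; i++) {
        pthread_create(&tids[i], NULL, waiter, NULL);
    }

    sleep(1);                /* give every thread time to block in epoll_wait */

    uint64_t one = 1;        /* one "packet" */
    if (write(shared_fd, &one, sizeof one) != (ssize_t) sizeof one) {
        perror("write");
    }

    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }
    printf("%d of %d threads woke for a single event\n",
           atomic_load(&woken), NTHREADS);
    close(shared_fd);
    return 0;
}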