Flavio Leitner <[email protected]> writes: > On Wed, Jun 24, 2020 at 09:48:00AM -0400, Aaron Conole wrote: >> Currently, the channel handlers are polled globally. On some >> systems, this causes a thundering herd issue where multiple >> handler threads become active, only to do no work and immediately >> sleep. >> >> The approach here is to push the netlink socket channels to discreet >> handler threads to process, rather than polling on every thread. >> This will eliminate the need to wake multiple threads. >> >> To check: >> >> ip netns add left >> ip netns add right >> ip link add center-left type veth peer name left0 >> ip link add center-right type veth peer name right0 >> ip link set left0 netns left >> ip link set right0 netns right >> ip link set center-left up >> ip link set center-right up >> ip netns exec left ip link set left0 up >> ip netns exec left ip addr add 172.31.110.10/24 dev left0 >> ip netns exec right ip link set right0 up >> ip netns exec right ip addr add 172.31.110.11/24 dev right0 >> >> ovs-vsctl add-br br0 >> ovs-vsctl add-port br0 center-right >> ovs-vsctl add-port br0 center-left >> >> # in one terminal >> perf record -e sched:sched_wakeup,irq:softirq_entry -ag >> >> # in a separate terminal >> ip netns exec left arping -I left0 -c 1 172.31.110.11 >> >> # in the perf terminal after exiting >> perf script >> >> Look for the number of 'handler' threads which were made active. > > I am surprised to find out that in 2020 and Linux doesn't provide > a reasonable solution for the thundering herd issue. I wonder how > many applications have this issue out there. It looks like the > devil in is the details.
It turns out that it's complicated. And the way OVS is structured, simply switching the poll loop to epoll() isn't as straightforward - epoll() would require a completely new design for the way FDs are polled. And then it wouldn't fit with other systems (like Windows or *BSD), and we'd need to implement things quite differently. It's very intrusive and painful to get right. > Anyways, I think the previous model didn't work because the number > of attached ports is increasing and so is the number of cores. The > number of sockets escalates quickly and it had the packet re-ordering > issue. Yes - that's true. We would do n_ports * n_cores sockets. Now we just do n_ports sockets. > Moving forward, we moved to one socket per port with all threads > polling the same socket in the hope that the kernel would only wake > one or very few threads. Looks like that doesn't work and the packet > re-ordering is still an issue. Yes - because what happens is the first level epoll() may not trigger, then the threads go into the deep poll_wait(), which falls back to the poll() systemcall (which uses the older semantics and introduces thundering herd - see the discussion linked). Then when a packet comes in to trigger, kernel will start waking all the threads. > Therefore this approach where there is one socket per port but only > thread polling would solve the thundering herd issue and the packet > re-ordering issue. > > One concern is with load capacity, since now only one thread would > be responsible to process a port, but I think that's what we had > initially, correct? So, it's not an issue. Yes. Actually, I think it's might be better this way - if every packet for a port is getting slowpathed, we will keep all the port processing to the same core, so no more out-of-order (which may have performance impacts on TCP) and since it's slowpath, performance won't really matter that much anyway. > It looks like now it is allocating enough memory for all port_idx > per thread even though it will not use all of them. Perhaps that > can be fixed? For instance, instead of indexing the array with > port_idx, use handler->port_idx as an identifier? Yes, that's why I posted as RFC. I plan to clean up the API / structure a little bit this week after finishing going through my email from PTO. > More inline > >> Reported-by: David Ahern <[email protected]> >> Reported-at: >> https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html >> Cc: Matteo Croce <[email protected]> >> Cc: Flavio Leitner <[email protected]> >> Signed-off-by: Aaron Conole <[email protected]> >> --- >> I haven't finished all my testing, but I wanted to send this >> for early feedback in case I went completely off course. >> >> lib/dpif-netlink.c | 190 ++++++++++++++++++++++++++------------------- >> 1 file changed, 112 insertions(+), 78 deletions(-) >> >> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c >> index 1817e9f849..d59375cf00 100644 >> --- a/lib/dpif-netlink.c >> +++ b/lib/dpif-netlink.c >> @@ -179,9 +179,11 @@ struct dpif_windows_vport_sock { >> >> struct dpif_handler { >> struct epoll_event *epoll_events; >> - int epoll_fd; /* epoll fd that includes channel socks. >> */ >> - int n_events; /* Num events returned by epoll_wait(). */ >> - int event_offset; /* Offset into 'epoll_events'. */ >> + struct dpif_channel *channels; /* Array of channels for each port. */ >> + size_t n_channels; /* Count of channels for each port. */ >> + int epoll_fd; /* epoll fd that includes channel socks. >> */ >> + int n_events; /* Num events returned by epoll_wait(). >> */ >> + int event_offset; /* Offset into 'epoll_events'. */ >> >> #ifdef _WIN32 >> /* Pool of sockets. */ >> @@ -201,7 +203,6 @@ struct dpif_netlink { >> struct fat_rwlock upcall_lock; >> struct dpif_handler *handlers; >> uint32_t n_handlers; /* Num of upcall handlers. */ >> - struct dpif_channel *channels; /* Array of channels for each port. */ >> int uc_array_size; /* Size of 'handler->channels' and */ >> /* 'handler->epoll_events'. */ >> >> @@ -452,15 +453,22 @@ static bool >> vport_get_pid(struct dpif_netlink *dpif, uint32_t port_idx, >> uint32_t *upcall_pid) >> { >> - /* Since the nl_sock can only be assigned in either all >> - * or none "dpif" channels, the following check >> - * would suffice. */ >> - if (!dpif->channels[port_idx].sock) { >> - return false; >> - } >> + size_t i; >> ovs_assert(!WINDOWS || dpif->n_handlers <= 1); >> >> - *upcall_pid = nl_sock_pid(dpif->channels[port_idx].sock); >> + /* The 'port_idx' should only be valid for a single handler. */ >> + for (i = 0; i < dpif->n_handlers; ++i) { >> + >> + if (port_idx < dpif->handlers[i].n_channels && >> + dpif->handlers[i].channels[port_idx].sock) { >> + *upcall_pid = >> + nl_sock_pid(dpif->handlers[i].channels[port_idx].sock); >> + break; >> + } >> + } > > > Perhaps > > /* The 'port_idx' should only be valid for a single handler. */ > - for (i = 0; i < dpif->n_handlers; ++i) { > + for (i = 0; i < dpif->n_handlers; i++) { You don't like pre-increment operator? It's an old habit from writing lots of c++ :-) > + struct dpif_handler *handler = &dpif->handlers[i]; > > - if (port_idx < dpif->handlers[i].n_channels && > - dpif->handlers[i].channels[port_idx].sock) { > + if (port_idx < handler->n_channels && > + handler->channels[port_idx].sock) { > *upcall_pid = > - nl_sock_pid(dpif->handlers[i].channels[port_idx].sock); > + nl_sock_pid(handler->channels[port_idx].sock); > break; > } > } > > the same for similar places below. I plan on replacing all the open-coded loops with a new FOR_EACH Thanks for the review, Flavio! > >> + >> + if (i == dpif->n_handlers) >> + return false; >> >> return true; >> } >> @@ -473,15 +481,23 @@ vport_add_channel(struct dpif_netlink *dpif, >> odp_port_t port_no, >> uint32_t port_idx = odp_to_u32(port_no); >> size_t i; >> int error; >> + struct dpif_handler *handler = NULL; >> >> if (dpif->handlers == NULL) { >> close_nl_sock(sock); >> return 0; >> } >> >> + /* choose a handler by finding the least populated handler. */ >> + handler = &dpif->handlers[0]; >> + for (i = 0; i < dpif->n_handlers; ++i) { >> + if (dpif->handlers[i].n_channels < handler->n_channels) >> + handler = &dpif->handlers[i]; >> + } >> + >> /* We assume that the datapath densely chooses port numbers, which can >> * therefore be used as an index into 'channels' and 'epoll_events' of >> - * 'dpif'. */ >> + * 'dpif->handler'. */ >> if (port_idx >= dpif->uc_array_size) { >> uint32_t new_size = port_idx + 1; >> >> @@ -491,40 +507,33 @@ vport_add_channel(struct dpif_netlink *dpif, >> odp_port_t port_no, >> return EFBIG; >> } >> >> - dpif->channels = xrealloc(dpif->channels, >> - new_size * sizeof *dpif->channels); >> + handler->channels = xrealloc(handler->channels, >> + new_size * sizeof *handler->channels); >> >> - for (i = dpif->uc_array_size; i < new_size; i++) { >> - dpif->channels[i].sock = NULL; >> + for (i = handler->n_channels; i < new_size; i++) { >> + handler->channels[i].sock = NULL; >> } >> >> - for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> - >> - handler->epoll_events = xrealloc(handler->epoll_events, >> - new_size * sizeof *handler->epoll_events); >> - >> - } >> + handler->epoll_events = xrealloc(handler->epoll_events, >> + new_size * sizeof >> *handler->epoll_events); >> dpif->uc_array_size = new_size; >> + handler->n_channels = new_size; >> } >> >> memset(&event, 0, sizeof event); >> event.events = EPOLLIN | EPOLLEXCLUSIVE; >> event.data.u32 = port_idx; >> >> - for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> - >> #ifndef _WIN32 >> - if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock), >> - &event) < 0) { >> - error = errno; >> - goto error; >> - } >> -#endif >> + if (epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(sock), >> + &event) < 0) { >> + error = errno; >> + goto error; >> } >> - dpif->channels[port_idx].sock = sock; >> - dpif->channels[port_idx].last_poll = LLONG_MIN; >> +#endif >> + >> + handler->channels[port_idx].sock = sock; >> + handler->channels[port_idx].last_poll = LLONG_MIN; >> >> return 0; >> >> @@ -535,34 +544,53 @@ error: >> nl_sock_fd(sock), NULL); >> } >> #endif >> - dpif->channels[port_idx].sock = NULL; >> + handler->channels[port_idx].sock = NULL; >> >> return error; >> } >> >> +static void >> +vport_del_channel(struct dpif_handler *handler, uint32_t port_idx) >> +{ >> + if (port_idx < handler->n_channels && >> + handler->channels[port_idx].sock) { >> +#ifndef _WIN32 >> + epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL, >> + nl_sock_fd(handler->channels[port_idx].sock), NULL); >> +#endif >> + handler->event_offset = handler->n_events = 0; >> + >> +#ifndef _WIN32 >> + nl_sock_destroy(handler->channels[port_idx].sock); >> +#endif >> + handler->channels[port_idx].sock = NULL; >> + } >> +} >> + >> static void >> vport_del_channels(struct dpif_netlink *dpif, odp_port_t port_no) >> { >> uint32_t port_idx = odp_to_u32(port_no); >> + struct dpif_handler *handler = NULL; >> size_t i; >> >> - if (!dpif->handlers || port_idx >= dpif->uc_array_size >> - || !dpif->channels[port_idx].sock) { >> + if (!dpif->handlers || port_idx >= dpif->uc_array_size) { >> return; >> } >> >> for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> -#ifndef _WIN32 >> - epoll_ctl(handler->epoll_fd, EPOLL_CTL_DEL, >> - nl_sock_fd(dpif->channels[port_idx].sock), NULL); >> -#endif >> - handler->event_offset = handler->n_events = 0; >> + handler = &dpif->handlers[i]; >> + >> + if (port_idx >= handler->n_channels || >> + !handler->channels[port_idx].sock) { >> + handler = NULL; >> + continue; >> + } >> + } >> + >> + if (handler) { >> + vport_del_channel(handler, port_idx); >> } >> -#ifndef _WIN32 >> - nl_sock_destroy(dpif->channels[port_idx].sock); >> -#endif >> - dpif->channels[port_idx].sock = NULL; >> } >> >> static void >> @@ -575,36 +603,38 @@ destroy_all_channels(struct dpif_netlink *dpif) >> return; >> } >> >> - for (i = 0; i < dpif->uc_array_size; i++ ) { >> - struct dpif_netlink_vport vport_request; >> - uint32_t upcall_pids = 0; >> - >> - if (!dpif->channels[i].sock) { >> - continue; >> - } >> + for (i = 0; i < dpif->n_handlers; i++) { >> + struct dpif_handler *handler = &dpif->handlers[i]; >> + size_t j; >> >> - /* Turn off upcalls. */ >> - dpif_netlink_vport_init(&vport_request); >> - vport_request.cmd = OVS_VPORT_CMD_SET; >> - vport_request.dp_ifindex = dpif->dp_ifindex; >> - vport_request.port_no = u32_to_odp(i); >> - vport_request.n_upcall_pids = 1; >> - vport_request.upcall_pids = &upcall_pids; >> - dpif_netlink_vport_transact(&vport_request, NULL, NULL); >> + for (j = 0; j < handler->n_channels; j++ ) { >> + struct dpif_netlink_vport vport_request; >> + uint32_t upcall_pids = 0; >> >> - vport_del_channels(dpif, u32_to_odp(i)); >> - } >> + if (!handler->channels[j].sock) { >> + continue; >> + } >> >> - for (i = 0; i < dpif->n_handlers; i++) { >> - struct dpif_handler *handler = &dpif->handlers[i]; >> + /* Turn off upcalls. */ >> + dpif_netlink_vport_init(&vport_request); >> + vport_request.cmd = OVS_VPORT_CMD_SET; >> + vport_request.dp_ifindex = dpif->dp_ifindex; >> + vport_request.port_no = u32_to_odp(j); >> + vport_request.n_upcall_pids = 1; >> + vport_request.upcall_pids = &upcall_pids; >> + dpif_netlink_vport_transact(&vport_request, NULL, NULL); >> >> + vport_del_channel(handler, j); >> + } >> + handler->n_channels = 0; >> + free(handler->channels); >> + handler->channels = NULL; >> dpif_netlink_handler_uninit(handler); >> free(handler->epoll_events); >> + handler->epoll_events = NULL; >> } >> - free(dpif->channels); >> free(dpif->handlers); >> dpif->handlers = NULL; >> - dpif->channels = NULL; >> dpif->n_handlers = 0; >> dpif->uc_array_size = 0; >> } >> @@ -1091,13 +1121,17 @@ dpif_netlink_port_get_pid__(const struct >> dpif_netlink *dpif, >> /* The ODPP_NONE "reserved" port number uses the "ovs-system"'s >> * channel, since it is not heavily loaded. */ >> uint32_t idx = port_idx >= dpif->uc_array_size ? 0 : port_idx; >> + size_t i; >> >> /* Needs to check in case the socket pointer is changed in between >> * the holding of upcall_lock. A known case happens when the main >> * thread deletes the vport while the handler thread is handling >> * the upcall from that port. */ >> - if (dpif->channels[idx].sock) { >> - pid = nl_sock_pid(dpif->channels[idx].sock); >> + for (i = 0; i < dpif->n_handlers; ++i) { >> + if (idx < dpif->handlers[i].n_channels && >> + dpif->handlers[i].channels[idx].sock) { >> + pid = nl_sock_pid(dpif->handlers[i].channels[idx].sock); >> + } >> } >> } >> >> @@ -2738,7 +2772,7 @@ dpif_netlink_recv__(struct dpif_netlink *dpif, >> uint32_t handler_id, >> >> while (handler->event_offset < handler->n_events) { >> int idx = handler->epoll_events[handler->event_offset].data.u32; >> - struct dpif_channel *ch = &dpif->channels[idx]; >> + struct dpif_channel *ch = &handler->channels[idx]; >> >> handler->event_offset++; >> >> @@ -2840,14 +2874,14 @@ dpif_netlink_recv_purge__(struct dpif_netlink *dpif) >> OVS_REQ_WRLOCK(dpif->upcall_lock) >> { >> if (dpif->handlers) { >> - size_t i; >> + size_t i, j; >> >> - if (!dpif->channels[0].sock) { >> - return; >> - } >> - for (i = 0; i < dpif->uc_array_size; i++ ) { >> - >> - nl_sock_drain(dpif->channels[i].sock); >> + for (j = 0; j < dpif->n_handlers; ++j) { >> + for (i = 0; i < dpif->handlers[j].n_channels; i++ ) { >> + if (dpif->handlers[j].channels[i].sock) { >> + nl_sock_drain(dpif->handlers[j].channels[i].sock); >> + } >> + } >> } >> } >> } >> -- >> 2.25.1 >> >> _______________________________________________ >> dev mailing list >> [email protected] >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
