> On 08.07.2021 at 14:14, Yann Ylavic <ylavic....@gmail.com> wrote:
>
> On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing
> <stefan.eiss...@greenbytes.de> wrote:
>>
>> Some day, I knew I had to learn more about mpm_event. =)
>>
>> Adding more DEBUGs, I see in the example below that 2 connections were
>> idling at the start of the graceful and they get added to the
>> linger_chain. 2 workers are then woken up and process the socket.
>> connection_count stays at 2, however. As I read it, that count drops
>> when the connection pool is destroyed/cleaned up. This normally seems
>> to happen on the worker_queue_info, but in this example it just does
>> not happen.
>>
>> Is this a correct read?
>
> You proxy to a local server, so the 2 connections are the incoming
> ones on the local proxy vhost and the local server vhost.
> But mod_proxy backend connections are recycled and end up in a reslist
> which is not checked until reuse.
> So when, on graceful restart, the local server starts lingering close
> with its kept-alive connection, it's just ignored by mod_proxy and the
> MAX_SECS_TO_LINGER timeout (30s) applies.
>
> I think we shouldn't do lingering close on keepalive connections when
> they expire or get killed (graceful or max workers); this does not
> help the client anyway, because any data sent on the connection is
> doomed. The sooner we RESET, the faster it will try another connection.
>
> So how about the attached patch, which closes the connections when
> their keepalive expires (including when it's shortened)?
> There are other changes in there, like more trace logs that helped me
> debug things; they are worth it too, IMHO.
> Also, I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing
> and using update_reqevents_from_sense() in more places), because I'm
> not sure we catch remotely closed connections otherwise
> (connection_count should go down to 1 more quickly in the local
> proxy+server case, because the local server's client connection should
> have responded to the lingering close almost instantly, but this one
> seems to time out too).
>
> Cheers;
> Yann.
> <event_ka_no_linger.diff>
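[For readers following along: the distinction Yann draws between a
lingering close and an immediate reset boils down to roughly the
following. This is a minimal sketch; the names, the 512-byte buffer and
the RST-via-SO_LINGER variant are illustrative, not the actual
mpm_event code.

#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define SECS_TO_LINGER 30   /* cf. MAX_SECS_TO_LINGER in httpd */

/* "Lingering close": stop sending, then drain whatever the peer still
 * sends until it closes its own side (FIN) or a timeout expires. If
 * the peer never reacts (e.g. an idle connection parked in a proxy
 * reslist that nobody polls), we sit here for the full timeout. */
static void lingering_close(int fd)
{
    char buf[512];
    struct timeval tv = { SECS_TO_LINGER, 0 };

    shutdown(fd, SHUT_WR);                          /* send our FIN */
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    while (read(fd, buf, sizeof(buf)) > 0)
        ;   /* drain; real code also tracks an overall deadline */
    close(fd);
}

/* Immediate reset: for an expired keepalive connection any data in
 * flight is doomed anyway, so just tear the connection down. */
static void abortive_close(int fd)
{
    struct linger lin = { 1, 0 };   /* l_onoff = 1, l_linger = 0 */

    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
    close(fd);                      /* sends RST rather than FIN */
}

With l_linger set to 0, close() resets the connection instead of going
through the normal FIN handshake, so a peer that is still alive learns
right away that the connection is gone and can retry elsewhere.]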
This seems to be it! Yann strikes again! And I learned something... \o/

I needed to make small tweaks, because not all previous close calls
checked the return value, and I got assert failures with your patch. I
attach my version below. Instead of the parameter, there could also be
two functions, one without checks. A matter of taste.

So, to summarise:

- The effect was triggered by the remote endpoints of keep-alive
  connections slumbering in a proxy connection pool, where no one was
  monitoring the sockets and reacting to TCP FIN packets (see the
  sketch after this mail).
- This triggered a long LINGER timeout that blocked the listener from
  exiting, ultimately leading to the parent killing the child.
- This situation could also happen, I assume, when a client drops off
  the network (e.g. a cell phone entering bad coverage), where no one
  in the network path really knows what to do. Or a network
  connectivity issue, or a computer crashing behind a NAT.

Cheers, Stefan
<event_ka_no_lingerv2.diff>
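[As a footnote to Stefan's summary: the way a dead keep-alive peer in a
pool can be detected before reuse looks roughly like the following.
This is a minimal sketch with a hypothetical helper name; mod_proxy's
actual reslist check goes through APR, not raw poll().

#include <poll.h>
#include <sys/socket.h>

/*
 * Check whether an idle, pooled connection is still usable before
 * handing it out again. An idle keepalive connection should have no
 * data pending, so if the socket becomes readable while parked in the
 * pool, it almost always means the peer sent a FIN (recv returns 0)
 * or the connection errored.
 */
static int backend_conn_is_dead(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLHUP };
    char c;

    if (poll(&pfd, 1, 0) <= 0)      /* 0ms timeout: just a probe */
        return 0;                   /* nothing pending, looks usable */

    /* MSG_PEEK so a (surprising) pending byte is not consumed. */
    return recv(fd, &c, 1, MSG_PEEK) <= 0;  /* 0 = FIN, -1 = error */
}

Requesting POLLHUP alongside POLLIN mirrors the reqevents change in
Yann's patch; strictly speaking, poll() reports POLLHUP even when it
was not asked for, so in this raw-socket sketch it mainly documents
the intent.]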