> On 08.07.2021 at 14:14, Yann Ylavic <ylavic....@gmail.com> wrote:
> 
> On Thu, Jul 8, 2021 at 11:47 AM Stefan Eissing
> <stefan.eiss...@greenbytes.de> wrote:
>> 
>> Some day, I knew I had to learn more about mpm_event. =)
>> 
>> Adding more DEBUGs, I see in the example below that 2 connections were
>> idling at the start of the graceful restart and they get added to the
>> linger_chain. 2 workers are then woken up and process the sockets.
>> connection_count stays at 2, however. As I read it, that count drops when
>> the connection pool is destroyed/cleaned up. This normally seems to happen
>> via the worker_queue_info, but in this example it just does not happen.
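>> 
>> To make sure we mean the same accounting, this is roughly what I am
>> looking at (a simplified sketch from my reading of event.c; the exact
>> names and signatures may differ):
>> 
>>   /* each accepted connection bumps the count... */
>>   apr_atomic_inc32(&connection_count);
>> 
>>   /* ...and a cleanup registered on the connection pool drops it again
>>    * when that pool is destroyed/cleared */
>>   static apr_status_t decrement_connection_count(void *dummy)
>>   {
>>       apr_atomic_dec32(&connection_count);
>>       return APR_SUCCESS;
>>   }
>> 
>>   apr_pool_cleanup_register(c->pool, NULL, decrement_connection_count,
>>                             apr_pool_cleanup_null);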
>> 
>> Is this a correct read?
> 
> You proxy to a local server, so the 2 connections are the incoming
> ones on the local proxy vhost and the local server vhost.
> But mod_proxy backend connections are recycled and end up in a reslist
> which is not checked until reuse.
> So when, on graceful restart, the local server starts a lingering close
> on its kept-alive connection, that is simply ignored by mod_proxy and
> the MAX_SECS_TO_LINGER timeout (30s) applies.
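> 
> For illustration, a lingering close boils down to something like this
> (a simplified sketch, not the exact ap_lingering_close() code):
> 
>   char dummy[512];
>   apr_size_t len;
>   apr_status_t rv;
>   apr_time_t deadline = apr_time_now()
>                         + apr_time_from_sec(MAX_SECS_TO_LINGER);
> 
>   apr_socket_shutdown(sock, APR_SHUTDOWN_WRITE);   /* send our FIN */
>   do {
>       /* drain whatever the peer still sends, up to the deadline */
>       apr_socket_timeout_set(sock, deadline - apr_time_now());
>       len = sizeof(dummy);
>       rv = apr_socket_recv(sock, dummy, &len);
>       /* APR_EOF means the peer answered our FIN; but a peer that
>        * parked the connection in a reslist sends nothing at all,
>        * so we block here until the full 30s have passed */
>   } while (rv == APR_SUCCESS && apr_time_now() < deadline);
>   apr_socket_close(sock);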
> 
> I think we shouldn't do a lingering close on keepalive connections when
> they expire or get killed (graceful restart or max workers reached); it
> does not help the client anyway, because any data sent on the connection
> is doomed: the sooner we RESET, the sooner the client will retry on
> another connection.
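> 
> (With plain BSD sockets, the abortive close I have in mind is the
> classic SO_LINGER trick; just a sketch of the intent, not literally
> what the patch does:)
> 
>   struct linger lin = { .l_onoff = 1, .l_linger = 0 };
>   setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
>   close(fd);  /* goes out as RST, the client can fail over at once */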
> 
> So how about the attached patch that closes the connections when their
> keepalive expires (including when it's shortened)?
> There are other changes in there too, like more trace logs that helped
> me debug things; they are worth keeping IMHO.
> Also, I changed reqevents = POLLIN to POLLIN|POLLHUP (by generalizing
> and using update_reqevents_from_sense() in more places), because I'm
> not sure that we catch connections closed remotely otherwise
> (connection_count should go down to 1 more quickly in the local
> proxy+server case, because the local server's client connection should
> have responded to the lingering close almost instantly, but that one
> seems to time out too).
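> 
> In apr_pollset terms the reqevents change amounts to this (sketch only,
> update_reqevents_from_sense() centralizes it in the patch):
> 
>   apr_pollfd_t *pfd = &cs->pfd;   /* the connection's poll descriptor */
>   pfd->reqevents = APR_POLLIN | APR_POLLHUP;  /* wake on remote hangup too */
>   apr_pollset_add(event_pollset, pfd);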
> 
> Cheers;
> Yann.
> <event_ka_no_linger.diff>


This seems to be it! Yann strikes again! And I learned something...\o/

I needed to make small tweaks, because not all previous close calls checked
the return value and I got assert failures with your patch. I attach my
version below. Instead of the parameter, there could also be 2 functions,
one without checks. A matter of taste.
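
Something like this is what I mean with the 2 functions (hypothetical
names, just to illustrate):

  /* close and assert success, for paths where failure indicates a bug */
  static void close_socket_checked(apr_socket_t *sock)
  {
      apr_status_t rv = apr_socket_close(sock);
      AP_DEBUG_ASSERT(rv == APR_SUCCESS);
  }

  /* close and ignore the result, for the pre-existing call sites that
   * never checked it */
  static void close_socket_unchecked(apr_socket_t *sock)
  {
      (void)apr_socket_close(sock);
  }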

So, to summarise:

- The effect was triggered by the remote endpoints of keep-alive
  connections slumbering in a proxy connection pool, where no one was
  monitoring the sockets and reacting to TCP FIN packets.
- This triggered a long LINGER timeout that blocked the listener from
  exiting, ultimately leading to the parent killing the child.
- This situation could also happen, I assume, when a client drops from
  the network, e.g. a cell phone entering bad coverage, where no one
  in the network path really knows what to do. Or a network connectivity
  issue, or a computer crashing behind a NAT.

Cheers, Stefan

Attachment: event_ka_no_lingerv2.diff