https://issues.apache.org/bugzilla/show_bug.cgi?id=57520

--- Comment #6 from Yann Ylavic <[email protected]> ---
(In reply to dk from comment #5)
> Perhaps you could suggest a couple mod_proxy configurations for me to test
> with the traffic recorder on? I should note, it might take a few runs
> (generating pretty big logs) since the error behavior is quite flaky and
> occurs sporadically - depending on how the Jetty restart is exactly timed.

Given your load scenario and "sporadic error behavior", a network capture is
probably not a good idea (it would indeed produce a big file, but above all it
would take a huge amount of time to analyse).

You could restrict the captured packets' size with tcpdump -s, or just filter
SYN/FIN/RST (and then look for connections terminated prematurely, i.e. those
with low sequence numbers at the end, correlated with error log timestamps),
but that is still probably quite painful to analyse with high traffic.
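
For reference, a capture along these lines could do (host and port are
placeholders to adapt; the filter only keeps SYN/FIN/RST packets):

  tcpdump -s 96 -w jetty.pcap \
    'host <jetty-host> and port <jetty-port>' \
    'and (tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0)'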

Before that, maybe we can think more about what's happening...

First, please note (as already requested) that the relevant part of your
configuration (<Proxy balancer:...></Proxy>, ProxyPass, ..., or the complete
file) would help determine httpd's expected behaviour.
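
Something along these lines is what I mean (names, ports and parameter values
below are placeholders, not a recommendation):

  <Proxy balancer://mycluster>
      BalancerMember http://host1:8080 retry=60
      BalancerMember http://host2:8080 retry=60
  </Proxy>
  ProxyPass        /my_uri balancer://mycluster failonstatus=503
  ProxyPassReverse /my_uri balancer://mycluster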

Otherwise, can you see "child pid <pid> exit signal <sig>..." messages in the
global log file (the one pointed to by the main ErrorLog directive, outside of
any VirtualHost)?
Those would mean a crash in httpd (children) and could explain why some
connections are closed forcibly (by the system) before the request is sent, as
detected on the jetty side.
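
A quick way to check (the log path below is just an example, use the one from
your main ErrorLog):

  grep "exit signal" /var/log/httpd/error_log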

Other than that, mod_proxy will *never* close a connection without at least
trying to send the request it was created for (except on read failures on the
client side), even if the backend (LoadBalancerMember) has switched to recovery
state in the meantime (because of some simultaneous request's failure).
So this is not something that should/could be fixed (per comment #1); it simply
should not happen.

The only reason I can see for which jetty could get an EOF (without data) is a
race condition between a connect() timeout on the proxy side and a
(simultaneous) accept() on the jetty side (though that would cause "The timeout
specified has expired" instead of "Connection refused" in the log).

> "Connection refused" messages in the error log are expected since after
> Jetty JVM exits it won't be listening on ports while it is restarting. While
> Jetty is still shutting down and waiting for existing requests to finish it
> responds with 503. In httpd I have failonstatus set to handle that.

BTW, in both cases ("connection refused" and 503) mod_proxy will put the
backend in recovery state for a period of retry=<seconds> (60 by default),
which is what is intended (IIUC).
So, all the "connection refused" messages should only appear once jetty is
completely down and until jetty is up again, and only when the backend is being
retried (the number of lines should then be the number of simultaneous requests
elected for that backend at that time).
So far, so good.
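
For completeness, that recovery period can be tuned per member, e.g. (host,
port and value are placeholders):

  # keep a failed member in recovery for 30 seconds instead of the default 60
  BalancerMember http://host1:8080 retry=30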

But in this scenario there shouldn't be any "Connection reset by peer: proxy:
error reading status line from remote server", which indicates that an
established connection was *reset* by jetty with no response at all, no 200 nor
503...
This error won't put the backend in recovery state (the connect()ion succeeded
and there is no status to fail on), but with proxy-initial-not-pooled and a
normal browser as client, the browser will resend the request without notifying
the user.
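
For the record, that environment variable is usually set with something like
the following (where to place it in the configuration is up to you):

  # don't forward a client connection's initial request on a reused
  # (pooled) backend connection
  SetEnv proxy-initial-not-pooled 1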

From the log lines you provided in comment #3:
> [Sun Feb 01 16:40:06 2015] [error] [client xx.xx.xx.xx] (104)Connection reset 
> by peer: proxy: error reading status line from remote server host1:xxxx
> [Sun Feb 01 16:40:06 2015] [error] [client xx.xx.xx.xx] proxy: Error reading 
> from remote server returned by /my_uri
> [Sun Feb 01 16:40:06 2015] [error] [client xx.xx.xx.xx] (104)Connection reset 
> by peer: proxy: error reading status line from remote server host2:xxxx
> [Sun Feb 01 16:40:06 2015] [error] [client xx.xx.xx.xx] proxy: Error reading 
> from remote server returned by /my_uri
> [Sun Feb 01 16:40:06 2015] [error] (111)Connection refused: proxy: HTTP: 
> attempt to connect to xx.xx.xx.xx:xxxx (host1) failed
> [Sun Feb 01 16:40:06 2015] [error] ap_proxy_connect_backend disabling worker 
> for (host1)
it suggests that read errors can arise just before jetty is down (not
connectable anymore), so it seems that the graceful shutdown is missing some
already established connections...
In fact this is also racy: between the time jetty detects/decides there is no
pending connection left (no more to answer with 503) and the time the listening
socket is really closed, new connections may still be handshaked (TCP speaking)
by the OS (up to the listen backlog size).
These connections will typically be reset once the listening socket is really
closed.

> 
> As far as ttl/keepalives - yes this reported behavior is seen under
> "proxy-nokeepalive" too. TTL is the only setting you mentioned that I have
> not tried, but I presume it is ignored under nokeepalive regime?

Correct, I just mentioned this based on your "keepalive=On" + "Keepalive 600"
configuration from comment #3, and only to note that a ttl lower than the
backend's KeepAliveTimeout is probably a better alternative with regard to
performance/resources (since connections to jetty would be reused), and
possibly also with regard to fault tolerance (see below).
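
For example, with the backend's KeepAliveTimeout at 600 seconds (as per
comment #3), something like this (host, port and value are placeholders):

  # reuse connections to jetty, but never past the backend's keepalive window
  BalancerMember http://host1:8080 ttl=590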

Whether that helps would depend on how jetty handles kept-alive (idle)
connections on graceful restart (in your scenario).
If those are closed immediately, the issue remains that mod_proxy may reuse
connections closed before they can be detected as such on its side.
But if they are closed only after KeepAliveTimeout, while still being answered
with a 503 when some request arrives (in time), everything is fine.

This "mode" may also be better for fault tolerance (provided "everything is
fine" above), because mod_proxy will then likely reuse already established
connections and should receive 503s on all of them (for all simultaneous
requests), until the backend enters the recovery state.

Although this is just (my) theoretical thought...
