[
https://issues.apache.org/jira/browse/TS-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Leif Hedstrom updated TS-3959:
------------------------------
Fix Version/s: 6.2.0 (was: 6.1.0)
> Dropped keep-alive connections not being re-established
> -------------------------------------------------------
>
> Key: TS-3959
> URL: https://issues.apache.org/jira/browse/TS-3959
> Project: Traffic Server
> Issue Type: Bug
> Affects Versions: 6.0.0
> Reporter: Nick Muerdter
> Assignee: Alan M. Carroll
> Labels: regression
> Fix For: 6.2.0
>
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with
> connection retrying and outgoing keep-alive connections. I believe the
> changes in behavior might be related to this issue:
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated on the
> mailing list that it sounded more like a regression
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%[email protected]%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive
> connections already opened, but then one of the keep-alive connections is
> closed, the next request to TrafficServer may generate a 502 Server Hangup
> response when attempting to reuse that connection. Previously, I think
> TrafficServer was retrying when it encountered a closed keep-alive
> connection, but that is no longer the case. So if you have a backend that
> might unexpectedly close its open keep-alive connections, the only way I've
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing
> keep-alive (the proxy.config.http.keep_alive_enabled_out and
> proxy.config.http.keep_alive_post_out settings).
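For reference, the workaround described above would look roughly like this in records.config. This is a minimal sketch that assumes the standard `CONFIG name TYPE value` syntax and that both settings are integer switches where 0 means disabled; it turns off all outgoing connection reuse, so it trades the 502s for extra connection setup on every origin request.

```
CONFIG proxy.config.http.keep_alive_enabled_out INT 0
CONFIG proxy.config.http.keep_alive_post_out INT 0
```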
> As a more concrete example of what can trigger this, the problem is fairly
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular
> stream of SIGHUP signals to nginx to reload it.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer
> among your stream of requests.
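The steps above can be approximated with a small driver script. This is only a sketch, not the reporter's actual test harness: the function names (`fetch_status`, `reload_nginx`, `run_load`) and all parameters are hypothetical, requests are sent sequentially rather than as a truly concurrent stream, and whether a 502 shows up still depends on timing.

```python
import os
import signal
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code of a GET to url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code (e.g. 502) is what we want to count.
        return err.code


def reload_nginx(master_pid):
    """Send SIGHUP to the nginx master process, which triggers a reload."""
    os.kill(master_pid, signal.SIGHUP)


def run_load(url, master_pid, total_requests, reload_every,
             fetch=fetch_status, reload=reload_nginx):
    """Send total_requests GETs through the proxy, reloading nginx after
    every reload_every requests, and return a Counter of status codes."""
    statuses = Counter()
    for i in range(1, total_requests + 1):
        statuses[fetch(url)] += 1
        if i % reload_every == 0:
            reload(master_pid)
    return statuses
```

Under these assumptions, calling `run_load` with the TrafficServer listener URL and the nginx master PID returns a count per status code; any 502 entries in the result would correspond to the Server Hangup responses described above. The `fetch` and `reload` parameters exist only so the loop can be exercised without a live proxy.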
> SIGHUPs in nginx should result in zero downtime for new requests, but I think
> what's happening is that TrafficServer may fail when an old keep-alive
> connection is reused (it's not common, so it depends on the timing of things
> and if the connection is from an old nginx worker that has since been shut
> down). In TrafficServer 5.3.1 these connection failures were retried, but in
> 6.0.0, no retries occur in this case.
> Here are some debug logs that show the difference in behavior between 6.0.0
> and 5.3.1. Note that the differences seem to stem from how each version
> eventually handles the "VC_EVENT_EOS" event following
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like
> TrafficServer is reporting an odd empty response from these connections
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can
> tell from TCP dumps on the system, nginx is not actually sending any form of
> response.
> In these example cases the backend server isn't sending back any data (at
> least as far as I can tell), so from what I understand (and the logic
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe
> to retry.
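The retry condition argued for above can be written down as a small predicate. This is a hypothetical illustration of the reporter's reasoning only, not TrafficServer's actual state machine logic; the function name, its parameters, and the single-retry default are all assumptions.

```python
def safe_to_retry(reused_keepalive, bytes_from_origin, retries_done,
                  max_retries=1):
    """Decide whether a failed origin request may be retried.

    Assumed rule: retry only when the request went out on a reused
    keep-alive connection, the origin closed it without sending any
    response bytes, and the retry budget is not yet used up.
    """
    if retries_done >= max_retries:
        return False
    return reused_keepalive and bytes_from_origin == 0
```

With this rule, an EOS on a reused connection with zero bytes received leads to one retry on a fresh connection, while a partial response is never retried.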
> Let me know if I can provide any other details. Or if exact scripts to
> reproduce the issues against the example nginx backend I described above
> would be useful, I could get that together.