Thomas Jackson created TS-4509:
----------------------------------

             Summary: Dropped keep-alive connections not being re-established 
(TS-3959 continues)
                 Key: TS-4509
                 URL: https://issues.apache.org/jira/browse/TS-4509
             Project: Traffic Server
          Issue Type: Bug
          Components: Core, Network
            Reporter: Thomas Jackson
            Assignee: Thomas Jackson
            Priority: Blocker
             Fix For: 7.0.0


I've observed some differences in how TrafficServer 6.0.0 behaves with 
connection retrying and outgoing keep-alive connections. I believe the changes 
in behavior might be related to this issue: 
https://issues.apache.org/jira/browse/TS-3440

I originally wasn't sure if this was a bug, but on the mailing list James Peach 
indicated it sounded more like a regression 
(http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%[email protected]%3e).

What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive 
connections already opened, but then one of the keep-alive connections is 
closed, the next request to TrafficServer may generate a 502 Server Hangup 
response when attempting to reuse that connection. Previously, I think 
TrafficServer was retrying when it encountered a closed keep-alive connection, 
but that is no longer the case. So if you have a backend that might 
unexpectedly close its open keep-alive connections, the only way I've found to 
completely prevent these 502 errors in 6.0.0 is to disable outgoing keep-alive 
(the proxy.config.http.keep_alive_enabled_out and 
proxy.config.http.keep_alive_post_out settings).
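
For reference, the workaround above would look roughly like this in 
records.config. This is a minimal fragment, assuming both settings are integer 
flags where 0 means disabled; it trades away outgoing connection reuse to 
avoid the 502s.

```
# Workaround only: disable outgoing (origin-side) keep-alive.
CONFIG proxy.config.http.keep_alive_enabled_out INT 0
CONFIG proxy.config.http.keep_alive_post_out INT 0
```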

For a slightly more concrete example of what can trigger this, this is fairly 
easy to reproduce with the following setup:

- TrafficServer is proxying to nginx with outgoing keep-alive connections 
enabled (the default).
- Throw a constant stream of requests at TrafficServer.
- While that constant stream of requests is happening, also send a regular 
stream of SIGHUP signals to nginx to make it reload.
- Eventually you'll get some 502 Server Hangup responses from TrafficServer 
among your stream of requests.
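
The steps above can be sketched as a small Python script. This is only an 
illustration, not the exact script I used: the function names (fetch_status, 
run_load, reload_periodically, reproduce) are made up for this example, and 
the URL and nginx master PID passed in are assumptions about your setup. The 
client side does not need keep-alive itself, since the connections in question 
are the ones between TrafficServer and nginx.

```python
import os
import signal
import threading
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET of url, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as the 502s) are raised as HTTPError.
        return err.code


def run_load(url, total, fetch=fetch_status):
    """Send `total` sequential requests and tally the status codes seen."""
    counts = Counter()
    for _ in range(total):
        counts[fetch(url)] += 1
    return counts


def reload_periodically(pid, interval, stop, send=os.kill):
    """Send SIGHUP to `pid` every `interval` seconds until `stop` is set."""
    while not stop.wait(interval):
        send(pid, signal.SIGHUP)


def reproduce(url, nginx_pid, total=10000, interval=1.0):
    """Run the request stream while nginx is being reloaded in the background."""
    stop = threading.Event()
    reloader = threading.Thread(
        target=reload_periodically, args=(nginx_pid, interval, stop), daemon=True
    )
    reloader.start()
    try:
        return run_load(url, total)
    finally:
        stop.set()
        reloader.join()
```

Calling reproduce() with the TrafficServer URL and the nginx master PID returns 
a tally of status codes; any 502 entries in that tally would correspond to the 
failures described here.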

SIGHUPs in nginx should result in zero downtime for new requests, but I think 
what's happening is that TrafficServer may fail when it reuses an old 
keep-alive connection (this isn't common, since it depends on timing and on 
whether the connection belongs to an old nginx worker that has since been shut 
down). In TrafficServer 5.3.1 these connection failures were retried, but in 
6.0.0 no retries occur in this case.

Here are some debug logs that show the difference in behavior between 6.0.0 and 
5.3.1. Note that the differences seem to stem from how each version eventually 
handles the "VC_EVENT_EOS" event following 
"&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".

5.3.1: 
https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
6.0.0: 
https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314

Interestingly, if I'm understanding the log files correctly, it looks like 
TrafficServer is reporting an odd empty response from these connections 
("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can 
tell from TCP dumps on the system, nginx is not actually sending any form of 
response.

In these example cases the backend server isn't sending back any data (at least 
as far as I can tell), so from what I understand (and the logic outlined in 
https://issues.apache.org/jira/browse/TS-3440), it should be safe to retry.
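
To make the retry condition I have in mind concrete, here is a small Python 
sketch. It is only my reading of the argument above, not TrafficServer's 
actual code: the ServerAttempt type, its fields, and should_retry are all 
hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ServerAttempt:
    """Hypothetical summary of one attempt to send a request to the backend."""
    reused_keep_alive: bool        # request went out on an existing connection
    response_bytes_received: int   # bytes read from the backend before EOS
    retries_left: int              # remaining retry budget


def should_retry(attempt):
    """Retry only when a reused connection closed before any response arrived."""
    return (
        attempt.reused_keep_alive
        and attempt.response_bytes_received == 0
        and attempt.retries_left > 0
    )
```

Under this sketch, the failing case from the logs (reused connection, EOS, no 
bytes from nginx) would be retried, while a connection that had already 
returned part of a response would not.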

Let me know if I can provide any other details. If exact scripts to reproduce 
the issue against the example nginx backend described above would be useful, 
I can put those together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
