[jira] [Commented] (TS-3959) Dropped keep-alive connections not being re-established

Nick Muerdter (JIRA) Mon, 30 May 2016 20:11:22 -0700

    [ 
https://issues.apache.org/jira/browse/TS-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307139#comment-15307139
 ]


Nick Muerdter commented on TS-3959:
-----------------------------------

[~jacksontj]: Thanks again for looking into this. Unfortunately, I'm still 
seeing the same behavior (occasional 502s) with the additional patch in 
[#658|https://github.com/apache/trafficserver/pull/658]. I've tried both the 
latest master (as of 
[0c5f3d7|https://github.com/apache/trafficserver/commit/0c5f3d7b1df42736936956b5396b58c04c861378]),
 and the 6.2.x branch with the patches from 
[#554|https://github.com/apache/trafficserver/pull/554] and 
[#658|https://github.com/apache/trafficserver/pull/658] applied. In both cases, 
I can still reproduce the 502 errors.

I've put together a script to hopefully make reproducing this easier: 
https://github.com/GUI/TS-3959 Basically, this sets up nginx behind 
TrafficServer, and then reloads nginx repeatedly while making a constant stream 
of requests. In TrafficServer 5.3.2, I can run the test script indefinitely 
without errors. In TrafficServer 6+ (with or without these patches), you should 
get some 502 errors within a minute of running the script.

With the new logic implemented in 
[#658|https://github.com/apache/trafficserver/pull/658], I think that these 
requests still aren't considered retryable. I haven't had a chance to run 
tcpdumps this time around, but based on the earlier dump, I think the issue is 
that TrafficServer actually does successfully send the request headers when 
this happens (note the successful 2nd HTTP line):

!trafficserver-reset.png!

I believe the ephemeral port used for this connection must still be open on the 
nginx side, but nginx has already sent its {{[FIN, ACK]}} packet to 
TrafficServer to declare the connection closed. So while the port is still 
technically open at this point (which is why TrafficServer can send the request 
on it), the only thing nginx expects is the {{[FIN, ACK]}} packet back from 
TrafficServer to complete the closure. I think that's why TrafficServer 
believes it has sent the request headers (so it no longer considers the request 
retryable), while nginx won't actually accept any new HTTP requests on this 
closing connection (which is why it responds with the {{[RST]}} packet). You 
can see the expected exchange in the trace from a case where the timing does 
work out and the connection closes cleanly:

!trafficserver-closed.png!

This is admittedly somewhat of an edge-case, since it depends on the timing of 
those close packets. Sometimes the connection is able to successfully close 
(like in the above example), but if a request sneaks in and tries to reuse a 
connection as it's being closed, then it triggers these 502 errors in 
TrafficServer 6+ (whereas they were retried in 5.3).
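
To illustrate the half-closed state described above outside of TrafficServer 
and nginx, here is a minimal loopback sketch in Python. It is only an 
approximation: the function name {{send_on_closing_connection}} and the request 
bytes are made up for this example, a plain socket stands in for the backend, 
and whether the client ends up seeing an EOF or a reset depends on timing and 
on the operating system. The point is just that the write on the closing 
connection is accepted locally even though the peer will never process it.

```python
import socket
import time


def send_on_closing_connection(request=b"GET / HTTP/1.1\r\nHost: example\r\n\r\n"):
    """Send a request on a connection that the peer has already closed.

    Returns (bytes_sent, outcome), where outcome is "eof" or "reset".
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    client.settimeout(5)
    server_side, _ = listener.accept()

    # The stand-in "backend" closes first, so its FIN reaches the client
    # before the client tries to reuse the connection.
    server_side.close()
    time.sleep(0.1)

    # The write is still accepted locally: from the client's point of view
    # the connection is only half-closed at this moment.
    bytes_sent = client.send(request)

    try:
        # The peer never wrote anything, so if recv() returns at all it
        # returns an empty byte string (EOF).
        client.recv(4096)
        outcome = "eof"
    except ConnectionError:
        # The peer's stack rejected the late data with a reset.
        outcome = "reset"
    finally:
        client.close()
        listener.close()
    return bytes_sent, outcome


if __name__ == "__main__":
    print(send_on_closing_connection())
```

In both outcomes the sender has "successfully" written the request headers, 
yet no response will ever arrive, which matches what the first trace shows.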

It would be great if there's a way to figure out how to cleanly handle these 
situations (maybe by knowing if the connection is in the process of being 
closed, or retrying in these situations), but I'm not sure how feasible that 
is. I'm also not sure if allowing retries (as 5.3 does) opens the door to 
potentially unsafe retries of requests that shouldn't be retried. Alternatively, if 
there's not a clean way to handle this, is this something that could be made 
configurable (if you're okay with the possibility of retrying unsafe things)? 
Or if this is simply out of scope, or you don't believe the current behavior is 
an issue, let me know.
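
To make the configurable idea a bit more concrete, here is a rough sketch in 
Python of the kind of decision I have in mind. This is not TrafficServer's 
actual logic, and every name in it ({{should_retry}}, {{allow_unsafe_retries}}, 
and so on) is hypothetical. It assumes the proxy knows whether the connection 
was a reused keep-alive connection and whether any response bytes were 
received, and it treats the idempotent methods from RFC 7231 as safe to retry 
automatically.

```python
# Idempotent methods per RFC 7231, section 4.2.2.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def should_retry(method, reused_connection, response_bytes_received,
                 allow_unsafe_retries=False):
    """Decide whether a failed origin request may be retried (hypothetical policy)."""
    # A failure on a freshly opened connection is not the stale keep-alive
    # case discussed here, so this sketch leaves it alone.
    if not reused_connection:
        return False
    # Once any part of a response has arrived, the origin may have acted on
    # the request, so do not retry.
    if response_bytes_received > 0:
        return False
    # Idempotent requests can be repeated without changing the outcome.
    if method.upper() in IDEMPOTENT_METHODS:
        return True
    # Non-idempotent requests (for example POST) are retried only when the
    # operator has explicitly opted in.
    return allow_unsafe_retries
```

With a policy like this, a GET that hits a closing keep-alive connection would 
be retried by default, while a POST would only be retried if the (hypothetical) 
{{allow_unsafe_retries}} option were turned on.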

Does that make sense? Hopefully the scripts to reproduce this will help, but 
let me know if you have any questions or have any trouble running them. Thanks 
again!

> Dropped keep-alive connections not being re-established
> -------------------------------------------------------
>
>                 Key: TS-3959
>                 URL: https://issues.apache.org/jira/browse/TS-3959
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, Network
>    Affects Versions: 6.0.0
>            Reporter: Nick Muerdter
>            Assignee: Thomas Jackson
>            Priority: Blocker
>              Labels: regression
>             Fix For: 7.0.0
>
>         Attachments: trafficserver-closed.png, trafficserver-reset.png
>
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with 
> connection retrying and outgoing keep-alive connections. I believe the 
> changes in behavior might be related to this issue: 
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it 
> sounded more like a regression on the mailing list 
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%[email protected]%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive 
> connections already opened, but then one of the keep-alive connections is 
> closed, the next request to TrafficServer may generate a 502 Server Hangup 
> response when attempting to reuse that connection. Previously, I think 
> TrafficServer was retrying when it encountered a closed keep-alive 
> connection, but that is no longer the case. So if you have a backend that 
> might unexpectedly close its open keep-alive connections, the only way I've 
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing 
> keepalive (proxy.config.http.keep_alive_enabled_out and 
> proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly 
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections 
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular 
> stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer 
> among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think 
> what's happening is that TrafficServer may fail when an old keep-alived 
> connection is reused (it's not common, so it depends on the timing of things 
> and if the connection is from an old nginx worker that has since been shut 
> down). In TrafficServer 5.3.1 these connection failures were retried, but in 
> 6.0.0, no retries occur in this case.
> Here's some debug logs that show the difference in behavior between 6.0.0 and 
> 5.3.1. Note that differences seem to stem from how each version eventually 
> handles the "VC_EVENT_EOS" event following 
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like 
> TrafficServer is reporting an odd empty response from these connections 
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can 
> tell from TCP dumps on the system, nginx is not actually sending any form of 
> response.
> In these example cases the backend server isn't sending back any data (at 
> least as far as I can tell), so from what I understand (and the logic 
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe 
> to retry.
> Let me know if I can provide any other details. Or if exact scripts to 
> reproduce the issues against the example nginx backend I described above 
> would be useful, I could get that together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
