[
https://issues.apache.org/jira/browse/TS-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Muerdter updated TS-3959:
------------------------------
Attachment: trafficserver-reset.png
trafficserver-closed.png
[~jacksontj] Thanks for the update. I gave it a spin, but unfortunately the
problem I was seeing still persists in the 6.2.x branch with the patch from
TS-4328 applied.
The issue seems to be that in my test case, the request headers are actually
being sent from trafficserver to the nginx server (and received on nginx's
port). However, they are being sent on the keepalive connection that was
already closed, so the request fails. I think in TrafficServer 5.3, these
requests were then being retried, but the new logic in TS-4328 means that they
will never be retried (regardless of the
proxy.config.http.connect_attempts_max_retries setting), since the headers
were in fact sent.
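To make the distinction concrete, here is a rough Python sketch of the two retry rules in play. It is a toy model of the behavior as described in this comment, not TrafficServer code: the {{FailedAttempt}} fields and both function names are made up for illustration, and {{max_retries}} only stands in for proxy.config.http.connect_attempts_max_retries.

```python
from dataclasses import dataclass


@dataclass
class FailedAttempt:
    """Hypothetical record of one failed origin request (illustration only)."""
    request_bytes_sent: int       # request bytes written to the origin socket
    response_bytes_received: int  # response bytes read before the failure
    reused_keepalive: bool        # True if the socket came from the keep-alive pool
    attempts_so_far: int          # retries already performed for this request


def retry_if_nothing_sent(attempt: FailedAttempt, max_retries: int) -> bool:
    """Strict rule: never retry once any part of the request has gone out."""
    if attempt.attempts_so_far >= max_retries:
        return False
    return attempt.request_bytes_sent == 0


def retry_if_stale_keepalive(attempt: FailedAttempt, max_retries: int) -> bool:
    """Relaxed rule (assumption): additionally retry when a reused keep-alive
    socket failed before a single response byte came back."""
    if attempt.attempts_so_far >= max_retries:
        return False
    if attempt.request_bytes_sent == 0:
        return True
    return attempt.reused_keepalive and attempt.response_bytes_received == 0
```

For a failure like the one described above (headers sent on a reused keepalive connection, nothing received back), the strict rule refuses to retry while the relaxed rule allows it. A real implementation would presumably also need to consider whether the request method is safe to replay.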
This issue sounds somewhat similar to the issue described here:
https://forum.nginx.org/read.php?2,197927,198435 There the nginx folks seem to
indicate that the problem is with the client not handling the keepalive close.
After looking at some TCP dumps, I'm wondering if this is perhaps some race
condition with TrafficServer in dealing with the closed connections. It looks
like when this problem crops up, nginx is sending the {{[FIN, ACK]}} packet to
TrafficServer (to close the connection), but then TrafficServer is sending
another HTTP request on that socket before acknowledging the closed connection.
As an example, here's what happens when TrafficServer appears to properly
acknowledge the closed connection from nginx and no errors occur. In this case,
TrafficServer is on port 13009, and port 47969 is the ephemeral port being used
between TrafficServer and nginx (keepalive):
!trafficserver-closed.png!
As you can see, nginx sends TrafficServer the {{[FIN, ACK]}} packet, and
TrafficServer acknowledges/closes with its own {{[FIN, ACK]}}. TrafficServer
makes no further attempts to send data on this socket.
However, in the problem cases I sometimes see, which lead to 502 errors,
TrafficServer appears to send a request on the socket before fully closing
it. In this case, TrafficServer is on port 13009, and port 33137 is the
ephemeral port being used between TrafficServer and nginx (keepalive):
!trafficserver-reset.png!
In this case, nginx is sending the {{[FIN, ACK]}} packet to TrafficServer, but
before TrafficServer acknowledges this closure with its own {{[FIN, ACK]}}
packet, TrafficServer has already attempted to send another HTTP request on
this socket. Since nginx has already closed its end of the socket, the request
is answered with an {{RST}} packet, which leads to the 502 error.
I'm not really sure about the TrafficServer internals, but do you see a way to
handle this case separately from the slow connections addressed in TS-4328? Is
there some way to detect when a close is in progress and prevent the initial
request from going out? If not, can these disconnects be distinguished from
other types of failures and can retries be allowed?
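On the detection question, one common client-side technique is to peek at a pooled socket right before reusing it: if the peer's FIN has already arrived, the socket is readable and a peek returns zero bytes. Below is a minimal Python sketch using only the standard {{select}} and {{socket}} modules. The function name {{peer_has_closed}} is hypothetical, and nothing here reflects how TrafficServer manages its server sessions.

```python
import select
import socket


def peer_has_closed(sock: socket.socket) -> bool:
    """Return True if the peer's close (FIN) or a reset is already pending.

    Illustrative helper (name is an assumption): poll with a zero timeout,
    then peek one byte without consuming it.
    """
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        # Nothing queued on the socket: no FIN has been seen yet.
        return False
    try:
        data = sock.recv(1, socket.MSG_PEEK)
    except (ConnectionResetError, BrokenPipeError):
        # The connection was already reset by the peer.
        return True
    except BlockingIOError:
        # Non-blocking socket with nothing to read after all.
        return False
    # An empty read means end-of-stream: the peer closed its side.
    return data == b""
```

A connection pool could call this before handing out an idle socket and open a fresh connection when it returns True. This only narrows the window rather than closing it: a FIN can still arrive between the check and the write, so some form of retry for that case would still be needed.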
And if a reproducible test case would be helpful, I can try to extract our
tests that trigger this into a more isolated case.
> Dropped keep-alive connections not being re-established
> -------------------------------------------------------
>
> Key: TS-3959
> URL: https://issues.apache.org/jira/browse/TS-3959
> Project: Traffic Server
> Issue Type: Bug
> Components: Core, Network
> Affects Versions: 6.0.0
> Reporter: Nick Muerdter
> Assignee: Alan M. Carroll
> Priority: Blocker
> Labels: regression
> Fix For: 7.0.0
>
> Attachments: trafficserver-closed.png, trafficserver-reset.png
>
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with
> connection retrying and outgoing keep-alive connections. I believe the
> changes in behavior might be related to this issue:
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it
> sounded more like a regression on the mailing list
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%[email protected]%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive
> connections already opened, but then one of the keep-alive connections is
> closed, the next request to TrafficServer may generate a 502 Server Hangup
> response when attempting to reuse that connection. Previously, I think
> TrafficServer was retrying when it encountered a closed keep-alive
> connection, but that is no longer the case. So if you have a backend that
> might unexpectedly close its open keep-alive connections, the only way I've
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing
> keepalive (proxy.config.http.keep_alive_enabled_out and
> proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular
> stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer
> among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think
> what's happening is that TrafficServer may fail when an old keep-alived
> connection is reused (it's not common, so it depends on the timing of things
> and if the connection is from an old nginx worker that has since been shut
> down). In TrafficServer 5.3.1 these connection failures were retried, but in
> 6.0.0, no retries occur in this case.
> Here are some debug logs that show the difference in behavior between 6.0.0
> and 5.3.1. Note that the differences seem to stem from how each version
> eventually handles the "VC_EVENT_EOS" event following
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like
> TrafficServer is reporting an odd empty response from these connections
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can
> tell from TCP dumps on the system, nginx is not actually sending any form of
> response.
> In these example cases the backend server isn't sending back any data (at
> least as far as I can tell), so from what I understand (and the logic
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe
> to retry.
> Let me know if I can provide any other details. Or if exact scripts to
> reproduce the issues against the example nginx backend I described above
> would be useful, I could get that together.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)