Mike Baynton commented on TS-4509:

I just started using Trafficserver in front of apache with the mod-itk MPM. 
This is an interesting combination that seems to trigger this issue easily in 
the real world, without the aid of rapid-fire parallelized test scripts.

From mod-itk's homepage: "If you connect to httpd, make a request and then 
make a request on the same connection that gets handled by a different uid, 
mpm-itk simply shuts down the connection. This is perfectly legal according to 
RFC 2616 section 8.1.4, and all major clients seem to handle it well; the web 
server simply simulates a timeout, and the client just opens a new connection 
and retries the request."

I use mod-itk to map different apache name-based hosts to apache workers 
running as different users. Trafficserver seems to act as a concentrator of 
incoming connections, possibly for different domains, onto existing keepalive 
connections to the backend apache. So the wrong worker processes receive lots 
of requests for domains they don't serve and close the connection as described 
above, and rather than reconnecting and retrying, Trafficserver serves up an 
error page.
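
To make the backend side of this concrete, here is a minimal Python sketch of 
a toy server that behaves roughly the way the paragraph above describes. It is 
not mpm-itk code: the Host header stands in for the uid check, request bodies 
are not handled, and the names read_request_head, host_of and serve_connection 
are made up for illustration.

```python
import socket

def read_request_head(sock, buf=b""):
    """Read one HTTP request head. Returns (head, leftover) or (None, b"") on EOF."""
    while b"\r\n\r\n" not in buf:
        chunk = sock.recv(4096)
        if not chunk:
            return None, b""
        buf += chunk
    head, _, rest = buf.partition(b"\r\n\r\n")
    return head, rest

def host_of(head):
    """Extract the lowercased Host header value from a request head."""
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"host":
            return value.strip().lower()
    return b""

def serve_connection(sock):
    """Serve keep-alive requests until the Host changes, then hang up silently."""
    pinned, rest = None, b""
    with sock:
        while True:
            head, rest = read_request_head(sock, rest)
            if head is None:
                return  # client closed the connection
            host = host_of(head)
            if pinned is None:
                pinned = host  # first request pins this connection to one site
            elif host != pinned:
                return  # different site: close without sending any response
            body = b"served for " + pinned
            sock.sendall(
                b"HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body) + body
            )
```

Calling serve_connection on each socket returned by accept() gives a backend 
that answers repeated requests for one site but drops the connection as soon 
as a request for another site arrives on it.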

I've disabled keepalive on the backend servers for now, which seems to have 
fixed it, but will watch this issue to see if I can undo that kludge. Setting 
up apache+mod-itk might be a useful way for developers to test, too. I can 
trigger this manually by pulling up pages from a couple different sites that 
both go through the same ATS->Apache stack, interleaving the requests in two 
browser tabs.
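
For comparison, the client behavior that the mod-itk homepage expects (open a 
new connection and retry) can be sketched in a few lines of Python with the 
standard http.client module. This only illustrates the idea and says nothing 
about how Trafficserver implements or should implement it; get_with_retry and 
STALE_ERRORS are names invented here, and the sketch assumes an idempotent GET 
where the connection died before a response arrived.

```python
import http.client

# Errors that suggest the peer closed a reused keep-alive connection
# before sending a response. RemoteDisconnected is a subclass of
# ConnectionResetError; it is listed separately for readability.
STALE_ERRORS = (
    http.client.RemoteDisconnected,
    ConnectionResetError,
    BrokenPipeError,
)

def get_with_retry(conn, path, headers=None, retries=1):
    """GET over a persistent connection, reconnecting if it has gone stale."""
    attempt = 0
    while True:
        try:
            conn.request("GET", path, headers=headers or {})
            resp = conn.getresponse()
            return resp.status, resp.read()
        except STALE_ERRORS:
            # Drop the dead socket; HTTPConnection opens a fresh one
            # automatically on the next request().
            conn.close()
            if attempt >= retries:
                raise
            attempt += 1
```

Used with a single http.client.HTTPConnection object that is kept around 
between requests, this returns the response from the fresh connection 
instead of surfacing the hangup to the caller.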

> Dropped keep-alive connections not being re-established (TS-3959 continued)
> ---------------------------------------------------------------------------
>                 Key: TS-4509
>                 URL: https://issues.apache.org/jira/browse/TS-4509
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, Network
>            Reporter: Thomas Jackson
>            Assignee: Thomas Jackson
>            Priority: Blocker
>             Fix For: 7.1.0
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
> I've observed some differences in how TrafficServer 6.0.0 behaves with 
> connection retrying and outgoing keep-alive connections. I believe the 
> changes in behavior might be related to this issue: 
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it 
> sounded more like a regression on the mailing list 
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%3cba85d5a2-8b29-44a9-acdc-e7fa8d21f...@apache.org%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive 
> connections already opened, but then one of the keep-alive connections is 
> closed, the next request to TrafficServer may generate a 502 Server Hangup 
> response when attempting to reuse that connection. Previously, I think 
> TrafficServer was retrying when it encountered a closed keep-alive 
> connection, but that is no longer the case. So if you have a backend that 
> might unexpectedly close its open keep-alive connections, the only way I've 
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing 
> keepalive (proxy.config.http.keep_alive_enabled_out and 
> proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly 
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections 
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular 
> stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer 
> among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think 
> what's happening is that TrafficServer may fail when an old keep-alived 
> connection is reused (it's not common, so it depends on the timing of things 
> and if the connection is from an old nginx worker that has since been shut 
> down). In TrafficServer 5.3.1 these connection failures were retried, but in 
> 6.0.0, no retries occur in this case.
> Here's some debug logs that show the difference in behavior between 6.0.0 and 
> 5.3.1. Note that differences seem to stem from how each version eventually 
> handles the "VC_EVENT_EOS" event following 
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like 
> TrafficServer is reporting an odd empty response from these connections 
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can 
> tell from TCP dumps on the system, nginx is not actually sending any form of 
> response.
> In these example cases the backend server isn't sending back any data (at 
> least as far as I can tell), so from what I understand (and the logic 
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe 
> to retry.
> Let me know if I can provide any other details. Or if exact scripts to 
> reproduce the issues against the example nginx backend I described above 
> would be useful, I could get that together.

This message was sent by Atlassian JIRA
