shinrich opened a new pull request #7237:
URL: https://github.com/apache/trafficserver/pull/7237
We first noticed this problem because we were using the per-origin conntrack
mechanism. One of the origins, for which we had set a limit of 3000, was
getting stuck at that limit permanently, so all requests for that origin were
returning 502. Looking more closely, there were 3000 connections to that origin
in the CLOSE-WAIT state (i.e. the peer had sent the FIN but ATS had not
responded).
We took one of the machines in this state out of rotation, created a core
file, and looked at the cop_list on the net handler object on one of the
threads. We eventually found a netvc that was HTTP and pointing at our problem
origin. Looking at the history for the associated state machine, we saw that
it had been iterating on
```
{location = {file = 0x78acc0 "../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc",
             func = 0x78ee20 <HttpSM::state_send_server_request_header(int, void*)::__FUNCTION__> "state_send_server_request_header",
             line = 2026},
 event = 105, reentrancy = 1},
{location = {file = 0x78acc0 "../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc",
             func = 0x78e830 <HttpSM::handle_server_setup_error(int, void*)::__FUNCTION__> "handle_server_setup_error",
             line = 5573},
 event = 105, reentrancy = 1},
```
for the whole history. So presumably the inactivity_cop was sending the
inactivity timeout event every 5 minutes (our setting for the default
inactivity timeout), but the HttpSM was never getting killed. So I created a
one-off build that added a bool to the HttpSM and an assert to detect when
the same state machine passed through state_send_server_request_header with the
inactivity timeout twice. Once we got a core from that, we could see that the
client had been using HTTP/2 and was sending a POST. The tunnel is still alive
in this case. It is set up as the post tunnel. The consumer side is finished
(alive = false); it has written 2822 of 2822 bytes. The producer side is not
finished (alive = true); it has read 0 of 2822 bytes. In the ua_txn,
recv_end_stream is false. So the HTTP/2 client has sent all the expected data
(content-length had been set to 2822) but it did not send a DATA frame with the
EOS bit set. A bad client, but it should not have caused this zombie state.
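
As a rough illustration of that one-off instrumentation (the class, flag, and constant names here are hypothetical, not the actual debug patch), the idea was simply:

```cpp
#include <cassert>

// Hypothetical sketch of the debug-build instrumentation described above:
// remember whether this state machine has already taken an inactivity timeout
// in state_send_server_request_header, and assert if a second one arrives so
// the resulting core captures the stuck state.
constexpr int INACTIVITY_TIMEOUT_EVENT = 105; // the event value seen in the history above

class HttpSMDebug
{
public:
  int
  state_send_server_request_header(int event, void * /* data */)
  {
    if (event == INACTIVITY_TIMEOUT_EVENT) {
      assert(!saw_inactivity_timeout); // second timeout on the same SM -> abort and dump core
      saw_inactivity_timeout = true;
    }
    // ... normal event handling would continue here ...
    return 0;
  }

private:
  bool saw_inactivity_timeout = false; // one-off flag added for the debug build
};
```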
This PR fixes the issue in our production environment. We no longer see
CLOSE-WAIT connections for the origin in question.
The key change is adjusting the logic of HttpTunnel::is_tunnel_alive. The
original logic returns true (alive) if any producer or any consumer is
alive. This was causing the problem in handle_server_setup_error: the tunnel
had one alive producer (the HTTP/2 post user agent) but the consumer had
completed. All the post data had been passed to the server, but the HTTP/2 user
agent had not received the EOS bit, so it would be stuck in this state forever.
Since the tunnel was alive, handle_server_setup_error assumed there was at
least one active consumer left, so it would defer sending the error message
until the consumer (which does not exist) completed.
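
In simplified form (stand-in structs, not the actual HttpTunnel members), the original check behaves roughly like this:

```cpp
#include <vector>

// Simplified stand-ins for the tunnel's producer/consumer entries.
struct Producer {
  bool alive        = false;
  int num_consumers = 0; // used in the revised sketch below
};
struct Consumer {
  bool alive = false;
};

// Original behavior (simplified): any live producer OR any live consumer keeps
// the tunnel "alive", so a stuck HTTP/2 POST producer alone is enough.
bool
is_tunnel_alive_old(const std::vector<Producer> &producers, const std::vector<Consumer> &consumers)
{
  for (auto const &p : producers) {
    if (p.alive) {
      return true;
    }
  }
  for (auto const &c : consumers) {
    if (c.alive) {
      return true;
    }
  }
  return false;
}
```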
I changed the is_tunnel_alive logic to only check for active consumers
(or an active producer with no consumers, which occurs if post buffering is
enabled). This causes handle_server_setup_error to complete and set up an
error response even if the tunnel producer is still active.
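
Reusing the stand-in structs from the sketch above, the revised check looks roughly like this:

```cpp
// Revised behavior (simplified): only live consumers keep the tunnel alive,
// except for a live producer with no consumers attached at all (the
// post-buffering case mentioned above).
bool
is_tunnel_alive_new(const std::vector<Producer> &producers, const std::vector<Consumer> &consumers)
{
  for (auto const &c : consumers) {
    if (c.alive) {
      return true;
    }
  }
  for (auto const &p : producers) {
    if (p.alive && p.num_consumers == 0) {
      return true;
    }
  }
  return false;
}
```

With this shape, the zombie case above (producer alive, all consumers done) reports the tunnel as not alive, letting handle_server_setup_error proceed.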
I had to adjust a few other asserts/checks to deal with slow EOS
READ_COMPLETEs. My branch passes autest and regression tests and has been
running merrily on one of our boxes in the problem colo today.