shinrich opened a new pull request #7237:
URL: https://github.com/apache/trafficserver/pull/7237


   We first noticed this problem because we were using the per-origin conntrack mechanism. One of the origins, for which we had set a limit of 3000, was getting stuck with the limit permanently reached, so all requests for that origin were returning 502. Looking more closely, there were 3000 connections to that origin in the CLOSE-WAIT state (i.e. the peer had sent the FIN but ATS had not responded).
   
   We took one of the machines in this state out of rotation, created a core file, and looked at the cop_list on the net handler object on one of the threads. We eventually found a netvc that was HTTP and pointing at our problem origin. Looking at the history for the associated state machine, we saw that it had been iterating between the following two entries:
   
   ```
   {location = {file = 0x78acc0 "../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc",
        func = 0x78ee20 <HttpSM::state_send_server_request_header(int, void*)::__FUNCTION__> "state_send_server_request_header",
        line = 2026},
    event = 105, reentrancy = 1},
   {location = {file = 0x78acc0 "../../../../../../_vcs/trafficserver9/proxy/http/HttpSM.cc",
        func = 0x78e830 <HttpSM::handle_server_setup_error(int, void*)::__FUNCTION__> "handle_server_setup_error",
        line = 5573},
    event = 105, reentrancy = 1},
   ```
   That pattern repeated for the whole history. So presumably the inactivity cop was sending the inactivity timeout event every 5 minutes (our setting for the default inactivity timeout), but the HttpSM was never getting killed. So I created a one-off build that added a bool to the HttpSM and an assert to detect when the same state machine passed through state_send_server_request_header with the inactivity timeout twice. Once we got a core from that, we could see that the client had been using HTTP/2 and was sending a POST. The tunnel is still alive in this case; it is set up as the post tunnel. The consumer side is finished (alive = false): it has written 2822 of 2822 bytes. The producer side is not finished (alive = true): it has read 0 of 2822. In the ua_txn, recv_end_stream is false. So the HTTP/2 client sent all the expected data (content-length was set to 2822) but never sent a DATA frame with the EOS bit set. A badly behaved client, but it should not have caused this zombie state.
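
   For reference, the one-off instrumentation was roughly of the following shape (a minimal sketch, not the actual diff; `inactivity_timeout_seen` is an illustrative name for the bool member that was added):

   ```cpp
   // Sketch of the one-off debug build (not part of this PR): remember the
   // first inactivity timeout seen in state_send_server_request_header and
   // assert if the same HttpSM ever receives a second one, which should be
   // impossible for a state machine that is being torn down correctly.
   // Added at the top of HttpSM::state_send_server_request_header(int event, void *data):
   if (event == VC_EVENT_INACTIVITY_TIMEOUT) {
     // inactivity_timeout_seen: illustrative bool member added to HttpSM for debugging.
     ink_release_assert(!inactivity_timeout_seen);
     inactivity_timeout_seen = true;
   }
   ```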
   
   This PR fixes the issue in our production environment. We no longer see CLOSE-WAIT connections piling up for the origin in question.
   
   The key change is adjusting the logic in HttpTunnel::is_tunnel_alive. The original logic returns true (alive) if any producer or any consumer is alive. This was causing the problem in handle_server_setup_error: the tunnel had one alive producer (the HTTP/2 POST user agent) but the consumer had completed. All the post data had been passed to the server, but the EOS bit never arrived from the HTTP/2 client, so the producer would stay alive in this state forever.
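
   For context, the original check is essentially of this shape (paraphrased, not quoted from the source):

   ```cpp
   // Paraphrase of the pre-PR HttpTunnel::is_tunnel_alive(): the tunnel
   // reports alive as long as *any* producer or *any* consumer is alive.
   bool
   HttpTunnel::is_tunnel_alive() const
   {
     for (const auto &p : producers) {
       if (p.alive) {
         return true; // a stalled-but-alive producer alone keeps the tunnel "alive"
       }
     }
     for (const auto &c : consumers) {
       if (c.alive) {
         return true;
       }
     }
     return false;
   }
   ```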
   
   Since the tunnel reported alive, handle_server_setup_error assumed there was at least one active consumer left, so it deferred sending the error message until the consumer (which did not exist) completed.
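
   In handle_server_setup_error the relevant decision roughly reduces to the following (again paraphrased, not the literal code):

   ```cpp
   // Paraphrase of the branch in HttpSM::handle_server_setup_error():
   if (tunnel.is_tunnel_alive()) {
     // Defer the error response until the tunnel winds down. With the old
     // is_tunnel_alive(), the stalled POST producer keeps us on this path
     // forever, because the consumer it is supposedly waiting on has
     // already completed.
   } else {
     // Tear down the server entry and build the error response now.
   }
   ```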
   
   I changed the is_tunnel_alive logic to only check for active consumers (or an active producer with no consumers, which occurs when post buffering is enabled). This lets handle_server_setup_error complete and set up an error response even if a tunnel producer is still active.
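
   Conceptually, the adjusted check looks like this (a sketch of the intent rather than the exact diff; the num_consumers test for the post-buffering case is my reading of the description above):

   ```cpp
   // Sketch of the adjusted HttpTunnel::is_tunnel_alive(): only alive
   // consumers keep the tunnel alive, except for an alive producer with no
   // consumers attached, which is the post-buffering case.
   bool
   HttpTunnel::is_tunnel_alive() const
   {
     for (const auto &c : consumers) {
       if (c.alive) {
         return true;
       }
     }
     for (const auto &p : producers) {
       if (p.alive && p.num_consumers == 0) {
         // Post buffering: the producer is filling a buffer before any
         // consumer has been attached, so the tunnel is still doing work.
         return true;
       }
     }
     return false;
   }
   ```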
   
   I had to adjust a few other asserts/checks to deal with slow EOS READ_COMPLETEs. My branch passes the autest and regression tests and has been running merrily on one of our boxes in the problem colo today.

