shinrich edited a comment on issue #7290: URL: https://github.com/apache/trafficserver/issues/7290#issuecomment-819885851
Did some more work today on down servers in our environment. One thing I hadn't noticed before is that an origin failure only contributes to the down-server count if t_state->current.server->connect_result is non-zero, that is, if a real error occurred during the TCP/TLS connection attempt. Many messages are generated in error.log where connect_result is 0 and the failure happened between connect open and the first byte from the server. These transactions are available for retries, but they don't count toward marking a server down.

Locally, we are trying a build that only adds an entry to error.log if it really is a connect failure. It really cut down the noise in our logs. With the noise removed, we see the following cases of origin connection failure in our environment:

* ENET_SSL_CONNECT_FAILED - I added this for the case of ERROR_SSL_ERROR in the TLS handshake negotiation. For us this seems to be mostly due to server cert verification failure.
* Connection timed out [110] - A timeout during the handshake.
* No route to host [113] - The DNS entry for the origin is still there, but the machine has been decommissioned.
* Connection refused [111] - Presumably the service is down.
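
To make the counting rule concrete, here is a minimal sketch of the behaviour described in this comment. It is not the ATS implementation: `OriginAttempt`, `FailureHandling`, `classify`, and `likely_cause` are hypothetical names invented for illustration, and only the `connect_result` field name and the three errno cases come from the comment itself. The logging rule reflects the local build mentioned above, and ENET_SSL_CONNECT_FAILED is left out because its value is not given here.

```cpp
#include <cerrno>
#include <string>

// Hypothetical, simplified stand-in for the per-attempt origin state.
// Only the field name connect_result is taken from the comment.
struct OriginAttempt {
  bool failed;        // the attempt failed before the first byte from the origin
  int connect_result; // 0 means no TCP/TLS connect error was recorded
};

// Hypothetical summary of how one attempt is treated.
struct FailureHandling {
  bool counts_toward_down; // contributes to the down-server count
  bool logged;             // written to error.log (local build behaviour)
};

// Only a failure with a non-zero connect_result is treated as a real
// connect failure. A failure with connect_result == 0 is neither counted
// nor logged, although the comment notes it can still be retried.
inline FailureHandling classify(const OriginAttempt &attempt) {
  if (!attempt.failed) {
    return {false, false};
  }
  const bool real_connect_failure = attempt.connect_result != 0;
  return {real_connect_failure, real_connect_failure};
}

// Maps the errno cases listed in the comment to the cause suggested there.
inline std::string likely_cause(int connect_result) {
  switch (connect_result) {
  case ETIMEDOUT: // "Connection timed out", 110 on Linux
    return "timeout during the handshake";
  case EHOSTUNREACH: // "No route to host", 113 on Linux
    return "DNS entry remains but the machine is decommissioned";
  case ECONNREFUSED: // "Connection refused", 111 on Linux
    return "service presumably down";
  default:
    return "unclassified";
  }
}
```

For example, `classify(OriginAttempt{true, 0})` reports a failure that is neither counted nor logged, while `classify(OriginAttempt{true, ECONNREFUSED})` reports one that is both.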
