Hello - I am hoping someone on the community list can help steer me in the right direction for troubleshooting the following scenario:
I am running a cluster of 4 virtualized nginx open source 1.16.0 servers, each with 4 vCPU cores and 4 GB of RAM. They proxy HTTP (REST API) requests to a pool of about 40 different upstream clusters, each containing between 2 and 8 servers. The upstream application servers themselves run multiple workers per server.

I've recently started seeing an issue where the reported response_time, and typically the reported upstream_response_time, in the nginx access log are drastically different from the response time reported by the application servers themselves. For example, for some requests the typical average response_time is around 5ms with an upstream_response_time of 4ms. During transient periods of high load (approximately 1200-1400 rps), the reported nginx response_time and upstream_response_time spike to around 1 second, while the application logs on the upstream servers still report the same 4ms response time.

The upstream definitions are very simple and look like:

    upstream rest-api-xyz {
        least_conn;
        server 10.1.1.33:8080 max_fails=3 fail_timeout=30; # production-rest-api-xyz01
        server 10.1.1.34:8080 max_fails=3 fail_timeout=30; # production-rest-api-xyz02
    }

One avenue I've considered is that the app servers are accepting the requests and queueing them in a TCP socket locally, but the instrumentation on the app servers suggests this is not the case. In fact, a packet capture on both the nginx server and the app server shows the HTTP request leaving nginx only at the end of the time window. I have not yet examined the capture down to the TCP handshake to see whether the connection negotiation itself is taking an excessive amount of time. I can reproduce the queueing scenario artificially, but it does not appear to be what is happening in my production environment in the scenario described above.

Does anyone here have any experience sorting out something like this?
The upstream_connect_time is not currently part of the log, but if that number were reporting high, I'm not entirely sure what would cause that. Similarly, if the upstream_connect_time does not account for most of the delay, is there anything else I should be looking at?

Thanks,
Jordan
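
P.S. In case it helps frame the question, below is a rough sketch of the kind of log format I would add to capture the connect time alongside the other timings. It is not what is deployed today: the format name "upstream_timing", the field labels (rt, uct, uht, urt) and the log path are placeholders chosen for illustration. The variables themselves ($request_time, $upstream_connect_time, $upstream_header_time, $upstream_response_time, $upstream_addr) are standard nginx variables.

```nginx
# Sketch only: "upstream_timing" and the field labels are hypothetical names.
# The log_format directive belongs in the http {} context.
log_format upstream_timing '$remote_addr [$time_local] "$request" $status '
                           'rt=$request_time '             # total time nginx spent on the request
                           'uct=$upstream_connect_time '   # time to establish the upstream connection
                           'uht=$upstream_header_time '    # time until the upstream response header arrived
                           'urt=$upstream_response_time '  # time until the full upstream response arrived
                           'ua=$upstream_addr';            # which upstream server handled the request

# Assumed log path; adjust to the local layout.
access_log /var/log/nginx/access.log upstream_timing;
```

With something like this in place, comparing uct against urt during a spike should show whether the extra time is spent establishing the upstream connection or somewhere after it.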