Hi list! I'm trying to debug an interesting problem where we randomly get a "high latency" response time from our upstream servers. It appears to occur in about 1.5% of all requests. Here's a description of the tests I've been running to isolate the problem within nginx:
I'm using an endpoint on the upstream servers that responds extremely quickly: a request that just returns the server's current local UNIX timestamp. From the nginx server, I start ApacheBench with 5,000 concurrent connections directly against the upstream server (bypassing nginx). Here is what a typical run of this test looks like:

Document Path:          /time/0
Document Length:        19 bytes

Concurrency Level:      5000
Time taken for tests:   0.756 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Total transferred:      1110000 bytes
HTML transferred:       95000 bytes
Requests per second:    6617.33 [#/sec] (mean)
Time per request:       755.592 [ms] (mean)
Time per request:       0.151 [ms] (mean, across all concurrent requests)
Transfer rate:          1434.62 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        4   63   53.7     35     167
Processing:    22   44   19.1     38     249
Waiting:       17   35   18.8     30     243
Total:         55  107   64.4     73     401

Percentage of the requests served within a certain time (ms)
  50%     73
  66%     77
  75%     84
  80%    202
  90%    222
  95%    231
  98%    250
  99%    251
 100%    401 (longest request)

And here's the same test with the longest response times I've seen:

Document Path:          /time/0
Document Length:        19 bytes

Concurrency Level:      5000
Time taken for tests:   0.807 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Total transferred:      1110000 bytes
HTML transferred:       95000 bytes
Requests per second:    6197.08 [#/sec] (mean)
Time per request:       806.831 [ms] (mean)
Time per request:       0.161 [ms] (mean, across all concurrent requests)
Transfer rate:          1343.51 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        3   45   51.8     17     144
Processing:    10   29   24.4     23     623
Waiting:        9   25   24.4     18     623
Total:         26   75   67.4     39     626

Percentage of the requests served within a certain time (ms)
  50%     39
  66%     42
  75%     45
  80%    173
  90%    190
  95%    199
  98%    217
  99%    218
 100%    626 (longest request)

Not bad. Keep in mind, this is a SINGLE upstream server handling these requests over the network. Once I change the test to point ab at the local nginx, the strange latency issue rears its ugly head. I have 4 upstream servers in my config. Here's what the same test through nginx looks like:

Concurrency Level:      5000
Time taken for tests:   1.602 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Total transferred:      1170000 bytes
HTML transferred:       95000 bytes
Requests per second:    3121.08 [#/sec] (mean)
Time per request:       1602.012 [ms] (mean)
Time per request:       0.320 [ms] (mean, across all concurrent requests)
Transfer rate:          713.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      109  172   39.4    167     246
Processing:   106  505  143.3    530    1248
Waiting:      103  504  143.5    530    1248
Total:        344  677  108.6    696    1358

Percentage of the requests served within a certain time (ms)
  50%    696
  66%    723
  75%    741
  80%    752
  90%    768
  95%    779
  98%    786
  99%    788
 100%   1358 (longest request)

Ack! It's as if nginx decides to add an extra second to some requests for no reason.
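For reference, the ab invocations behind these runs were along these lines (the upstream hostname and port here are placeholders, not the real ones):

$ ab -n 5000 -c 5000 http://upstream-1.example.com:8080/time/0    ## directly against one upstream
$ ab -n 5000 -c 5000 http://127.0.0.1/time/0                      ## through the local nginx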
I've also recorded these test runs with nginx's access log. Here's the log format, first:

log_format main '$remote_addr - - '           ## User's IP address
                '[$time_local] '              ## Date/time
                '"$request" '                 ## User's request URL
                '$status '                    ## HTTP status code
                '$body_bytes_sent '           ## Bytes, body only
                '"$http_referer" '            ## User's referer
                '"$http_user_agent" '         ## User's agent
                '$request_time '              ## nginx response time
                '$upstream_response_time '    ## Upstream response time
                '$bytes_sent '                ## Bytes sent (gzipped)
                '$request_length';            ## Request length

The access log has 10,000 lines in total (i.e. two of these tests with 5,000 concurrent connections each). When I sort it by upstream_response_time, the first 140 lines show roughly 1 s for upstream_response_time, while the remaining 9,860 lines show 700 ms or less. Here's a snippet showing the strangeness, prefixed with line numbers:

    1: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
    2: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
    3: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.026 1.025 234 83
  ...
  138: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 1.000 0.999 234 81
  139: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
  140: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
  141: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
  142: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
  143: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
  ...
 9998: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
 9999: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
10000: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0" 200 19 "-" "ApacheBench/2.3" 0.122 0.002 234 81

The upstream_response_time difference between lines 140 and 141 is nearly 500 ms! The total request_time also shows an interesting gap of almost 300 ms. What's going on here?
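(For reference, the numbered, sorted view above was produced with something along these lines; the field number assumes simple whitespace splitting of the log format shown earlier, which works here only because the ApacheBench user-agent string contains no spaces, and "access.log" stands in for whatever the access_log directive actually points at:

$ sort -rn -k14,14 access.log | awk '{ print NR ": " $0 }' | less

With real-world user agents you'd want awk and a smarter field split instead.)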
The kernels have been tuned on all servers for a high number of open files and TCP buffers:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 119715
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 119715
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ cat /proc/sys/net/core/*mem_*
184217728
184217728
184217728
184217728
184217728

Also for reference, here is the part of my nginx.conf which may be useful for diagnosis:

worker_processes     7;
worker_rlimit_nofile 500000;

events {
    use epoll;
    worker_connections 500000;
    multi_accept       on;
}

http {
    log_format main ...
    access_log ...

    ## -------------------------------------------------------------------
    ## TCP Tuning
    ## -------------------------------------------------------------------
    sendfile    off;
    tcp_nopush  off;
    tcp_nodelay on;

    ## -------------------------------------------------------------------
    ## Max Data Size
    ## -------------------------------------------------------------------
    client_max_body_size        1k;
    client_body_buffer_size     1k;
    client_header_buffer_size   32k;
    large_client_header_buffers 200 32k;

    ## -------------------------------------------------------------------
    ## GZIP
    ## -------------------------------------------------------------------
    gzip            on;
    gzip_min_length 1000;
    gzip_disable    msie6;
    gzip_proxied    any;
    gzip_buffers    100 64k;
    gzip_types      text/javascript;

    ## -------------------------------------------------------------------
    ## Proxy Load Distribution
    ## -------------------------------------------------------------------
    proxy_redirect        off;
    proxy_connect_timeout 5;
    proxy_send_timeout    5;
    proxy_read_timeout    8;
    proxy_next_upstream   error timeout invalid_header http_500 http_502 http_503 http_504;
    proxy_buffering       off;

    ## -------------------------------------------------------------------
    ## Hide 'Server: nginx' Server Header
    ## -------------------------------------------------------------------
    server_tokens     off;
    proxy_pass_header Server;

    upstream ...
    server ...
}
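The upstream and server blocks are elided above; with four backends and no explicit balancing method they follow the stock round-robin pattern, roughly like this (hostnames, ports, and the location path here are placeholders, not my actual config):

upstream backend_pool {
    server app1.example.com:8080;
    server app2.example.com:8080;
    server app3.example.com:8080;
    server app4.example.com:8080;
}

server {
    listen 80;
    location /time/ {
        proxy_pass http://backend_pool;
    }
}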
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx