To speed up the analysis, I've increased the thread count from 4 to 8 on a 16-core machine, and ran strace for about 5 seconds on each process.
Single process (8 threads):

$ strace -cp 16807
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.36    0.330172          21     15479           epoll_wait
 13.59    0.062861           4     14477         1 write
 10.58    0.048955           4     11518        10 recvfrom
  4.44    0.020544          38       537       244 read
  0.02    0.000094           1       135           sendto
  0.01    0.000051           3        16           epoll_ctl
  0.00    0.000000           0        24           close
  0.00    0.000000           0         9           socket
  0.00    0.000000           0         9         9 connect
  0.00    0.000000           0        25           sendmsg
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0        24           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0         9           fcntl
  0.00    0.000000           0         8         7 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.462677                 42279       271 total

Two processes (2x8 threads):

## Draining process
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.65    0.544758          30     18359           epoll_wait
 28.69    0.321283          14     23540           write
 22.60    0.253049          19     13338           recvfrom
  0.04    0.000474           1       786       374 read
  0.03    0.000287           2       187           sendto
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           sendmsg
  0.00    0.000000           0         1           shutdown
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    1.119851                 56215       374 total

## "New" process
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.87    1.588239          11    149253           epoll_wait
  3.84    0.064985          30      2140        31 recvfrom
  1.77    0.029905          13      2388           write
  0.34    0.005737          10       589       130 read
  0.12    0.002018          38        53           close
  0.06    0.000960           8       114         2 sendto
  0.00    0.000031           1        25           shutdown
  0.00    0.000019           0       102           sendmsg
  0.00    0.000019           0        58           epoll_ctl
  0.00    0.000015           0        31           fcntl
  0.00    0.000000           0        31           socket
  0.00    0.000000           0        31        31 connect
  0.00    0.000000           0        94           setsockopt
  0.00    0.000000           0         8           getsockopt
  0.00    0.000000           0        47        29 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    1.691928                154964       223 total

It does indeed appear the new process is contending with the old, even with
just 16 threads on a 16-core box.
A third process (oversubscribed to 24 threads on 16 cores):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.21    0.950863          14     69926           epoll_wait
  2.32    0.022677         208       109           write
  0.42    0.004106          48        85        14 recvfrom
  0.04    0.000439          17        26           close
  0.00    0.000022           1        34           epoll_ctl
  0.00    0.000000           0       136        33 read
  0.00    0.000000           0         1           brk
  0.00    0.000000           0        14           socket
  0.00    0.000000           0        14        14 connect
  0.00    0.000000           0        15         1 sendto
  0.00    0.000000           0        11           sendmsg
  0.00    0.000000           0        13         1 shutdown
  0.00    0.000000           0        50           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0        14           fcntl
  0.00    0.000000           0        34        22 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.978107                 70485        85 total

During this time, each of the three processes was running at roughly
250-350% CPU.

On 1/12/18 9:34 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 09:28:54AM -0600, Samuel Reed wrote:
>> Thanks for your quick answer, Willy.
>>
>> That's a shame to hear but makes sense. We'll try out some ideas for
>> reducing contention. We don't use cpu-map with nbthread; I considered it
>> best to let the kernel take care of this, especially since there are
>> some other processes on that box.
> So that definitely explains why 5 instances start to give you a high load
> with 4 threads on 16 cores. Note, do you happen to see some processes
> running at 100% CPU (or in fact 400% since you have 4 threads)? It would
> be possible that some remaining bugs would cause older processes and their
> threads to spin a bit too much.
>
> If you're interested, when this happens you could run "strace -cp $pid"
> for a few seconds, it will report the syscall count over that period. A
> typical rule of thumb is that if you see more epoll_wait() than recvfrom()
> or read(), there's an issue somewhere in the code.
>
>> I don't really want to fall back to
>> nbproc but we may have to, at least until we get the number of reloads down.
> It's possible, but let's see if there's a way to improve the situation a
> bit by gathering some elements first ;-)
>
> Willy
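As an aside, the epoll_wait-vs-recvfrom/read rule of thumb is easy to script against `strace -c` output. This is just a sketch: it assumes strace's standard `-c` column layout (call count in column 4, syscall name last), and the sample rows are pasted from the "new" process table above rather than captured live.

```shell
#!/bin/sh
# Sketch: compute the epoll_wait vs recvfrom+read call ratio from
# `strace -c` output. In real use you'd pipe in `strace -cp $pid 2>&1`;
# here a few rows from the "new" process above stand in as sample input.
sample=' 93.87    1.588239          11    149253           epoll_wait
  3.84    0.064985          30      2140        31 recvfrom
  0.34    0.005737          10       589       130 read'

# Field 4 of strace -c is the call count; the last field is the syscall
# name. A ratio far above 1 suggests the event loop is spinning.
result=$(printf '%s\n' "$sample" | awk '
  $NF == "epoll_wait"                { ep = $4 }
  $NF == "recvfrom" || $NF == "read" { io += $4 }
  END { printf "epoll_wait=%d io=%d ratio=%.1f", ep, io, ep / io }')
echo "$result"
```

For the "new" process this gives a ratio in the mid-50s, i.e. dozens of epoll_wait() calls per actual read, which is exactly the spinning signature described above.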