Hi Joel,
On Fri, Oct 13, 2017 at 03:22:56PM +0200, Joel W Kall wrote:
> Got some results from strace. Running the reload with sudo takes about 3
> minutes and shows that it spends most of the time on:
>
> 14:39:38.077925 poll([{fd=6, events=POLLIN}], 1, -1) = ?
> ERESTART_RESTARTBLOCK (Interrupted by signal) <180.434742>
>
> But running without sudo it's a bit faster (about 2 minutes) and it spends
> a lot of time on multiple calls like this (note the 5 second timeout on the
> last row):
>
> 14:47:35.040118 ioctl(4, FIONREAD, [149]) = 0 <0.000011>
> 14:47:35.040156 recvfrom(4,
> "\201\t\201\200\0\1\0\1\0\5\0\0\10commodus\6loop54\2se\0"..., 1024, 0,
> {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("213.80.98.2")},
> [16]) = 149 <0.000011>
> 14:47:35.040200 close(4) = 0 <0.000014>
> 14:47:35.040313 brk(0x3ba7000) = 0x3ba7000 <0.000013>
> 14:47:35.040440 brk(0x3bcb000) = 0x3bcb000 <0.000012>
> 14:47:35.040542 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644,
> st_size=198, ...}) = 0 <0.000014>
> 14:47:35.040591 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644,
> st_size=198, ...}) = 0 <0.000013>
> 14:47:35.040635 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4 <0.000015>
> 14:47:35.040678 fstat(4, {st_mode=S_IFREG|0644, st_size=183, ...}) = 0
> <0.000010>
> 14:47:35.040718 read(4, "127.0.0.1\tlocalhost\n127.0.1.1\tha"..., 4096) =
> 183 <0.000011>
> 14:47:35.040760 read(4, "", 4096) = 0 <0.000010>
> 14:47:35.040795 close(4) = 0 <0.000011>
> 14:47:35.040831 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644,
> st_size=198, ...}) = 0 <0.000013>
> 14:47:35.040876 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
> <0.000014>
> 14:47:35.040914 connect(4, {sa_family=AF_INET, sin_port=htons(53),
> sin_addr=inet_addr("213.80.98.2")}, 16) = 0 <0.000014>
> 14:47:35.040955 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4,
> revents=POLLOUT}]) <0.000011>
> 14:47:35.041015 sendto(4,
> "o\5\1\0\0\1\0\0\0\0\0\0\4jaws\6loop54\2se\0\0\1\0\1", 32, MSG_NOSIGNAL,
> NULL, 0) = 32 <0.000023>
> 14:47:35.041067 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
> <5.005048>
>
> It seems to be doing this for each of our backends, but in some cases it
> gets a timeout which stalls the process for 5 seconds. On our other servers
> it seems to run the same calls but it doesn't timeout as often so the total
> time is much faster.
>
> So some follow up questions:
>
> 1. Does it try to connect to each backend when starting up and won't start
> until it has tested all of them? If so, can we disable this?
Apparently, it tries to do DNS resolutions, using 213.80.98.2 as a DNS
server, but the server doesn't seem to always answer, in this case it won't
answer for jaws.loop54.se, hence the timeout.
If it only happens on one server, maybe you have network issues, and it has
a hard time reaching the DNS server ?
Regards,
Olivier