On Mon, Jan 22, 2018 at 06:16:12PM +0100, Willy Tarreau wrote:
> On Mon, Jan 22, 2018 at 05:47:55PM +0100, Willy Tarreau wrote:
> > > strace: Process 12166 attached
> > > [pid 12166] set_robust_list(0x7ff9bc9aa9e0, 24 <unfinished ...>
> > > [pid 12166] <... set_robust_list resumed> ) = 0
> > > [pid 12166] gettimeofday({1516289044, 684014}, NULL) = 0
> > > [pid 12166] mmap(NULL, 134217728, PROT_NONE,
> > > MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
> > > [pid 12166] <... mmap resumed> ) = 0x7ff9ac000000
> > > [pid 12166] munmap(0x7ff9b0000000, 67108864) = 0
> > > [pid 12166] mprotect(0x7ff9ac000000, 135168, PROT_READ|PROT_WRITE
> > > <unfinished ...>
> > > [pid 12166] <... mprotect resumed> ) = 0
> > > [pid 12166] mmap(NULL, 8003584, PROT_READ|PROT_WRITE,
> > > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> > > [pid 12166] <... mmap resumed> ) = 0x7ff9baa65000
> > > [pid 12166] close(16 <unfinished ...>
> > > [pid 12166] <... close resumed> ) = 0
> > > [pid 12166] fcntl(15, F_SETFL, O_RDONLY|O_NONBLOCK <unfinished ...>
> > > [pid 12166] <... fcntl resumed> ) = 0
> >
> > Here it's getting obvious that it was a shared file descriptor :-(
>
> So I have a suspect here :
>
> - run_thread_poll_loop() runs after the threads are created
> - first thing it does is to close the master-worker pipe FD :
>
> (...)
> if (global.mode & MODE_MWORKER)
> mworker_pipe_register(mworker_pipe);
> (...)
>
> void mworker_pipe_register(int pipefd[2])
> {
> close(mworker_pipe[1]); /* close the write end of the master pipe in
> the children */
> fcntl(mworker_pipe[0], F_SETFL, O_NONBLOCK);
> (...)
> }
>
> Looks familiar with the trace above ?
>
> So I guess your config works in master-worker mode, am I right ?
Exactly, I'm running HAProxy with the "-W -db" flags.
> Note that I'm bothered with the call to protocol_enable_all() as
> well in this function since it will start the proxies multiple times
> in a possibly unsafe mode. That may explain a lot of things suddenly!
>
> I think the attached patch works around it, but I'd like your
> confirmation before cleaning it up.
I applied this single patch on top of 1.8.3, and indeed this seems much
better ! The servers are eventually in an UP state after reloading.
"grep -c EBADF" on yesterday's strace logfile returned around 900 matches.
18 matches with the patchset from your other email, and only 1 with this
patch:
write(-1, "1\n", 2) = -1 EBADF (Bad file descriptor)
There still are some "Socket error" in the logs though, and I noticed some
servers (not all) go DOWN for a couple of seconds just after reload because
of this, before coming up again 4 seconds later (I use "inter 2000 rise 2").
But at least the system recovers properly from this situation and seems to
stay stable afterwards.
I can't see any obvious close() in the strace log which would be causing
trouble (un)fortunately. I'll send the whole log to you privately, so you
can have a look.
26356 11:50:48.780107 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL,
sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:48 ", 21}, {"haproxy",
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"toto_blue/toto_blue0 changed
its IP from to 10.10.4.45 by haproxy-dns/dns-server.", 82}, {"\n", 1}],
msg_controllen=0, msg_
flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 127
26356 11:50:48.780244 getpid() = 10
26356 11:50:48.780307 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL,
sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:48 ", 21}, {"haproxy",
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0
administratively READY thanks to valid DNS answer.", 78}, {"\n", 1}],
msg_controllen=0, msg_flag
s=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 123
26356 11:50:48.780626 getpid() = 10
[...]
26358 11:50:48.818887 recvfrom(16,
"L\324\201\203\0\1\0\0\0\1\0\1\tprivate-1\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0\27\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34
\0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.818981 recvfrom(16,
"?M\201\203\0\1\0\0\0\1\0\1\tprivate-2\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0008\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34
\0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.819064 recvfrom(16,
"Ls\201\203\0\1\0\0\0\1\0\1\tprivate-3\nbackoffice\4stag\16infrastructure\4toto\5tld\0\0\34\0\1\300&\0\6\0\1\0\0\0\27\0I\7ns-1155\tawsdns-16\3org\0\21awsdns-hostmaster\6amazon\3com\0\0\0\0\1\0\0\34
\0\0\3\204\0\22u\0\0\1Q\200\0\0)\20\0\0\0\0\0\0\0", 513, 0, NULL, NULL) = 165
26358 11:50:48.819146 recvfrom(16, 0x7f05f1433c10, 513, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable)
26356 11:50:48.820330 getpid() = 10
26356 11:50:48.820410 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL,
sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 23 10:50:48 ", 21}, {"haproxy",
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 is
DOWN, reason: Socket error, check duration: 0ms. 31 active and 1 backup servers
left. 0 sessions active, 0 requeued, 0 remaining in queue.", 169}, {"\n", 1}],
msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
26356 11:50:48.820591 getpid() = 10
26356 11:50:48.820662 sendmsg(6, {msg_name(110)={sa_family=AF_LOCAL,
sun_path="/dev/log"}, msg_iov(8)=[{"<145>Jan 23 10:50:48 ", 21}, {"haproxy",
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue1 is
DOWN, reason: Socket error, check duration: 0ms. 30 active and 1 backup servers
left. 0 sessions active, 0 requeued, 0 remaining in queue.", 169}, {"\n", 1}],
msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 214
26356 11:50:48.820802 getpid() = 10
[...]
26357 11:50:52.704807 sendmsg(19, {msg_name(110)={sa_family=AF_LOCAL,
sun_path="/dev/log"}, msg_iov(8)=[{"<149>Jan 23 10:50:52 ", 21}, {"haproxy",
17}, {"[", 1}, {"10", 2}, {"]: ", 3}, {"", 0}, {"Server toto_blue/toto_blue0 is
UP, reason: Layer7 check passed, code: 200, info: \"OK\", check duration: 16ms.
1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.",
185}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) =
230
Thanks again for your help !
Marc