Hi Andrey,

On Sun, Jun 12, 2016 at 04:15:49PM +0300, Andrey G. wrote:
> Hi,
> 
> I am seeing quite strange and not simple to reproduce behavior found
> during stress testing of https://github.com/codingfuture/puppet-cfdb .
> After a long testing with different options, I am quite sure the issue
> comes from epoll support as "noepoll" fixes the problem.
> 
> Here's the scenario:
> 
> 1. There are several DB clusters accessible through TCP
> 2. HAProxy 1.6.5 on Debian Jessie accepts clients on UNIX sockets
> under /run/{service}/ (easier auto-configuration, security and per
> role maxconn control)
> 3. external-check python-based (yep..) scripts are used to properly
> identify ready nodes in clusters
> 4. As cluster connections may be wrapped in TLS, external-check
> scripts use dedicated no-check "listen" blocks in HAProxy, but do not
> connect directly.
> 5. As there is a known recently discovered issue with stdout/stderr
> data clobbering, I made sure external-check scripts to close mentioned
> descriptors (the example below uses /dev/null instead).

I think the issue is very similar. In fact, closing stdout/stderr is not
enough, we need to close *all* FDs in the child. epoll uses a dedicated
FD of its own, which the check process inherits across the fork, and I
suspect that this is what is causing the issue. An alternate reason could
be that its presence simply shifts another FD by 1 and makes it conflict
with one FD used by the external check. Normal scripts should not assume
that fds above 2 may be used without initialization. But note that it is
also possible that epoll is assigned fd #2 and causes all the trouble.

So I'd now do this in src/checks.c, right after the fork for external
checks, to guarantee that we close all open FDs:

        if (pid == 0) {
                /* Child */
                extern char **environ;
+               int fd;
+
+               for (fd = 0; fd <= maxfd; fd++)
+                       close(fd);

Is this something you could test? By the way, thanks a lot for your bug
report, it's very well detailed!
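
To illustrate the idea outside of haproxy, here is a minimal standalone
sketch in C. It is not the actual patch: close_fd_range() and
fd_is_open() are hypothetical helpers made up for this example, and the
caller is assumed to supply the upper bound (what "maxfd" stands for in
the snippet above).

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if <fd> is an open descriptor in this
 * process, 0 otherwise. fcntl(F_GETFD) fails with EBADF on unused slots.
 */
int fd_is_open(int fd)
{
        return fcntl(fd, F_GETFD) != -1;
}

/* Hypothetical helper: close every descriptor in [first, last].
 * The return value of close() is ignored on purpose: most slots in the
 * range are expected to be unused and will simply report EBADF.
 */
void close_fd_range(int first, int last)
{
        int fd;

        for (fd = first; fd <= last; fd++)
                close(fd);
}
```

In a forked child one would call close_fd_range(0, maxfd) before the
exec, so that the script starts without any inherited descriptor and
opens only what it really needs. fd_is_open() is only there to verify
the result.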

Willy
