I've been chasing down non-deterministic, random build fails. make
randomly blows up with

write error: stdout

A lot of stracing, experimentation, and pains, determined that this is
due to a sequence of the following events:

1) how make uses the job server file descriptor

2) an intentional leak of the standard error file descriptor in ccache

3) LTO linking results in linker recursively invoking make (that was a
new one to me)

4) A rather nuanced behavior of the Linux PTY driver that can be best
explained with the following short demo:

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main()
{
        struct stat state_buf;

        for (int i=0; i<8; ++i)
        {
                if (fstat(i, &state_buf) == 0)
                {
                        printf("%d: %d/%d\n", i,
                               (int)state_buf.st_dev,
                               (int)state_buf.st_ino);
                }
        }

        printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n");

        fcntl(2, F_SETFL, O_NONBLOCK);

        printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n");
        return 0;
}

I'm sshed into my server, and the output of the above is, for me:

0: 25/3
1: 25/3
2: 25/3
NON BLOCKING: n
NON BLOCKING: y

My stdin, stdout, and stderr is the same dev/ino, and setting non-blocking
mode one any one of them puts them all in non-blocking mode. I'm setting non-
blocking on standard error, and my standard input also becomes non-blocking.

After sifting through sources, and straces, this is what I see is happening.

Starting with make, and how it sets up its job file descriptors:

jobserver_parse_auth (const char *auth)
{
  /* Given the command-line parameter, parse it.  */
  if (sscanf (auth, "%d,%d", &job_fds[0], &job_fds[1]) != 2)

  // …

  /* Make sure our pipeline is valid, and (possibly) create a duplicate pipe,
     that will be closed in the SIGCHLD handler.  If this fails with EBADF,
     the parent has closed the pipe on us because it didn't think we were a
     submake.  If so, warn and default to -j1.  */

  if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0)
}

The TLDR of the above: make reads the job server's file descriptors
from the MAKEFLAGS environment variable, then checks here if they
actually exist. If they don't exist, make will create the job server
pipe. Important: by default they will be file descriptors 3 and 4.
This becomes a key player in this mystery, a little bit later.

When make spawns a child job (other than a recursive make) the job
server file descriptors get closed (marked O_CLOEXEC before the actual
execve), but MAKEFLAGS remains in the environment. They remain open
for a recursive make call, and the recursive instance picks up the
ball from here.

The problem with this approach is that if, somehow, a child process
runs make, and there happens to be a pair of file descriptors here,
make will assume that they're the job server file descriptors.

Another thing that make does is that it sets one of the file
descriptors to non-blocking mode. This is in jobserver_setup:

  /* When using pselect() we want the read to be non-blocking.  */
  set_blocking (job_fds[0], 0);

Now we get to ccache which, amongst things, does the following, in its
ccache.cpp:

// Make a copy of stderr that will not be cached, so things like distcc can
// send networking errors to it.
static void
set_up_uncached_err()
{
  int uncached_fd =
    dup(STDERR_FILENO); // The file descriptor is intentionally leaked.
  if (uncached_fd == -1) {
    log("dup(2) failed: {}", strerror(errno));
    throw Failure(Statistic::internal_error);
  }

  Util::setenv("UNCACHED_ERR_FD", fmt::format("{}", uncached_fd));
}

TLDR: it intentionally leaks a file descriptor. A basic standalone
strace of a standalone cache invocation showed ccache leaking file
descriptor 2 to 3. Now, somewhere else I was also getting file
descriptor 4, I haven't tracked down where, exactly. But it was there,

The final ingredient to this perfect storm, is LTO linking. I have
verified that -flto=auto -ffat-lto-objects results in ld running make.
If you compile helloworld, or something, with -flto=auto
-ffat-lto-objects, and link it:

strace -s 256 -o z -f gcc -flto=auto -ffat-lto-objects -o t t.o

strace will show that, somehow, make gets involved:

123302 execve("/usr/bin/make", ["make", "-f", "/tmp/ccyRcNmv.mk", "-j32",
"all"], 0x234ed40 /* 47 vars */ <unfinished …>

And now, let's go back to what make is doing, here:

static void
close_stdout (void)
{
  int prev_fail = ferror (stdout);
  int fclose_fail = fclose (stdout);

  if (prev_fail || fclose_fail)
    {
      if (fclose_fail)
        perror_with_name (_("write error: stdout"), "");
      else
        O (error, NILF, _("write error: stdout"));
      exit (MAKE_TROUBLE);
    }
}

So, the grand total, this is what was happening when I was building a
large hairball, using make, distcc, cache, and when using LTO:

1) The original instance of make came up, and set up a job server on
file descriptors 3 and 4, and put them into MAKEFLAGS.

2) This was a large build, in recursive directories, with recursive
make invocations.

3) A link command was executed, invoking ccache. Before exec-ing it,
make set CLOEXEC on both file descriptors 3 and 4, and ccache did not
inherit them.

4) ccache duped file descriptor 2 to 3, and invoked distcc, which then
invoked the real gcc to finally do the LTO link. At some point along
the line something got leaked to file descriptor 4. That's the only
piece of the puzzle I've yet to track down, but it didn't seem like it
was very important to figure outwhere it came from.

5) gcc invoked the linker, which invoking make again for LTO.

6) make reads MAKEFLAGS, and sees that it should have job server file
descriptors. make sees something on file descriptors 3 and 4, thinks its the
original job server file descriptors (spoiler: they aren't).

7) this clusterfark wasn't fatal enough to blow up this link. It succeeded,
but:

 set_blocking (job_fds[0], 0);

put the fake job file descriptor in non-blocking mode. This file
descriptor was really the original terminal standard error, courtesy
of ccache. Now, both standard input, output, and error, are in
non-blocking mode.

8) My original build was -j 40. They rolled merrily along, spewing
tons of output, a good chunk of it was from make itself. One of make's
write()s to standard output was EAGAIN-ed. This resulted in the
original "write error: stdout" that prompted me to burn almost a month
chasing this down.

So, now what? Both ccache, and the linker are contributing to this
situation. ccache leaks the file descriptor (to distcc apparently?).
The linker invokes make for its own use, but it doesn't sanitize the
environment. make closes the job server file descriptors, but not
removing them from MAKEFLAGS.

In the meantime, -Orecursive seems to be a temporary bandaid. I think
it works around this frankenbug.

_______________________________________________
ccache mailing list
ccache@lists.samba.org
https://lists.samba.org/mailman/listinfo/ccache

Reply via email to