I've been chasing down non-deterministic, random build fails. make randomly blows up with
write error: stdout A lot of stracing, experimentation, and pains, determined that this is due to a sequence of the following events: 1) how make uses the job server file descriptor 2) an intentional leak of the standard error file descriptor in ccache 3) LTO linking results in linker recursively invoking make (that was a new one to me) 4) A rather nuanced behavior of the Linux PTY driver that can be best explained with the following short demo: #include <sys/types.h> #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/stat.h> int main() { struct stat state_buf; for (int i=0; i<8; ++i) { if (fstat(i, &state_buf) == 0) { printf("%d: %d/%d\n", i, (int)state_buf.st_dev, (int)state_buf.st_ino); } } printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n"); fcntl(2, F_SETFL, O_NONBLOCK); printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n"); return 0; } I'm sshed into my server, and the output of the above is, for me: 0: 25/3 1: 25/3 2: 25/3 NON BLOCKING: n NON BLOCKING: y My stdin, stdout, and stderr is the same dev/ino, and setting non-blocking mode one any one of them puts them all in non-blocking mode. I'm setting non- blocking on standard error, and my standard input also becomes non-blocking. After sifting through sources, and straces, this is what I see is happening. Starting with make, and how it sets up its job file descriptors: jobserver_parse_auth (const char *auth) { /* Given the command-line parameter, parse it. */ if (sscanf (auth, "%d,%d", &job_fds[0], &job_fds[1]) != 2) // … /* Make sure our pipeline is valid, and (possibly) create a duplicate pipe, that will be closed in the SIGCHLD handler. If this fails with EBADF, the parent has closed the pipe on us because it didn't think we were a submake. If so, warn and default to -j1. */ if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0) } The TLDR of the above: make reads the job server's file descriptors from the MAKEFLAGS environment variable, then checks here if they actually exist. If they don't exist, make will create the job server pipe. Important: by default they will be file descriptors 3 and 4. This becomes a key player in this mystery, a little bit later. When make spawns a child job (other than a recursive make) the job server file descriptors get closed (marked O_CLOEXEC before the actual execve), but MAKEFLAGS remains in the environment. They remain open for a recursive make call, and the recursive instance picks up the ball from here. The problem with this approach is that if, somehow, a child process runs make, and there happens to be a pair of file descriptors here, make will assume that they're the job server file descriptors. Another thing that make does is that it sets one of the file descriptors to non-blocking mode. This is in jobserver_setup: /* When using pselect() we want the read to be non-blocking. */ set_blocking (job_fds[0], 0); Now we get to ccache which, amongst things, does the following, in its ccache.cpp: // Make a copy of stderr that will not be cached, so things like distcc can // send networking errors to it. static void set_up_uncached_err() { int uncached_fd = dup(STDERR_FILENO); // The file descriptor is intentionally leaked. if (uncached_fd == -1) { log("dup(2) failed: {}", strerror(errno)); throw Failure(Statistic::internal_error); } Util::setenv("UNCACHED_ERR_FD", fmt::format("{}", uncached_fd)); } TLDR: it intentionally leaks a file descriptor. A basic standalone strace of a standalone cache invocation showed ccache leaking file descriptor 2 to 3. Now, somewhere else I was also getting file descriptor 4, I haven't tracked down where, exactly. But it was there, The final ingredient to this perfect storm, is LTO linking. I have verified that -flto=auto -ffat-lto-objects results in ld running make. If you compile helloworld, or something, with -flto=auto -ffat-lto-objects, and link it: strace -s 256 -o z -f gcc -flto=auto -ffat-lto-objects -o t t.o strace will show that, somehow, make gets involved: 123302 execve("/usr/bin/make", ["make", "-f", "/tmp/ccyRcNmv.mk", "-j32", "all"], 0x234ed40 /* 47 vars */ <unfinished …> And now, let's go back to what make is doing, here: static void close_stdout (void) { int prev_fail = ferror (stdout); int fclose_fail = fclose (stdout); if (prev_fail || fclose_fail) { if (fclose_fail) perror_with_name (_("write error: stdout"), ""); else O (error, NILF, _("write error: stdout")); exit (MAKE_TROUBLE); } } So, the grand total, this is what was happening when I was building a large hairball, using make, distcc, cache, and when using LTO: 1) The original instance of make came up, and set up a job server on file descriptors 3 and 4, and put them into MAKEFLAGS. 2) This was a large build, in recursive directories, with recursive make invocations. 3) A link command was executed, invoking ccache. Before exec-ing it, make set CLOEXEC on both file descriptors 3 and 4, and ccache did not inherit them. 4) ccache duped file descriptor 2 to 3, and invoked distcc, which then invoked the real gcc to finally do the LTO link. At some point along the line something got leaked to file descriptor 4. That's the only piece of the puzzle I've yet to track down, but it didn't seem like it was very important to figure outwhere it came from. 5) gcc invoked the linker, which invoking make again for LTO. 6) make reads MAKEFLAGS, and sees that it should have job server file descriptors. make sees something on file descriptors 3 and 4, thinks its the original job server file descriptors (spoiler: they aren't). 7) this clusterfark wasn't fatal enough to blow up this link. It succeeded, but: set_blocking (job_fds[0], 0); put the fake job file descriptor in non-blocking mode. This file descriptor was really the original terminal standard error, courtesy of ccache. Now, both standard input, output, and error, are in non-blocking mode. 8) My original build was -j 40. They rolled merrily along, spewing tons of output, a good chunk of it was from make itself. One of make's write()s to standard output was EAGAIN-ed. This resulted in the original "write error: stdout" that prompted me to burn almost a month chasing this down. So, now what? Both ccache, and the linker are contributing to this situation. ccache leaks the file descriptor (to distcc apparently?). The linker invokes make for its own use, but it doesn't sanitize the environment. make closes the job server file descriptors, but not removing them from MAKEFLAGS. In the meantime, -Orecursive seems to be a temporary bandaid. I think it works around this frankenbug. _______________________________________________ ccache mailing list ccache@lists.samba.org https://lists.samba.org/mailman/listinfo/ccache