On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache <ccache@lists.samba.org> wrote:
> I've been chasing down non-deterministic, random build fails. Thanks for the thorough investigation! > So, now what? Both ccache, and the linker are contributing to this situation. > ccache leaks the file descriptor (to distcc apparently?). Yes, or rather for anyone who wants to use it. Icecream also uses it. > The linker invokes make for its own use, but it doesn't sanitize the > environment. make closes the job server file descriptors, but not removing > them from MAKEFLAGS. >From my perspective, I don't see anything wrong with ccache dup-ing stderr – I guess the bug can be triggered by the user running something like exec 3>&2 in the shell as well. I can think of an easy workaround for the problem: only dup2 stderr when running the preprocessor or compiler. Done in <https://github.com/ccache/ccache/commit/c4284c78e9ff286ee5208f449af4c2aaba062d37>. -- Joel On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache <ccache@lists.samba.org> wrote: > > I've been chasing down non-deterministic, random build fails. make > randomly blows up with > > write error: stdout > > A lot of stracing, experimentation, and pains, determined that this is > due to a sequence of the following events: > > 1) how make uses the job server file descriptor > > 2) an intentional leak of the standard error file descriptor in ccache > > 3) LTO linking results in linker recursively invoking make (that was a > new one to me) > > 4) A rather nuanced behavior of the Linux PTY driver that can be best > explained with the following short demo: > > #include <sys/types.h> > #include <fcntl.h> > #include <stdio.h> > #include <unistd.h> > #include <sys/stat.h> > > int main() > { > struct stat state_buf; > > for (int i=0; i<8; ++i) > { > if (fstat(i, &state_buf) == 0) > { > printf("%d: %d/%d\n", i, > (int)state_buf.st_dev, > (int)state_buf.st_ino); > } > } > > printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? > "y":"n"); > > fcntl(2, F_SETFL, O_NONBLOCK); > > printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? > "y":"n"); > return 0; > } > > I'm sshed into my server, and the output of the above is, for me: > > 0: 25/3 > 1: 25/3 > 2: 25/3 > NON BLOCKING: n > NON BLOCKING: y > > My stdin, stdout, and stderr is the same dev/ino, and setting non-blocking > mode one any one of them puts them all in non-blocking mode. I'm setting non- > blocking on standard error, and my standard input also becomes non-blocking. > > After sifting through sources, and straces, this is what I see is happening. > > Starting with make, and how it sets up its job file descriptors: > > jobserver_parse_auth (const char *auth) > { > /* Given the command-line parameter, parse it. */ > if (sscanf (auth, "%d,%d", &job_fds[0], &job_fds[1]) != 2) > > // … > > /* Make sure our pipeline is valid, and (possibly) create a duplicate pipe, > that will be closed in the SIGCHLD handler. If this fails with EBADF, > the parent has closed the pipe on us because it didn't think we were a > submake. If so, warn and default to -j1. */ > > if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0) > } > > The TLDR of the above: make reads the job server's file descriptors > from the MAKEFLAGS environment variable, then checks here if they > actually exist. If they don't exist, make will create the job server > pipe. Important: by default they will be file descriptors 3 and 4. > This becomes a key player in this mystery, a little bit later. > > When make spawns a child job (other than a recursive make) the job > server file descriptors get closed (marked O_CLOEXEC before the actual > execve), but MAKEFLAGS remains in the environment. They remain open > for a recursive make call, and the recursive instance picks up the > ball from here. > > The problem with this approach is that if, somehow, a child process > runs make, and there happens to be a pair of file descriptors here, > make will assume that they're the job server file descriptors. > > Another thing that make does is that it sets one of the file > descriptors to non-blocking mode. This is in jobserver_setup: > > /* When using pselect() we want the read to be non-blocking. */ > set_blocking (job_fds[0], 0); > > Now we get to ccache which, amongst things, does the following, in its > ccache.cpp: > > // Make a copy of stderr that will not be cached, so things like distcc can > // send networking errors to it. > static void > set_up_uncached_err() > { > int uncached_fd = > dup(STDERR_FILENO); // The file descriptor is intentionally leaked. > if (uncached_fd == -1) { > log("dup(2) failed: {}", strerror(errno)); > throw Failure(Statistic::internal_error); > } > > Util::setenv("UNCACHED_ERR_FD", fmt::format("{}", uncached_fd)); > } > > TLDR: it intentionally leaks a file descriptor. A basic standalone > strace of a standalone cache invocation showed ccache leaking file > descriptor 2 to 3. Now, somewhere else I was also getting file > descriptor 4, I haven't tracked down where, exactly. But it was there, > > The final ingredient to this perfect storm, is LTO linking. I have > verified that -flto=auto -ffat-lto-objects results in ld running make. > If you compile helloworld, or something, with -flto=auto > -ffat-lto-objects, and link it: > > strace -s 256 -o z -f gcc -flto=auto -ffat-lto-objects -o t t.o > > strace will show that, somehow, make gets involved: > > 123302 execve("/usr/bin/make", ["make", "-f", "/tmp/ccyRcNmv.mk", "-j32", > "all"], 0x234ed40 /* 47 vars */ <unfinished …> > > And now, let's go back to what make is doing, here: > > static void > close_stdout (void) > { > int prev_fail = ferror (stdout); > int fclose_fail = fclose (stdout); > > if (prev_fail || fclose_fail) > { > if (fclose_fail) > perror_with_name (_("write error: stdout"), ""); > else > O (error, NILF, _("write error: stdout")); > exit (MAKE_TROUBLE); > } > } > > So, the grand total, this is what was happening when I was building a > large hairball, using make, distcc, cache, and when using LTO: > > 1) The original instance of make came up, and set up a job server on > file descriptors 3 and 4, and put them into MAKEFLAGS. > > 2) This was a large build, in recursive directories, with recursive > make invocations. > > 3) A link command was executed, invoking ccache. Before exec-ing it, > make set CLOEXEC on both file descriptors 3 and 4, and ccache did not > inherit them. > > 4) ccache duped file descriptor 2 to 3, and invoked distcc, which then > invoked the real gcc to finally do the LTO link. At some point along > the line something got leaked to file descriptor 4. That's the only > piece of the puzzle I've yet to track down, but it didn't seem like it > was very important to figure outwhere it came from. > > 5) gcc invoked the linker, which invoking make again for LTO. > > 6) make reads MAKEFLAGS, and sees that it should have job server file > descriptors. make sees something on file descriptors 3 and 4, thinks its the > original job server file descriptors (spoiler: they aren't). > > 7) this clusterfark wasn't fatal enough to blow up this link. It succeeded, > but: > > set_blocking (job_fds[0], 0); > > put the fake job file descriptor in non-blocking mode. This file > descriptor was really the original terminal standard error, courtesy > of ccache. Now, both standard input, output, and error, are in > non-blocking mode. > > 8) My original build was -j 40. They rolled merrily along, spewing > tons of output, a good chunk of it was from make itself. One of make's > write()s to standard output was EAGAIN-ed. This resulted in the > original "write error: stdout" that prompted me to burn almost a month > chasing this down. > > So, now what? Both ccache, and the linker are contributing to this > situation. ccache leaks the file descriptor (to distcc apparently?). > The linker invokes make for its own use, but it doesn't sanitize the > environment. make closes the job server file descriptors, but not > removing them from MAKEFLAGS. > > In the meantime, -Orecursive seems to be a temporary bandaid. I think > it works around this frankenbug. > > _______________________________________________ > ccache mailing list > ccache@lists.samba.org > https://lists.samba.org/mailman/listinfo/ccache _______________________________________________ ccache mailing list ccache@lists.samba.org https://lists.samba.org/mailman/listinfo/ccache