On Tue, 23 Apr 2013 15:03:49 +0100 Christian Seiler <christ...@iwakd.de> wrote:
> Hi there, > > I've found a problem that comes from the interaction between setns() > and glibc's implementation of fork() (at least on x86_64). It is > rather complicated to describe, and forgive me if my mail is very > long. I know of several ways how to address the problem but wanted to > have a discussion first before actually changing something. > > For starters, let's read some glibc source code, just because it's > so much fun. ;-) > > From nptl/sysdeps/unix/sysv/linux/fork.c (glibc source code, [1]) > > ------------------------------------------------------------------- > _IO_list_lock (); > > #ifndef NDEBUG > pid_t ppid = THREAD_GETMEM (THREAD_SELF, tid); > #endif > > /* ... stuff left out, including syscall to fork itself ... */ > > if (pid == 0) > { > struct pthread *self = THREAD_SELF; > > assert (THREAD_GETMEM (self, tid) != ppid); > > /* ... stuff left out ... */ > > /* Reset locks in the I/O code. */ > _IO_list_resetlock (); > ------------------------------------------------------------------- > > The problem here is the following: Before the actual kernel syscall > fork, glibc's fork() function (which is called from within lxc-attach) > first locks all stdio stuff (for thread safety, because forking with > multiple threads can lead to trouble otherwise), then retrieves the > current thread's thread id (which is equivalent to the process id in > single-threaded applications) and then does the actual fork(). > > Then, if the return value of the kernel is 0 (i.e. we are the child > process of the fork), it retrieves the current thread id again and > then does an assert() that seems trivial, namely that the current > thread id (i.e. process id of the child process) is not equal to the > process id of the parent process it gathered earlier. > > This is when things fall apart with setns(): The current lxc-attach > logic is to do one fork(), then do a setns() in the child and do > another fork() and exec() in the grandchild, the parent and > grandparent both waiting for the child to exit. The first fork() is > irrelevant for this discussion (it is only there for user namespaces > anyway, which don't change anything here), but the main problem is > the following basic logic: > > 1. setns() > 2. fork() > > Because you can have the situation where the initial process has a > thread id of 42. The setns() call for the pid namespace doesn't > actually change the process/thread id (this is by design), it only > makes sure that all children of the current process are in the new pid > namespace (and then have a PPID of 0 inside the pidns). > > Now it gets interesting: fork() creates a said child. The pid of that > child in the root namespace (in which the pid of the parent resides) > is obviously not going to be 42, but most likely 43 or so. However, in > the child namespace, where the child will call the kernel and ask for > its own pid/tid, it may be the case that the child receives the pid > 42! This means that the tid/pid of the child will be 42, and also the > ppid variable in the glibc code will be 42 and then that assert() > will fail! > > This means that even though attach worked perfectly, fork() may fail > spuriously if there is a pid collision. (And it's only glibc's fork(), > not the kernel's, that one still works fine!) > > Now it gets even worse: if assert() fails, it wants to do a printf(). > The problem is that the _IO_list_lock() call before the fork(), which > needs to be there for thread safety, locks all I/O, so printf() > deadlocks. That means that the child process of lxc-attach hangs > because the assert can never cause the process to exit and the parent > process waits forever (it doesn't know the process had some problem, > becaues one would expect fork() to either work or return with an > error, not deadlock). > > The following example shows how the problem manifests itself in real > life: > > root@host:~# ps ax | grep attach > 16225 pts/12 S+ 0:00 lxc-attach -n container -- test -d / > 16226 pts/12 S+ 0:00 lxc-attach -n container -- test -d / > 16227 pts/12 S+ 0:00 lxc-attach -n container -- test -d / > root@host:~# pstree -alp 16225 > lxc-attach,16225 -n container -- test -d / > └─lxc-attach,16226 -n container -- test -d / > └─lxc-attach,16227 -n container -- test -d / > > Separate session: > > root@container:~# ps ax | grep attach > 16226 ? S+ 0:00 lxc-attach -n container -- test -d / > > The process 16226 inside the container corresponds to the process > 16227 outside the container. As you can see, the pid of the process > inside is the same as the parent's pid outside, which, according to my > explanation above, causes the deadlock. > > If I run lxc-attach again with the same parameters on the host, it > will easily succeed, unless I run into the same problem again by > chance. > > To reproduce the problem in order to see what was actually going on > and to be able to debug this and come up with the above explanation, > I ran a script in an endless loop trying to do the above command > again and again until it finally deadlocked. > > Killing the process will terminate lxc-attach successfully, but make > it fail. > > The last bit with the deadlock I see as a clear problem in glibc(), > since the assert() will deadlock either way, regardless of the reason. > I will report that bug later. (Also note that if using -DNDEBUG the > code will fail to compile due to a missing variable declaration, there > should be additional #ifdefs around the asserts(), although this leads > me to believe that nobody has compiled glibc with -DNDEBUG in at least > six years, otherwise this would have surfaced much, much earlier.) > > The question is: What do we do to resolve the other problem? I don't > think the assert in glibc() in an by itself should necessarily go > away, because in the vast, vast, vast majority of cases the check is > very reasonable. > > The long-term solution for glibc could be to also wrap setns() to > remember if it was called to enter a new pidns and in that case not to > do the check, but otherwise still do it. > > But we don't know against what version of glibc lxc-attach is going to > be build, so we need a better solution in the short term: Hi Christian, I agree we need a short term solution and one that works with existing glibcs. My question is if you think that glibc will eventually fix this at some point? I pulled down the most recent core-utils to see if/how they had worked around it but it looks to me like nsenter could possibly run into the same thing unless -F is given. I can try to reproduce the hang with nsenter, which might help encourage glibc to fix it sooner :) > - Call the fork syscall directly via syscall(__NR_fork) > Since lxc-attach is single-threaded anyway, the things glibc does > aren't strictly needed, although the cache for getpid() is not > updated afterwards. > > - Use clone(), where apparently, if I read the glibc source > correctly, the wrapper doesn't use that logic. Makes life a bit more > complicated because we need an extra function and can't just check > the return value of a syscall. Advantages: glibc's clone() wrapper > does seem to update the pid cache according to the source [2] and > we could use CLONE_PARENT to keep only one process around (make > the middle one exit immediately) and make the synchronization logic a > bit simpler. > > - Do something else? > > The only possible complication could arrive from the fact that the > pthread_atfork() handlers are not called with either the fork() > syscall or the clone() function. This is relevant in cases where the > host uses e.g. nss-ldap. That being said, we do exec() away pretty > fast (and any open fds or such nss implementations will have > O_CLOEXEC set), but we will do that only after invoking getpwuid() to > get the shell if no command was specified. If a socket or the such is > still open then, nss could produce some weird effects since it's the > same socket as that of the parent process. On the other hand, > lxc-attach doesn't do nss lookups at all before fork()ing away, so > this is probably a non-issue. > > My suggestion would be to use clone() and to use that occasion to > simplify the synchronization logic between all of those processes by > using CLONE_PARENT. If I can get an agreement on this, I'll implement > this, but I wanted to hear some thoughts in advance. > > -- Christian > > [1] > http://www.sourceware.org/cgi-bin/cvsweb.cgi/libc/nptl/sysdeps/unix/sysv/linux/fork.c?rev=1.16&content-type=text/x-cvsweb-markup&cvsroot=glibc > [2] > http://www.sourceware.org/cgi-bin/cvsweb.cgi/libc/sysdeps/unix/sysv/linux/x86_64/clone.S?rev=1.7&content-type=text/x-cvsweb-markup&cvsroot=glibc > > > ------------------------------------------------------------------------------ > Try New Relic Now & We'll Send You this Cool Shirt > New Relic is the only SaaS-based application performance monitoring > service that delivers powerful full stack analytics. Optimize and > monitor your browser, app, & servers with just a few lines of code. > Try New Relic and get this awesome Nerd Life shirt! > http://p.sf.net/sfu/newrelic_d2d_apr > _______________________________________________ Lxc-devel mailing list > Lxc-devel@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/lxc-devel ------------------------------------------------------------------------------ Try New Relic Now & We'll Send You this Cool Shirt New Relic is the only SaaS-based application performance monitoring service that delivers powerful full stack analytics. Optimize and monitor your browser, app, & servers with just a few lines of code. Try New Relic and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_apr _______________________________________________ Lxc-devel mailing list Lxc-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/lxc-devel