Hi, On Thu, Nov 26, 2020 at 02:09:33PM +0100, Alban Crequy wrote: > Hi, > > With the addfd feature (added in “seccomp: Introduce addfd ioctl to > seccomp user notifier”, commit 7cf97b125455), the new file is > installed in the target process during the SECCOMP_IOCTL_NOTIF_ADDFD > operation and not at the end with the SECCOMP_IOCTL_NOTIF_SEND > operation. This can cause race conditions when the target process is > interrupted by a signal (EINTR) and restarted automatically. > > This is more noticeable in multithreaded processes like with Golang. > In Golang 1.14: > https://golang.org/doc/go1.14 > > "A consequence of the implementation of preemption is that on Unix systems, > > including Linux and macOS systems, programs built with Go 1.14 will receive > > more signals than programs built with earlier releases. This means that > > programs that use packages like syscall or golang.org/x/sys/unix will see > > more slow system calls fail with EINTR errors. Those programs will have to > > handle those errors in some way, most likely looping to try the system call > > again." > > In my test, I added a seccomp policy which returns > SECCOMP_RET_USER_NOTIF on execve() and I added a sleep(2) in the > seccomp agent (using https://github.com/kinvolk/seccompagent/) between > SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND to make it a bit > slow to reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE. I got the > following strace log going on in a loop: > > [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"], > 0xc000063b00 /* 11 vars */ <unfinished ...> > [pid 2656200] <... nanosleep resumed>NULL) = 0 > [pid 2656200] epoll_pwait(7, [], 128, 0, NULL, 0) = 0 > [pid 2656200] getpid() = 1 > [pid 2656200] tgkill(1, 1, SIGURG) = 0 > [pid 2656199] <... execve resumed>) = ? ERESTARTSYS (To be > restarted if SA_RESTART is set) > [pid 2656200] nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...> > [pid 2656199] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, > si_uid=0} --- > [pid 2656199] rt_sigreturn({mask=[]}) = 59 > [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"], > 0xc000063b00 /* 11 vars */ <unfinished ...> > > On the seccomp agent side, the ioctl(SECCOMP_IOCTL_NOTIF_SEND) returns > ENOENT, and then it receives the same notification at the next > iteration of the loop. > > The SIGURG signal is sent by the Golang runtime, causing the execve to > be interrupted, and restarted automatically, triggering the new > seccomp notification. In this example with execve, this is not a big > deal because the seccomp agent doesn't add a fd. But on a open() or > accept() syscall, I fear that the seccomp agent could install a file > descriptor without knowing that the syscall will be interrupted soon > after, but before the SECCOMP_IOCTL_NOTIF_SEND is completed. > > I understand the need to have two different ioctl() to add the fd and > to reply to the seccomp notification because the seccomp agent needs > to know the fd number being assigned before specifying the return > value of the syscall with that number. > > What do you think is the best way to solve this problem? Here are a few ideas: > > - Idea 1: add a second flag for the struct seccomp_notif_resp > “SECCOMP_USER_NOTIF_FLAG_RETURN_FD” to instruct seccomp to override > the return value with the first fd to install. It would not help to > emulate recvfrom() with SCM_RIGHTS but it will solve the problem for > syscalls that return a fd because we can then implement a new ioctl > (“SECCOMP_IOCTL_NOTIF_SEND_WITH_FDS”?) that does the addfd and the > notification response in one step. > > Other ideas but they cause more problems: > > - Idea 2: We need some kind of transactions where the fd is sent with > the first ioctl() and installed in the fd table but marked somehow to > be closed automatically if the syscall is interrupted with EINTR > outside of the control of the seccomp agent. The new fd in the fd > table would be committed at the end if the syscall is not interrupted. > But this introduces other issues: another thread could call dup() on > the fd before it gets closed. Or another process sharing the fd table > with CLONE_FILES could do the same. Should the not-yet-committed fds > be visible in /proc/<pid>/fd/? Or inherited to new processes created > by fork()? > > - Idea 3: We could add fds in a temporary location but not in the > `struct files_struct` of the target process, and only commit at > SECCOMP_IOCTL_NOTIF_SEND time. In this way, threads or processes > sharing the fd table with CLONE_FILES would not be impacted. However, > this could open new race conditions if other threads are installing > fds in the same slots in the fd table. Also, this seems quite > dangerous to add this concept of "inflight" fd for seccomp because > there are already inflight fds for SCM_RIGHT and a garbage collector > to clean circular references (net/unix/garbage.c). If we add an > inflight fd mechanism on seccomp, a malicious user could just use > SECCOMP_IOCTL_NOTIF_ADDFD to send a unix socket that has the > seccomp-fd inflight in SCM_RIGHT. Then, the malicious seccomp agent > would close(seccompFd) and we will be in a situation where both the > seccomp-fd and the unix socket are not attached to any process but > they reference each other, so they cannot be closed. > > What do you think? Is there a better solution?
Idea 1 sounds best to me, but maybe that's because it's the way I originally did the fd support that never landed :) But here's an Idea 4: we add a way to remotely close an fd (I don't see that the current infra can do this, but perhaps I didn't look hard enough), and then when you get ENOENT you have to close the fd. Of course, this can't be via seccomp, so maybe it's even more racy. Tycho