Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition

Tycho Andersen Mon, 30 Nov 2020 15:23:18 -0800

Hi,

On Thu, Nov 26, 2020 at 02:09:33PM +0100, Alban Crequy wrote:
> Hi,
> 
> With the addfd feature (added in “seccomp: Introduce addfd ioctl to
> seccomp user notifier”, commit 7cf97b125455), the new file is
> installed in the target process during the SECCOMP_IOCTL_NOTIF_ADDFD
> operation and not at the end with the SECCOMP_IOCTL_NOTIF_SEND
> operation. This can cause race conditions when the target process is
> interrupted by a signal (EINTR) and restarted automatically.
> 
> This is more noticeable in multithreaded processes like with Golang.
> In Golang 1.14:
> https://golang.org/doc/go1.14
> > "A consequence of the implementation of preemption is that on Unix systems, 
> > including Linux and macOS systems, programs built with Go 1.14 will receive 
> > more signals than programs built with earlier releases. This means that 
> > programs that use packages like syscall or golang.org/x/sys/unix will see 
> > more slow system calls fail with EINTR errors. Those programs will have to 
> > handle those errors in some way, most likely looping to try the system call 
> > again."
> 
> In my test, I added a seccomp policy which returns
> SECCOMP_RET_USER_NOTIF on execve() and I added a sleep(2) in the
> seccomp agent (using https://github.com/kinvolk/seccompagent/) between
> SECCOMP_IOCTL_NOTIF_RECV and SECCOMP_IOCTL_NOTIF_SEND to make it a bit
> slow to reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE. I got the
> following strace log going on in a loop:
> 
> [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> 0xc000063b00 /* 11 vars */ <unfinished ...>
> [pid 2656200] <... nanosleep resumed>NULL) = 0
> [pid 2656200] epoll_pwait(7, [], 128, 0, NULL, 0) = 0
> [pid 2656200] getpid()                  = 1
> [pid 2656200] tgkill(1, 1, SIGURG)      = 0
> [pid 2656199] <... execve resumed>)     = ? ERESTARTSYS (To be
> restarted if SA_RESTART is set)
> [pid 2656200] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
> [pid 2656199] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1,
> si_uid=0} ---
> [pid 2656199] rt_sigreturn({mask=[]})   = 59
> [pid 2656199] execve("/bin/sh", ["sh", "-c", "sleep infinity"],
> 0xc000063b00 /* 11 vars */ <unfinished ...>
> 
> On the seccomp agent side, the ioctl(SECCOMP_IOCTL_NOTIF_SEND) returns
> ENOENT, and then it receives the same notification at the next
> iteration of the loop.
> 
> The SIGURG signal is sent by the Golang runtime, causing the execve to
> be interrupted, and restarted automatically, triggering the new
> seccomp notification. In this example with execve, this is not a big
> deal because the seccomp agent doesn't add a fd. But on a open() or
> accept() syscall, I fear that the seccomp agent could install a file
> descriptor without knowing that the syscall will be interrupted soon
> after, but before the SECCOMP_IOCTL_NOTIF_SEND is completed.
> 
> I understand the need to have two different ioctl() to add the fd and
> to reply to the seccomp notification because the seccomp agent needs
> to know the fd number being assigned before specifying the return
> value of the syscall with that number.
> 
> What do you think is the best way to solve this problem? Here are a few ideas:
> 
> - Idea 1: add a second flag for the struct seccomp_notif_resp
> “SECCOMP_USER_NOTIF_FLAG_RETURN_FD” to instruct seccomp to override
> the return value with the first fd to install. It would not help to
> emulate recvfrom() with SCM_RIGHTS but it will solve the problem for
> syscalls that return a fd because we can then implement a new ioctl
> (“SECCOMP_IOCTL_NOTIF_SEND_WITH_FDS”?) that does the addfd and the
> notification response in one step.
> 
> Other ideas but they cause more problems:
> 
> - Idea 2: We need some kind of transactions where the fd is sent with
> the first ioctl() and installed in the fd table but marked somehow to
> be closed automatically if the syscall is interrupted with EINTR
> outside of the control of the seccomp agent. The new fd in the fd
> table would be committed at the end if the syscall is not interrupted.
> But this introduces other issues: another thread could call dup() on
> the fd before it gets closed. Or another process sharing the fd table
> with CLONE_FILES could do the same. Should the not-yet-committed fds
> be visible in /proc/<pid>/fd/? Or inherited to new processes created
> by fork()?
> 
> - Idea 3: We could add fds in a temporary location but not in the
> `struct files_struct` of the target process, and only commit at
> SECCOMP_IOCTL_NOTIF_SEND time. In this way, threads or processes
> sharing the fd table with CLONE_FILES would not be impacted. However,
> this could open new race conditions if other threads are installing
> fds in the same slots in the fd table. Also, this seems quite
> dangerous to add this concept of "inflight" fd for seccomp because
> there are already inflight fds for SCM_RIGHT and a garbage collector
> to clean circular references (net/unix/garbage.c). If we add an
> inflight fd mechanism on seccomp, a malicious user could just use
> SECCOMP_IOCTL_NOTIF_ADDFD to send a unix socket that has the
> seccomp-fd inflight in SCM_RIGHT. Then, the malicious seccomp agent
> would close(seccompFd) and we will be in a situation where both the
> seccomp-fd and the unix socket are not attached to any process but
> they reference each other, so they cannot be closed.
> 
> What do you think? Is there a better solution?


Idea 1 sounds best to me, but maybe that's because it's the way I
originally did the fd support that never landed :)

But here's an Idea 4: we add a way to remotely close an fd (I don't
see that the current infra can do this, but perhaps I didn't look hard
enough), and then when you get ENOENT you have to close the fd. Of
course, this can't be via seccomp, so maybe it's even more racy.

Tycho

Re: SECCOMP_IOCTL_NOTIF_ADDFD race condition

Reply via email to