On Sun, Feb 4, 2018 at 10:49 AM, Tycho Andersen <ty...@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.

Neat!

>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex. Also worth noting that there
> is one race still present:
>
> 1. a task does a SECCOMP_RET_USER_NOTIF
> 2. the userspace handler reads this notification
> 3. the task dies
> 4. a new task with the same pid starts
> 5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>    that the previous one did
> 6. the userspace handler writes a response

I'm slightly confused. I thought the id was never reused for a given
struct seccomp_filter. (Also, shouldn't the id be u64, not u32?)

On very quick reading, I have a question. What happens if a process has
two seccomp_filters attached, one of them returns SECCOMP_RET_USER_NOTIF,
and the *other* one has a listener?
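
As an illustration of the id question above, here is a minimal userspace
sketch (not code from the patch) of a per-filter id that is a 64-bit
counter and is never reused. Every name in it (filter_model, notif_add,
notif_task_exit, notif_respond) is hypothetical, and the fixed-size table
stands in for whatever list the real filter would keep. Under that
assumption, the response written in step 6 carries the dead task's id, so
the lookup fails instead of landing on the new task with the same pid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8

/* Hypothetical model of one pending notification. */
struct pending_notif {
	uint64_t id;   /* cookie handed to the userspace listener */
	int pid;       /* task that triggered the filter */
	bool in_use;
};

/* Hypothetical stand-in for the per-filter state. */
struct filter_model {
	uint64_t next_id;  /* only ever incremented, never reset */
	struct pending_notif pending[MAX_PENDING];
};

/* Queue a notification; returns its id, or 0 if the table is full. */
static uint64_t notif_add(struct filter_model *f, int pid)
{
	assert(f != NULL);
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!f->pending[i].in_use) {
			f->pending[i].id = ++f->next_id;
			f->pending[i].pid = pid;
			f->pending[i].in_use = true;
			return f->pending[i].id;
		}
	}
	return 0;
}

/* The task died: drop every notification it still had pending. */
static void notif_task_exit(struct filter_model *f, int pid)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (f->pending[i].in_use && f->pending[i].pid == pid)
			f->pending[i].in_use = false;
	}
}

/*
 * The listener writes a response for `id`. Returns the pid that was
 * answered, or -1 if no pending notification has that id any more.
 */
static int notif_respond(struct filter_model *f, uint64_t id)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (f->pending[i].in_use && f->pending[i].id == id) {
			f->pending[i].in_use = false;
			return f->pending[i].pid;
		}
	}
	return -1;
}
```

Walking the six steps through this model with a zero-initialized
filter_model: add for pid 1234, task exit for 1234, add again for 1234,
then respond with the first id. The second add gets a fresh id, and the
stale response is rejected rather than delivered to the new task.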