On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sar...@sargun.me> wrote:
> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keesc...@chromium.org> wrote:
>> On Tue, Feb 13, 2018 at 7:42 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>> This patchset enables seccomp filters to be written in eBPF. Although
>>> this patchset doesn't introduce much of the functionality enabled by
>>> eBPF, it lays the groundwork for it.
>>>
>>> It also introduces the capability to dump eBPF filters via the PTRACE
>>> API in order to make it so that CHECKPOINT_RESTORE will be satisfied.
>>> In the attached samples, there's an example of this. One can then use
>>> BPF_OBJ_GET_INFO_BY_FD in order to get the actual code of the program,
>>> and use that at reload time.
>>>
>>> The primary reason for not adding maps support in this patchset is
>>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>>> If we have a map that the BPF program can read, it can potentially
>>> "change" privileges after running. It seems like doing writes only
>>> is safe, because it can be pure, and side effect free, and therefore
>>> not negatively affect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>>> to an agreement, this can be in a follow-up patchset.
>>
>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>> and it only makes the code more complex. I'd rather stick with cBPF
>> until we have an overwhelmingly good reason to use eBPF as a "native"
>> seccomp filter language.
>>
>> -Kees
>>
> Three reasons:
> 1) The userspace tooling for eBPF is much better than the userspace
> tooling for cBPF. Our use case is specifically to optimize Docker
> policies. This is roughly what their seccomp policy looks like:
> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
> It would be much nicer to be able to leverage eBPF to write this in C,
> or any of the other languages targeting eBPF.
> In addition, if we have write-only maps, we can exfiltrate information
> from seccomp, like arguments and errors, in a relatively cheap way
> compared to cBPF, and then extract this via the bcc stack. Writing cBPF
> via C macros is a pain, and the off-the-shelf cBPF libraries are getting
> no love. The eBPF community is *exploding* with contributions.
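
For readers who have not seen it, this is roughly what "writing cBPF via C macros" looks like. It is a minimal sketch and not taken from the patchset: the policy (deny ptrace with EPERM, allow everything else), the x86-64-only architecture check, and the names example_filter and example_prog are all assumptions chosen for illustration. Only the macros and constants from the Linux UAPI headers are real.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Hypothetical policy: on x86-64, fail ptrace with EPERM, allow the rest.
 * Every jump target is a hand-counted relative offset (jt, jf), which is
 * what makes larger policies tedious to write and review. */
struct sock_filter example_filter[] = {
    /* [0] Load seccomp_data.arch into the accumulator. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] If arch == x86-64 skip one instruction, else fall through. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] Unexpected architecture: kill. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* [3] Load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [4] If nr == ptrace fall through to [5], else skip to [6]. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_ptrace, 0, 1),
    /* [5] Deny with errno EPERM. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [6] Default: allow. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Wrap the instruction array in the struct the kernel expects. */
struct sock_fprog example_prog(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(example_filter) / sizeof(example_filter[0])),
        .filter = example_filter,
    };
    assert(prog.len <= BPF_MAXINSNS);
    return prog;
}
```

The sketch only builds the program. Installing it would additionally need prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog), which is deliberately left out here.
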

Is stage two of this getting runc to support eBPF and docker to change
the default to be written as eBPF? I foresee that being a problem,
mainly because of the kernel versions people use. The point of that
patch was to help the most people, and since your point in (2) is about
performance, that is a trade-off I would be willing to make in order to
have this functionality on more kernel versions. The other alternative
would be to have docker translate to eBPF if the kernel supported it,
but that amount of complexity seems a bit unnecessary for a feature that
was also trying to be "simple".

Or do you plan on wrapping filters onto processes tangentially from the
runtime? In which case, that should be totally fine :)

Anyways, this is kind of a tangent from the main point of getting it
into the kernel; I would just hate to see someone having to maintain
this without there being a path to getting it upstream elsewhere.

>
> 2) In my testing, which so far has been very rudimentary, rewriting
> the policy that libseccomp generates from the Docker policy to use
> eBPF and eBPF maps performs much better than cBPF. The specific case
> tested was to use a bpf array to look up rules for a particular
> syscall. In a super trivial test, this was about 5% lower latency
> than using traditional branches. If you need more evidence of this,
> I can work a little bit more on the maps-related patches, and see if
> I can get some more benchmarking. From my understanding, we would
> need to add "sealing" support for maps, in which they can be marked
> as read-only, and only at that point should an eBPF seccomp program
> be able to read from them.
>
> 3) Eventually, I'd like to use some more advanced capabilities of
> eBPF, like being able to rewrite arguments safely (not things referred
> to by pointers, but just plain old arguments).
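
To make the comparison in (2) concrete, here is a plain userspace C sketch of the two decision shapes: a chain of compares, which is how a cBPF filter has to express an allowlist, versus a single bounds-checked array lookup, which is the kind of thing a bpf array map would enable. This is not eBPF code and not from the patchset. The syscall numbers, TABLE_SIZE, and all function names are assumptions for illustration, and it says nothing about the reported 5% figure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical actions; zero is the default so an unset table slot denies. */
enum action { ACT_ERRNO = 0, ACT_ALLOW = 1 };

/* Hypothetical table size, assumed to cover every syscall number in use. */
#define TABLE_SIZE 512

/* Branch chain: cost grows with the number of rules checked before a hit. */
enum action decide_branches(uint32_t nr)
{
    if (nr == 0)  return ACT_ALLOW;   /* illustrative allowlist entries */
    if (nr == 1)  return ACT_ALLOW;
    if (nr == 60) return ACT_ALLOW;
    return ACT_ERRNO;                 /* default deny */
}

/* Array lookup: one bounds check plus one load, independent of rule count. */
static const uint8_t action_table[TABLE_SIZE] = {
    [0]  = ACT_ALLOW,
    [1]  = ACT_ALLOW,
    [60] = ACT_ALLOW,
};

enum action decide_table(uint32_t nr)
{
    if (nr >= TABLE_SIZE)
        return ACT_ERRNO;             /* out of range: default deny */
    return (enum action)action_table[nr];
}

/* Sanity check that both encodings express the same policy. */
void check_policies_agree(uint32_t limit)
{
    for (uint32_t nr = 0; nr < limit; nr++)
        assert(decide_branches(nr) == decide_table(nr));
}
```

With only three rules the two are close. The difference would be expected to show up with a policy the size of the Docker default, where the branch chain has hundreds of entries.
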
>
>>>
>>>
>>> Sargun Dhillon (3):
>>>   bpf, seccomp: Add eBPF filter capabilities
>>>   seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
>>>     filters
>>>   bpf: Add eBPF seccomp sample programs
>>>
>>>  arch/Kconfig                 |   7 ++
>>>  include/linux/bpf_types.h    |   3 +
>>>  include/linux/seccomp.h      |  12 +++
>>>  include/uapi/linux/bpf.h     |   2 +
>>>  include/uapi/linux/ptrace.h  |   5 +-
>>>  include/uapi/linux/seccomp.h |  15 ++--
>>>  kernel/bpf/syscall.c         |   1 +
>>>  kernel/ptrace.c              |   3 +
>>>  kernel/seccomp.c             | 185 ++++++++++++++++++++++++++++++++++++++-----
>>>  samples/bpf/Makefile         |   9 +++
>>>  samples/bpf/bpf_load.c       |   9 ++-
>>>  samples/bpf/seccomp1_kern.c  |  17 ++++
>>>  samples/bpf/seccomp1_user.c  |  34 ++++++++
>>>  samples/bpf/seccomp2_kern.c  |  24 ++++++
>>>  samples/bpf/seccomp2_user.c  |  66 +++++++++++++++
>>>  15 files changed, 362 insertions(+), 30 deletions(-)
>>>  create mode 100644 samples/bpf/seccomp1_kern.c
>>>  create mode 100644 samples/bpf/seccomp1_user.c
>>>  create mode 100644 samples/bpf/seccomp2_kern.c
>>>  create mode 100644 samples/bpf/seccomp2_user.c
>>>
>>> --
>>> 2.14.1
>>>
>>
>>
>>
>> --
>> Kees Cook
>> Pixel Security

--
Jessie Frazelle
4096R / D4C4 DD60 0D66 F65A 8EFC 511E 18F3 685C 0022 BFF3
pgp.mit.edu