On Tue, Feb 13, 2018 at 9:31 AM, Sargun Dhillon <sar...@sargun.me> wrote:
> On Tue, Feb 13, 2018 at 9:02 AM, Jessie Frazelle <m...@jessfraz.com> wrote:
>> On Tue, Feb 13, 2018 at 11:29 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>> On Tue, Feb 13, 2018 at 7:47 AM, Kees Cook <keesc...@chromium.org> wrote:
>>>> What's the reason for adding eBPF support? seccomp shouldn't need it,
>>>> and it only makes the code more complex. I'd rather stick with cBPF
>>>> until we have an overwhelmingly good reason to use eBPF as a "native"
>>>> seccomp filter language.
>>>>
>>> Three reasons:
>>> 1) The userspace tooling for eBPF is much better than the userspace
>>> tooling for cBPF. Our use case is specifically to optimize Docker
>>> policies. This is roughly what their seccomp policy looks like:
>>> https://github.com/moby/moby/blob/master/profiles/seccomp/default.json.
>>> It would be much nicer to be able to leverage eBPF to write this in C,
>>> or any of the other languages targeting eBPF. In addition, if we
>>> have write-only maps, we can exfiltrate information from seccomp, like
>>> arguments and errors, in a relatively cheap way compared to cBPF, and
>>> then extract this via the bcc stack. Writing cBPF via C macros is a
>>> pain, and the off-the-shelf cBPF libraries are getting no love. The
>>> eBPF community is *exploding* with contributions.

eBPF moving quickly is a disincentive from my perspective, as I want
absolutely zero surprises when it comes to seccomp. :) Given the steady
stream of exploitable flaws in eBPF, I don't want seccomp anywhere near
it. :( Many distros ship with the bpf() syscall disabled, for example
(or entirely compiled out, as in Chrome OS and Android).

The convenience of writing C for eBPF output is certainly nice, but it
seems like either LLVM could grow a cBPF backend, or libseccomp could be
improved to provide the needed features.

Can you explain the exfiltration piece? Do you mean it would be "cheap"
in the sense that the results can be stored and studied without needing
a ptrace manager to catch the failures?

I remain unconvinced that seccomp needs a more descriptive language,
given its limited usage.

> A really naive approach is to take the JSON seccomp policy document
> and convert it to plain old C with switch / case statements. Then
> we can just push that through LLVM and we're in business. Although,
> for some reason, I don't think the folks will want to take a hard dep
> on llvm at runtime, so maybe there's some mechanism where it first
> tries llvm, then tries to create an eBPF application naively, and then
> falls back to cBPF. My primary fear with the first two approaches is
> that given how the policies are written today, it's not conducive to
> staying within the eBPF instruction limit.

How about having libseccomp grow a JSON parser?

>>> 2) In my testing, which thus far has been very rudimentary,
>>> rewriting the policy that libseccomp generates from the Docker policy
>>> to use eBPF and eBPF maps performs much better than cBPF. The
>>> specific case tested was to use a bpf array to look up rules for a
>>> particular syscall. In a super trivial test, this was about 5% lower
>>> latency than using traditional branches. If you need more evidence of
>>> this, I can work a little bit more on the maps related patches, and
>>> see if I can get some more benchmarking.
>>> From my understanding, we
>>> would need to add "sealing" support for maps, in which they can be
>>> marked as read-only, and only at that point should an eBPF seccomp
>>> program be able to read from them.

This came up recently on the libseccomp mailing list. The map lookup is
faster than a linear search, but for large filters, the filter can be
written as a balanced tree (as Chrome does), or reordered by syscall
frequency (as is recommended by minijail), and that appears to get a
much larger improvement than even the map lookup.

>>> 3) Eventually, I'd like to use some more advanced capabilities of
>>> eBPF, like being able to rewrite arguments safely (not things referred
>>> to by pointers, but just plain old arguments).

Much like 1), I don't find this an incentive, as the interactions become
much harder to reason about, and I am concerned we'll open seccomp up to
attack for a relatively small benefit. However, rewriting arguments has
come up in very narrow cases, and Tycho was working on a method of doing
userspace notifications (i.e. without a ptrace manager) to get us
closer.

If the needs Tycho outlined could be addressed fully with eBPF, and we
can very narrowly scope the use of the "extra" eBPF features, I might be
more inclined to merge something like this, but I want to take it very
carefully. Besides creating a dependency on the bpf() syscall, this
would create side channels (via maps) that make me very uncomfortable
when dealing with process isolation. (Though, in theory, this is already
correctly constrained by no-new-privs...)

Tycho, could you get what you needed from eBPF? My impression would be
that you'd still need a user notification mechanism to stop the process,
as the decisions about how to rewrite arguments likely cannot be fully
characterized by the internal eBPF filter.

-Kees

https://patchwork.kernel.org/patch/10199295/

-- 
Kees Cook
Pixel Security
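
To make the balanced-tree point from the reply to 2) concrete, here is a rough sketch that only counts comparisons. It does not emit real cBPF, and the function names and the 300-entry syscall list are made up for illustration. It assumes a filter that tests the syscall number against an allow list, either as one long chain of equality checks or as a balanced binary decision tree over the sorted numbers.

```python
# Illustrative only: counts comparisons; does not generate a real filter.

def linear_compares(allowed, nr):
    """Comparisons a linear chain of 'nr == X' tests makes before deciding."""
    count = 0
    for candidate in allowed:
        count += 1
        if candidate == nr:
            return True, count
    return False, count


def tree_compares(sorted_allowed, nr):
    """Comparisons made when the same checks are laid out as a balanced
    binary decision tree (a binary search over the sorted numbers)."""
    lo, hi = 0, len(sorted_allowed) - 1
    count = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        count += 1                      # equality test at this node
        if sorted_allowed[mid] == nr:
            return True, count
        count += 1                      # ordering test picks a subtree
        if sorted_allowed[mid] < nr:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, count


def worst_case(fn, allowed):
    """Worst-case comparison count over every allowed syscall number."""
    return max(fn(allowed, nr)[1] for nr in allowed)


# Hypothetical policy: 300 allowed syscall numbers (made-up values).
ALLOWED = list(range(0, 600, 2))

if __name__ == "__main__":
    print("linear worst case:", worst_case(linear_compares, ALLOWED))
    print("tree worst case:  ", worst_case(tree_compares, ALLOWED))
```

The frequency-ordering approach would keep the linear chain but put the most commonly issued syscalls first, so the common case exits after a handful of checks. Which layout wins in practice depends on the workload and would need measuring.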