On 16.02.2017 10:32, Daniel P. Berrange wrote:
> On Thu, Feb 16, 2017 at 09:38:59AM +0100, Thomas Huth wrote:
>> On 15.02.2017 19:27, Daniel P. Berrange wrote:
>>> The current impl of seccomp in QEMU is intentionally allowing a huge range
>>> of system calls to be executed. The goal was that running '-sandbox on'
>>> should never break any feature of QEMU, so naturally any syscall that can
>>> be executed on any codepath QEMU takes must be allowed.
>>> This is good for usability because users don't need to understand the
>>> details of the sandbox technology, they merely say "on" and it "just works".
>>> Conversely though, this is bad for security because QEMU has to allow a huge
>>> range of system calls to be used due to its broad functionality.
>>> During initial discussions for seccomp back in 2012 it was suggested that
>>> there might be alternate policies developed for QEMU which deny some
>>> features, but improve security overall. To the best of my knowledge, this
>>> has never been discussed again since then.
>>> In addition, since initially merging, there has been a steady stream of
>>> patches to whitelist further syscalls that were missing. Some of these were
>>> missing due to newly added functionality in QEMU since the original seccomp
>>> impl, while others have been missing since day 1. It is reasonable to expect
>>> that there are still many syscalls missing in the whitelist. In just a
>>> couple of minutes of
>>> comparing the whitelist vs global syscall list it was possible to identify
>>> further missing syscalls. The '-netdev bridge,br=virbr0' network backend
>>> fails because setuid is blocked, preventing execution of the
>>> qemu-bridge-helper program. If built against glibc < 2.9, or running on
>>> kernel < 2.6.27, it will fail to call eventfd() because we only permit the
>>> eventfd2() syscall, not the older eventfd() syscall used on older Linux.
>>> Some ifup scripts used with the -netdev arg may also break due to lack of
>>> chmod, flock, getxattr.
>>> This risk of missing syscalls is why -sandbox defaults to off, and we've
>>> never considered defaulting it to on.
>>> The fundamental problem is that building a whitelist of syscalls used by
>>> emulators is an intractable problem. QEMU on my system links to 183
>>> shared libraries and there is no way in the world that anyone can figure out
>>> which code paths QEMU triggers in these libraries and thus identify which
>>> syscalls will be genuinely needed.
>>> Thus a whitelist based approach for QEMU is doomed to always be missing some
>>> syscalls, resulting in unnecessary aborts of QEMU when it tickles some edge
>>> case. If you are lucky the abort() happens at startup so you see it quickly
>>> and can address it. If you are unlucky the abort() happens after your VM has
>>> been running for days/weeks/months and you lose data.
>>> IOW, seccomp integration as it currently exists today in QEMU offers minimal
>>> security benefits, while at the same time causing spurious crashes which may
>>> cause user data loss from aborting a running VM, discouraging users from
>>> using even the minimal protection it offers.
>>> I think we need to rework our seccomp support so that we can have a high
>>> enough level of confidence in it that it could be enabled by default. At the
>>> same time we need to make it do something more tangibly useful from a
>>> security POV.
>>> First we need to admit that whitelisting is a failed approach, and switch to
>>> using blacklisting. Unless we do this, we'll never have high enough
>>> confidence to enable it by default - something that's never turned on might
>>> as well not exist at all.
>>> There is a reasonably easily identifiable set of syscalls that QEMU should
>>> never be permitted to use, no matter what configuration it is in, what
>>> helpers it spawns, or what libraries it links to. eg reboot, swapon, swapoff,
>>> mount, umount, kexec_*, etc - any syscall that affects global system state,
>>> rather than process local state should be forbidden.
>>> There are some syscalls that are simply hardcoded to return ENOSYS which can
>>> be trivially blacklisted. afs_syscall, break, fattach, ftime, etc (see the
>>> man page 'unimplemented(2)').
>>> There are some syscalls which are considered obsolete - they were previously
>>> useful, but no modern code would call them, as they have been superseded.
>>> For example, readdir replaced by getdents. We could blacklist these by
>>> default, but provide a way to allow use of obsolete syscalls if running on
>>> older systems, e.g. '-sandbox on,obsolete=allow'. They might be obsolete
>>> enough that we want to just block them permanently with no opt in - would
>>> need to analyse when their replacements appeared in widespread use.
>>> There might be a few more syscalls which we can determine are never valid to
>>> use in QEMU or any library or helper program it might run. I expect this
>>> list to be very small though, given the impossibility of auditing code paths
>>> in the millions of lines of code QEMU links to.
>>> Everything else should be allowed.
>>> At this point we have a highly reliable "-sandbox on" which we're not having
>>> to constantly patch.
>>> From here we need a way to allow a user to opt in to more restrictive
>>> policies, accepting that they will block certain features. For example, there
>>> should be a way to disable any means to elevate privileges from QEMU or
>>> things it spawns, e.g. '-sandbox on,elevateprivileges=deny'.
>>> This would not only block the various set*uid|gid functions via seccomp, but
>>> should also call prctl(PR_SET_NO_NEW_PRIVS). This would allow the user to
>>> opt in to a restrictive world if they know they'll not require things like
>>> the setuid bridge helper.
>>> Similarly there should be a '-sandbox on,spawn=deny' which prevents the
>>> ability to fork/exec processes at all, whether privileged or not. This would
>>> block features like the qemu bridge helper, SMB server, ifup/down scripts,
>>> and the exec: protocol. These are all rarely used features though, so an
>>> opt-in to blocking their use is reasonable & desirable.
>>> There could also be a '-sandbox on,resourcecontrol=deny', which prevents
>>> QEMU from setting stuff like process affinity, scheduler priority, etc. Some
>>> uses of QEMU might need this, but normally such controls are left to the
>>> mgmt app above QEMU to set prior to the exec() of QEMU.
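As a rough sketch of how the options proposed above could combine into a
single denylist, here is a small Python model. It is not QEMU code and it
installs no seccomp filter; it only computes which syscall names would be
denied. The option names come from the proposal, but the function names, the
handling of "off", and the exact syscalls listed in each group are assumptions
made for illustration and are certainly incomplete.

```python
# Hypothetical model of the proposed "-sandbox" options; not QEMU code.
# The syscall lists below are illustrative assumptions, not an audited policy.

# Denied whenever the sandbox is on: global-state and unimplemented syscalls.
ALWAYS_DENIED = {
    "reboot", "swapon", "swapoff", "mount", "umount2",
    "kexec_load", "kexec_file_load",
    "afs_syscall", "break", "fattach", "ftime",
}

# option name -> (default setting, syscalls denied when setting is "deny")
OPTIONAL_GROUPS = {
    "obsolete": ("deny", {"readdir"}),
    "elevateprivileges": ("allow", {
        "setuid", "setgid", "setreuid", "setregid",
        "setresuid", "setresgid",
    }),
    "spawn": ("allow", {"fork", "vfork", "execve"}),
    "resourcecontrol": ("allow", {
        "sched_setaffinity", "sched_setscheduler",
        "sched_setparam", "setpriority",
    }),
}


def parse_sandbox(arg):
    """Return the set of denied syscall names for a '-sandbox' value."""
    fields = arg.split(",")
    if fields[0] == "off":
        return set()
    if fields[0] != "on":
        raise ValueError(f"expected 'on' or 'off', got {fields[0]!r}")

    # Start from the defaults, then apply each key=value override.
    settings = {name: default for name, (default, _) in OPTIONAL_GROUPS.items()}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key not in OPTIONAL_GROUPS or value not in ("allow", "deny"):
            raise ValueError(f"bad sandbox option: {field!r}")
        settings[key] = value

    denied = set(ALWAYS_DENIED)
    for name, (_, syscalls) in OPTIONAL_GROUPS.items():
        if settings[name] == "deny":
            denied |= syscalls
    return denied


def is_allowed(syscall, denied):
    """Blacklist semantics: anything not explicitly denied is allowed."""
    return syscall not in denied
```

For example, parse_sandbox("on,spawn=deny") adds fork, vfork and execve to the
base denylist, while any syscall the model has never heard of stays allowed.
A real spawn=deny would also have to deal with clone(), which cannot simply be
blocked because threads are created with it.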
>> I like your proposal! I just wanted to add an idea for an additional
>> parameter (not sure whether it is feasible, though): Something like
>> "-sandbox on,network=off" ... i.e. forbid all system calls that are used
>> for networking. Rationale: Sometimes your VM does not need any
>> networking, and you want to make sure that a malicious guest can also
>> not reach your local network in that case.
> This is pretty tricky. Even if there is no obviously configured network
> backend in QEMU, there's plenty of scope for things in libraries to
> be using networking. Something wants a fully qualified hostname? That'll
> trigger UDP / TCP connections to a DNS resolver. Running with the SDL
> or GTK display frontends - those use networking over UNIX sockets to
> talk to a display server. Linked to glib2? That'll connect to DConf
> over DBus UNIX socket in the background. etc
Oh, too bad. Aren't there at least some system calls which could be used
to block TCP/IP connections, while we still allow local UNIX sockets?
... hmm, maybe that's rather something to solve at the SELinux level
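One possible direction, sketched under assumptions: seccomp filters can compare
syscall argument values, and the first argument of socket() is the address
family, so a rule could in principle reject AF_INET / AF_INET6 while letting
AF_UNIX through. The Python below only models that decision; it installs no
filter, and the function name and the choice of EACCES are made up for
illustration.

```python
import errno
import socket

# Hypothetical rule: deny socket() when its first argument (the address
# family) is an IP family; every other family falls through to "allow".
DENIED_FAMILIES = {socket.AF_INET, socket.AF_INET6}


def filter_socket_call(family):
    """Return 0 to allow the call, or the errno a filter might hand back."""
    if family in DENIED_FAMILIES:
        return errno.EACCES
    return 0
```

Such a rule would only cover newly created sockets. A TCP socket that QEMU
inherits or receives over a UNIX socket would still work, which is one reason
the SELinux level may be the better place for this.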