Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?

Daniel P. Berrange Thu, 16 Feb 2017 01:34:06 -0800

On Thu, Feb 16, 2017 at 09:38:59AM +0100, Thomas Huth wrote:
> On 15.02.2017 19:27, Daniel P. Berrange wrote:
> > The current impl of seccomp in QEMU is intentionally allowing a huge range
> > of system calls to be executed. The goal was that running '-sandbox on'
> > should never break any feature of QEMU, so naturally any syscall that can
> > be executed on any codepath QEMU takes must be allowed.
> > 
> > This is good for usability because users don't need to understand the
> > technical details of the sandbox technology, they merely say "on" and it
> > "just works". Conversely though, this is bad for security because QEMU has
> > to allow a huge range of system calls to be used due to its broad
> > functionality.
> > 
> > During initial discussions for seccomp back in 2012 it was suggested that
> > there might be alternate policies developed for QEMU which deny some
> > features, but improve security overall. To the best of my knowledge, this
> > has never been discussed again since then.
> > 
> > 
> > In addition, since initially merging, there has been a steady stream of
> > patches to whitelist further syscalls that were missing. Some of these
> > were missing due to newly added functionality in QEMU since the original
> > seccomp impl, while others have been missing since day 1. It is
> > reasonable to expect that there are still many syscalls missing in the
> > whitelist. In just a couple of minutes of comparing the whitelist vs
> > global syscall list it was possible to identify two further missing
> > syscalls. The '-netdev bridge,br=virbr0' network backend fails because
> > setuid is blocked, preventing execution of the qemu-bridge-helper
> > program. If built against glibc < 2.9, or running on kernel < 2.6.27, it
> > will fail to call eventfd() because we only permit the eventfd2()
> > syscall, not the older eventfd() syscall used on older Linux. Some ifup
> > scripts used with the -netdev arg may also break due to lack of chmod,
> > flock, getxattr permissions. This risk of missing syscalls is why
> > -sandbox defaults to off, and we've never considered defaulting it to on.
> > 
> > 
> > The fundamental problem is that building a whitelist of syscalls used by
> > QEMU emulators is an intractable problem. QEMU on my system links to 183
> > different shared libraries and there is no way in the world that anyone
> > can figure out which code paths QEMU triggers in these libraries and thus
> > identify which syscalls will be genuinely needed.
> > 
> > Thus a whitelist based approach for QEMU is doomed to always be missing some
> > syscalls, resulting in unnecessary aborts of QEMU when it tickles some edge
> > case. If you are lucky the abort() happens at startup so you see it quickly
> > and can address it. If you are unlucky the abort() happens after your VM has
> > been running for days/weeks/months and you lose data.
> > 
> > IOW, seccomp integration as it currently exists today in QEMU offers minimal
> > security benefits, while at the same time causing spurious crashes which may
> > cause user data loss from aborting a running VM, discouraging users from
> > using even the minimal protection it offers.
> > 
> > I think we need to rework our seccomp support so that we can have a high
> > enough level of confidence in it, that it could be enabled by default. At
> > the same time we need to make it do something more tangibly useful from a
> > security POV.
> > 
> > 
> > First we need to admit that whitelisting is a failed approach, and switch to
> > using blacklisting. Unless we do this, we'll never have high enough
> > confidence to enable it by default - something that's never turned on might
> > as well not exist at all.
> > 
> > 
> > There is a reasonably easily identifiable set of syscalls that QEMU should
> > never be permitted to use, no matter what configuration it is in, what
> > helpers it spawns, or what libraries it links to. eg reboot, swapon,
> > swapoff, syslog, mount, umount, kexec_*, etc - any syscall that affects
> > global system state, rather than process local state, should be forbidden.
> > 
> > There are some syscalls that are simply hardcoded to return ENOSYS which can
> > be trivially blacklisted. afs_syscall, break, fattach, ftime, etc (see the
> > man page 'unimplemented(2)').
> > 
> > There are some syscalls which are considered obsolete - they were previously
> > useful, but no modern code would call them, as they have been superseded.
> > For example, readdir replaced by getdents. We could blacklist these by
> > default but provide a way to allow use of obsolete syscalls if running on
> > older systems, e.g. '-sandbox on,obsolete=allow'. They might be obsolete
> > enough that we decide to just block them permanently with no opt in - would
> > need to analyse when their replacements appeared in widespread use.
> > 
> > There might be a few more syscalls which we can determine are never valid to
> > use in QEMU or any library or helper program it might run. I expect this
> > list to be very small though, given the impossibility of auditing code
> > paths through millions of lines of code QEMU links to.
> > 
> > Everything else should be allowed.
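
To make the shape of that default-allow policy concrete, here is a rough
model in Python. It is only a sketch of the idea, not QEMU code: the group
names and the build_blacklist() / decide() helpers are made up for
illustration, the syscall names are the examples from the text above, and
expanding kexec_* to kexec_load and kexec_file_load is my own reading.

```python
# Hypothetical model of a default-allow ("blacklist") sandbox policy.
# Group names and helper functions are illustrative, not QEMU's API.

# Syscalls that change global system state (examples from the message).
GLOBAL_STATE = {"reboot", "swapon", "swapoff", "syslog", "mount",
                "umount", "kexec_load", "kexec_file_load"}

# Syscalls that are hardcoded to return ENOSYS, see unimplemented(2).
UNIMPLEMENTED = {"afs_syscall", "break", "fattach", "ftime"}

# Obsolete syscalls; readdir is the one example given in the message.
OBSOLETE = {"readdir"}


def build_blacklist(obsolete_allowed=False):
    """Return the set of syscall names the sandbox would refuse."""
    denied = GLOBAL_STATE | UNIMPLEMENTED
    if not obsolete_allowed:
        denied |= OBSOLETE
    return denied


def decide(syscall, blacklist):
    """Default-allow: anything not explicitly listed is permitted."""
    return "deny" if syscall in blacklist else "allow"
```

With this model, decide("eventfd", build_blacklist()) gives "allow" without
anyone having had to remember to whitelist it, which is the whole point of
flipping the default.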
> > 
> > At this point we have a highly reliable "-sandbox on" which we're not having
> > to constantly patch.
> > 
> > From here we need a way to allow a user to opt-in to more restrictive
> > policies, accepting that it will block certain features. For example, there
> > should be a way to disable any means to elevate privileges from QEMU or
> > things it spawns, e.g. '-sandbox on,elevateprivileges=deny'.
> > 
> > This would not only block the various set*uid|gid functions via seccomp,
> > but should also call prctl(PR_SET_NO_NEW_PRIVS). This would allow the user
> > to opt in to a restrictive world if they know they'll not require things
> > like the setuid bridge helper.
> > 
> > Similarly there should be a '-sandbox on,spawn=deny' which prevents the
> > ability to fork/exec processes at all, whether privileged or not. This
> > would block features like the qemu bridge helper, SMB server, ifup/down
> > scripts, migration exec: protocol. These are all rarely used features
> > though, so an opt-in to block their use is reasonable & desirable.
> > 
> > There should also be a '-sandbox on,resourcecontrol=deny', which prevents
> > QEMU from setting stuff like process affinity, scheduler priority, etc.
> > Some uses of QEMU might need them, but normally such controls are left to
> > the mgmt app above QEMU to set prior to the exec() of QEMU.
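
As a rough illustration of how those opt-in switches could layer extra
denials on top of the base policy, here is a small Python model. Everything
in it is an assumption made for the sake of the sketch: the extra_denials()
helper is hypothetical, the syscalls placed in each group are my guess at
what the option would cover, and only the three opt-in switches are
modelled (the obsolete= switch is left out).

```python
# Hypothetical mapping from the proposed opt-in switches to syscalls.
# The membership of each group is a guess, not an audited list.
OPT_IN_GROUPS = {
    "elevateprivileges": {"setuid", "setgid", "setreuid", "setregid",
                          "setresuid", "setresgid"},
    # clone is left out on purpose: threads are created with it, so a
    # real filter would have to inspect its flags rather than block it.
    "spawn": {"fork", "vfork", "execve"},
    "resourcecontrol": {"sched_setaffinity", "sched_setscheduler",
                        "setpriority"},
}


def extra_denials(sandbox_arg):
    """Map e.g. 'on,spawn=deny' to the additional syscalls to block."""
    fields = sandbox_arg.split(",")
    if fields[0] != "on":
        return set()  # sandbox disabled: nothing extra is filtered
    denied = set()
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key not in OPT_IN_GROUPS:
            raise ValueError(f"unknown sandbox option: {key!r}")
        if value == "deny":
            denied |= OPT_IN_GROUPS[key]
        elif value != "allow":
            raise ValueError(f"bad value for {key}: {value!r}")
    return denied
```

A plain "on" adds nothing, so the reliable default is unchanged; each
"=deny" is something the user or mgmt app has to ask for explicitly.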
> 
> I like your proposal! I just wanted to add an idea for an additional
> parameter (not sure whether it is feasible, though): Something like
> "-sandbox on,network=off" ... i.e. forbid all system calls that are used
> for networking. Rationale: Sometimes your VM does not need any
> networking, and you want to make sure that a malicious guest can also
> not reach your local network in that case.


This is pretty tricky. Even if there is no obviously configured network
backend in QEMU, there's plenty of scope for things in libraries to
be using networking. Something wants a fully qualified hostname? That'll
trigger UDP / TCP connections to a DNS resolver. Running with the SDL
or GTK display frontends? Those use networking over UNIX sockets to
talk to a display server. Linked to glib2? That'll connect to DConf
over a DBus UNIX socket in the background. etc
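
To illustrate the UNIX socket point, here is a tiny Python sketch (nothing
QEMU specific, and the message text is made up). It does purely local IPC
with no network interface involved, yet it goes through the same socket API
as TCP or UDP, so a rule keyed on syscall names alone could not tell the two
apart.

```python
import socket

# A connected pair of AF_UNIX stream sockets: local IPC only, similar in
# kind to what a display server or DBus connection would use.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    left.sendall(b"hello display server")
    reply = right.recv(64)
finally:
    left.close()
    right.close()
```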

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
