* Vivek Goyal (vgo...@redhat.com) wrote: > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote: > > * Vivek Goyal (vgo...@redhat.com) wrote: > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > > * Vivek Goyal (vgo...@redhat.com) wrote: > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to > > > > > > root. Keep a > > > > > > whitelisted set of capabilities that we require. This improves > > > > > > security in > > > > > > case virtiofsd is compromised by making it hard for an attacker to > > > > > > gain further > > > > > > access to the system. > > > > > > > > > > Hi Stefan, > > > > > > > > > > I just noticed that this patch set breaks overlayfs on top of > > > > > virtiofs. > > > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > > need CAP_SYS_ADMIN. > > > > > > > > > > man xattr says. > > > > > > > > > > Trusted extended attributes > > > > > Trusted extended attributes are visible and accessible > > > > > only to pro‐ > > > > > cesses that have the CAP_SYS_ADMIN capability. Attributes > > > > > in this > > > > > class are used to implement mechanisms in user space (i.e., > > > > > outside the > > > > > kernel) which keep information in extended attributes to which > > > > > ordinary > > > > > processes should not have access. > > > > > > > > > > There is a chance that overlay moves away from trusted xattr in > > > > > future. > > > > > But for now we need to make it work. This is an important use case for > > > > > kata docker in docker build. > > > > > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" > > > > > and > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run > > > > > daemon > > > > > with this capaibility. > > > > > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > > > Can you explain: > > > > a) What overlayfs uses trusted for? > > > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > > > > Tell me more about this metadata. > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? > > It contains path information which is used for lookup into lower layer. > > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK? > > Overlay is storing its metadata in trusted.* xattrs. If a user modifies > metadata, then various kind of bad things can happen. I think one can > do some kind of checks on metadata to make sure it does not crash > atleast. > > And that's true for any filesystem. Isn't. If user manages to modify > metadata outside of filesystem, then lot of bad things can happen. I > thought that's the reason that people are not comfortable with the > idea of allowing mounting filesystem from inside user namespace because > it makes it easy to mount a hand crafted filesystem. > > Anyway, I think overlayfs is just one use case of trusted xattr. Even > if overlayfs moves away from trusted xattr, what about other users. > We need to have a story around how will we support trusted xattrs > safely. > > > > > > > > b) If something nasty was to write junk into the trusted attributes, > > > > what would happen? > > > > > > This directory is owned by guest. So it should be able to write > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > > > Well, we shouldn't be able to break/crash/escape into the host; how > > much does overlayfs validate trusted.* it uses? > > I thought qemu and kvm are the one who should ensure we should not be > able to break out of sandbox. Kernel implementation could be as > buggy as it wanted to be. We are working with this security model > that kernel is completely untrusted.
But with virtiofs we allow the guest to do a lot of filesystem operations on the host. It's the virtiofsd that has to ensure that these are safe and contained within the fs it's exposed; the qemu/kvm can't protect us from that. That's why we sandbox the virtiofsd like we do - if we allow a priviliged guest to perform calls to an unconstrained virtiofsd it would be able to escape. That's what I want to check. Dave > > > > > > c) I see overlayfs has a fallback check if xattr isn't supported at > > > > all - what is the consequence? > > > > > > It falls back to I think read only mode. > > > > It looks like the fallback is more subtle to me: > > /* > > * Check if upper/work fs supports trusted.overlay.* xattr > > */ > > err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); > > if (err) { > > ofs->noxattr = true; > > ofs->config.index = false; > > ofs->config.metacopy = false; > > pr_warn("upper fs does not support xattr, falling back to > > index=off and metacopy=off.\n"); > > > > but I don't know what index and metacopy are. > > They enable certain features in overlayfs. In fact, we fall back to > lesser capability on if we are running on ext4/xfs. For virtiofs, > we deny the mount completely. > > /* > * We allowed sub-optimal upper fs configuration and don't want to > break > * users over kernel upgrade, but we never allowed remote upper fs, so > * we can enforce strict requirements for remote upper fs. > */ > if (ovl_dentry_remote(ofs->workdir) && > (!d_type || !rename_whiteout || ofs->noxattr)) { > pr_err("upper fs missing required features.\n"); > err = -EINVAL; > goto out; > } > > > > > > For a moment forget about overlayfs. Say a user process in guest with > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > > > passthrough filesystem, so it should go through. But currently it > > > wont. > > > > As long as any effects of what it writes are contained to the area of > > the filesystem exposed to the guest, yes - however it worries me what > > the consequences of broken trusted metadata is. If it's delicate enough > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it. > > Agreed that we need to look into whether having CAP_SYS_ADMIN allow > virtiofsd to break out of jail. > > May be we need to provide that remapping trusted xattr feature so > that we don't have to have CAP_SYS_ADMIN in init_user_ns and can > provide this emulation even when running in user namespace. > > Vivek -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK