On Monday, November 05, 2012 09:39:46 AM Corey Bryant wrote:
> On 11/02/2012 06:14 PM, Paul Moore wrote:
> > On Friday, November 02, 2012 06:00:29 PM Corey Bryant wrote:
> >> On 11/02/2012 05:29 PM, Paul Moore wrote:
> >>> On Tuesday, October 23, 2012 03:55:31 AM Eduardo Otubo wrote:
> >>>> This patch includes a second whitelist right before the main loop. It's
> >>>> a smaller and more restricted whitelist, excluding execve() among many
> >>>> others.
> >>>>
> >>>> v2: * ctx changed to main_loop_ctx
> >>>>     * seccomp_on now inside ifdef
> >>>>     * open syscall added to the main_loop whitelist
> >>>>
> >>>> Signed-off-by: Eduardo Otubo <ot...@linux.vnet.ibm.com>
> >>>
> >>> Unfortunately qemu.org seems to be down for me today so I can't grab the
> >>> latest repo to review/verify this patch (some of my comments/assumptions
> >>> below may be off) but I'm a little confused, hopefully you guys can help
> >>> me out, read below ...
> >>>
> >>> The first call to seccomp_install_filter() will setup a whitelist for the
> >>> syscalls that have been explicitly specified, all others will hit the
> >>> default action TRAP/KILL. The second call to seccomp_install_filter()
> >>> will add a second whitelist for another set of explicitly specified
> >>> syscalls, all others will hit the default action TRAP/KILL.
> >>
> >> That's correct. The goal was to have a 2nd list that is a subset of the
> >> 1st list, and also not include execve() in the 2nd list. At this point
> >> though, since it's late in the release, we've expanded the 2nd list to
> >> be the same as the 1st with the exception of execve() not being in the
> >> 2nd list.
> >>
> >>> The problem occurs when the filters are executed in the kernel when a
> >>> syscall is executed.
> >>> On each syscall the first filter will be executed
> >>> and the action will either be ALLOW or TRAP/KILL, next the second filter
> >>> will be executed and the action will either be ALLOW or TRAP/KILL; since
> >>> the kernel always takes the most restrictive (lowest integer action
> >>> value) action when multiple filters are specified, I think your double
> >>> whitelist value is going to have some inherent problems.
> >>
> >> That's something I hadn't thought of. But TRAP and KILL won't exist
> >> together in our whitelists, and our 2nd whitelist is a subset of the
> >> 1st. So do you think there would still be problems?
> >
> > It doesn't really matter if the default action is TRAP and/or KILL, the
> > point is that if you use a second whitelist after an initial whitelist
> > the effective seccomp filter is going to be only the syscalls you
> > explicitly allowed in the second whitelist. When using multiple seccomp
> > filters on a process, all filters are executed for each syscall and the
> > most restrictive action of all the filters is the action that the kernel
> > takes.
> >
> > Don't get me wrong, I like the idea of progressively restricting QEMU, but
> > if you are going to load multiple seccomp filters into the kernel, you
> > almost certainly only want the first whitelist filter to be the union of
> > all the seccomp filter you intend to load with all subsequent filters
> > being blacklists which progressively remove syscalls which are allowed by
> > the initial whitelist.
>
> That's what we're doing though. The first whitelist is a union of all
> subsequent filters. Of course there's only one subsequent filter at
> this point. But the idea is to start out with a large whitelist for
> initialization and then tighten it up before the main loop when
> presumably less syscalls are needed.

Okay, that's good ... It still seems a bit odd to me, I think a whitelist 1st
blacklist 2nd is a more intuitive and efficient solution but that may just be
me.

> My concern is getting the two whitelists correct. We keep uncovering
> new syscalls as we test.

Of course, this whole whitelist/blacklist discussion assumes the list of
allowed syscalls is correct.

> >>> I might suggest an initial, fairly permissive
> >>> whitelist followed by a follow-on blacklist if you want to disable
> >>> certain syscalls.
> >>
> >> I have to admit I'm nervous about this at this point in QEMU 1.3. It's
> >> getting late in the cycle and we'd hoped to get this in earlier. A more
> >> permissive whitelist is probably going to be the only way we'll
> >> successfully turn -sandbox on by default at this point in QEMU 1.3.
> >
> > Thats fine, I just wanted to point out that I think the multiple whitelist
> > approach is going to have some inherent problems.
>
> Are you thinking there will be problems with the current two-whitelist
> approach, or are you thinking there would be problems in the future if
> we continued restricting the QEMU process with further whitelists? If
> you mean the latter, then I understand your point since QEMU is a single
> process that requires a certain subset of syscalls.

I was originally concerned that you were structuring the whitelists
incorrectly, but it sounds like that is not the case - that's good. I'm still
concerned that the double whitelist approach may result in bigger syscall
filters than necessary but until we get a final-ish list there is no point
worrying about that.

> I'm thinking once the two whitelists are in place, we can move on to
> restricting syscall parameters in the existing whitelists where it makes
> sense ...

Yep, sounds reasonable.

> and then look into your original decomposition approach, where
> parts of qemu are run in separate threads/processes which would allow
> much tighter seccomp restriction.

Ultimately I think this is the right solution if we want to get serious about
making QEMU more resistant to attacks from malicious guests.

--
paul moore
security and virtualization @ redhat
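
The stacking rule discussed in this thread (every loaded filter runs on each syscall and the kernel takes the most restrictive result) can be illustrated with a small model in plain C. This is only a sketch: it does not call libseccomp or the kernel, the struct and function names are hypothetical, and the short syscall lists are made up for illustration and are not QEMU's real whitelists. Only the ordering of the action values (KILL < TRAP < ALLOW) mirrors the kernel's SECCOMP_RET_* constants.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values mirror the ordering of the kernel's SECCOMP_RET_* actions:
 * a lower value is a more restrictive action. */
enum action {
    ACT_KILL  = 0x00000000,
    ACT_TRAP  = 0x00030000,
    ACT_ALLOW = 0x7fff0000
};

/* Hypothetical model of one loaded filter: a list of syscall names,
 * the action for listed syscalls, and the action for everything else. */
struct filter {
    const char *const *listed;
    size_t n_listed;
    enum action match_action;
    enum action default_action;
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Evaluate a single filter for one syscall name. */
static enum action run_filter(const struct filter *f, const char *syscall)
{
    for (size_t i = 0; i < f->n_listed; i++)
        if (strcmp(f->listed[i], syscall) == 0)
            return f->match_action;
    return f->default_action;
}

/* Run every filter in the stack and keep the lowest (most restrictive)
 * action, which is the rule described in the thread. */
static enum action effective_action(const struct filter *stack, size_t n,
                                    const char *syscall)
{
    assert(n > 0);
    enum action result = ACT_ALLOW;
    for (size_t i = 0; i < n; i++) {
        enum action a = run_filter(&stack[i], syscall);
        if (a < result)
            result = a;
    }
    return result;
}

/* Illustrative lists only (assumption, not QEMU's actual whitelists). */
static const char *const init_allowed[] = { "read", "write", "open", "execve" };
static const char *const loop_allowed[] = { "read", "write", "open" };
static const char *const loop_denied[]  = { "execve" };

/* Whitelist followed by a second, smaller whitelist. */
static enum action double_whitelist_action(const char *syscall)
{
    const struct filter stack[] = {
        { init_allowed, COUNT(init_allowed), ACT_ALLOW, ACT_KILL },
        { loop_allowed, COUNT(loop_allowed), ACT_ALLOW, ACT_KILL },
    };
    return effective_action(stack, COUNT(stack), syscall);
}

/* Whitelist followed by a blacklist that removes one syscall. */
static enum action whitelist_then_blacklist_action(const char *syscall)
{
    const struct filter stack[] = {
        { init_allowed, COUNT(init_allowed), ACT_ALLOW, ACT_KILL },
        { loop_denied,  COUNT(loop_denied),  ACT_KILL,  ACT_ALLOW },
    };
    return effective_action(stack, COUNT(stack), syscall);
}
```

With these made-up lists, both stacks deny execve and still allow read, write, and open. The difference is that the second filter in the blacklist variant carries a single rule instead of repeating nearly the whole first list, which is the filter-size concern raised above.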