On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf <[email protected]> wrote: > On 03/02/2016 07:36 PM, Andy Lutomirski wrote: >> >> On Mar 2, 2016 12:10 PM, "Chris Metcalf" <[email protected]> wrote: >>> >>> In prepare_exit_to_usermode(), call task_isolation_ready() >>> when we are checking the thread-info flags, and after we've handled >>> the other work, call task_isolation_enter() unconditionally. >>> >>> In syscall_trace_enter_phase1(), we add the necessary support for >>> strict-mode detection of syscalls. >>> >>> We add strict reporting for the kernel exception types that do >>> not result in signals, namely non-signalling page faults and >>> non-signalling MPX fixups. >>> >>> Signed-off-by: Chris Metcalf <[email protected]> >>> --- >>> arch/x86/entry/common.c | 18 ++++++++++++++++-- >>> arch/x86/kernel/traps.c | 2 ++ >>> arch/x86/mm/fault.c | 2 ++ >>> 3 files changed, 20 insertions(+), 2 deletions(-) >>> >>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c >>> index 03663740c866..27c71165416b 100644 >>> --- a/arch/x86/entry/common.c >>> +++ b/arch/x86/entry/common.c >>> @@ -21,6 +21,7 @@ >>> #include <linux/context_tracking.h> >>> #include <linux/user-return-notifier.h> >>> #include <linux/uprobes.h> >>> +#include <linux/isolation.h> >>> >>> #include <asm/desc.h> >>> #include <asm/traps.h> >>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct >>> pt_regs *regs, u32 arch) >>> */ >>> if (work & _TIF_NOHZ) { >>> enter_from_user_mode(); >>> + if (task_isolation_check_syscall(regs->orig_ax)) { >>> + regs->orig_ax = -1; >>> + return 0; >>> + } >> >> This needs a comment indicating the intended semantics. >> And I've still heard no explanation of why this part can't use seccomp. > > > Here's an excerpt from my earlier reply to you from: > > https://lkml.kernel.org/r/[email protected] > > Admittedly this patch series has been moving very slowly through > review, so it's not surprising we have to revisit some things! > > On 07/21/2015 03:34 PM, Chris Metcalf wrote: >> >> On 07/13/2015 05:47 PM, Andy Lutomirski wrote: >>> >>> If a user wants a syscall to kill them, use >>> seccomp. The kernel isn't at fault if the user does a syscall when it >>> didn't want to enter the kernel. >> >> >> Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT >> was to what I wanted here. One concern is that there doesn't seem >> to be a way to "escape" from seccomp strict mode, i.e. you can't >> call seccomp() again to turn it off - which makes sense for seccomp >> since it's a security issue, but not so much sense with cpu_isolated. >> >> So, do you think there's a good role for the seccomp() API to play >> in achieving this goal? It's certainly not a question of "the kernel at >> fault" but rather "asking the kernel to help catch user mistakes" >> (typically third-party libraries in our customers' experience). You >> could imagine a SECCOMP_SET_MODE_ISOLATED or something. >> >> Alternatively, we could stick with the API proposed in my patch >> series, or something similar, and just try to piggy-back on the seccomp >> internals to make it happen. It would require Kconfig to ensure >> that SECCOMP was enabled though, which obviously isn't currently >> required to do cpu isolation. > > > On looking at this again just now, one thing that strikes me is that > it may not be necessary to forbid the syscall like seccomp does. > It may be sufficient just to trigger the task isolation strict signal > and then allow the syscall to complete. After all, we don't "fail" > any of the other things that upset strict mode, like page faults; we > let them complete, but add a signal. So for consistency, I think it > may in fact make sense to simply trigger the signal but let the > syscall do its thing. After all, perhaps the signal is handled > and logged and we don't mind having the application continue; the > signal handler can certainly choose to fail hard, or in the usual > case of no signal handler, that kills the task just fine too. > Allowing the syscall to complete is really kind of incidental.
No, don't do that. First, if you have a signal pending, a lot of syscalls will abort with -EINTR. Second, if you fire a signal on entry via sigreturn, you're not going to like the results. > > After the changes proposed lower down in this email, this call > site will end up looking like: > > /* Generate a task isolation strict signal if necessary. */ > if ((work & _TIF_TASK_ISOLATION) && task_isolation_strict()) > task_isolation_syscall(regs->orig_ax); > > But do let me know if you think more specifically that there is > a way seccomp can be helpful. Let task isolation users who want to detect when they screw up and do a syscall do it with seccomp. > > >>> work &= ~_TIF_NOHZ; >>> } >>> #endif >>> @@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs >>> *regs, u32 cached_flags) >>> if (cached_flags & _TIF_USER_RETURN_NOTIFY) >>> fire_user_return_notifiers(); >>> >>> + task_isolation_enter(); >>> + >>> /* Disable IRQs and retry */ >>> local_irq_disable(); >>> >>> cached_flags = >>> READ_ONCE(pt_regs_to_thread_info(regs)->flags); >>> >>> - if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS)) >>> + if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) && >>> + task_isolation_ready()) >>> break; >>> >>> } >>> } >>> >>> +#ifdef CONFIG_TASK_ISOLATION >>> +# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS | >>> _TIF_NOHZ) >>> +#else >>> +# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS >>> +#endif >>> + >> >> TIF_NOHZ is already a hack and IMO this just makes it worse. At the >> very least this should have a comment. It really ought to be either a >> static_branch or a flag that's actually switched per cpu. >> But this is also a confusing change. Don't override the definition >> here -- stick it in the header where it belongs. > > > The EXIT_TO_USERMODE_xxx stuff is only defined in common.c, not in a header. > > The more I look at this, though, the more I think it would be cleanest > to add a new flag _TIF_TASK_ISOLATION that is set just for tasks that have > enabled task isolation. That can be used unconditionally to check to see > if we need to call exit_to_usermode_loop() from prepare_exit_to_usermode(), > and also means that non-task-isolation tasks don't need to go into the > exit loop unconditionally, which is obviously a performance drag for them. Sounds better to me. --Andy

