Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote:

> But face it, you can argue until you're blue in the face,

That is not a technical argument though - and i considered and answered every valid technical argument made by you and Thomas. You were either not able to or not willing to counter them.

> [...] but both tglx and I will NAK any and all patches that extend perf/ftrace beyond the passive observing role.

The thing is, perf is *already* well beyond the 'passive observer' role: we already generate lots of 'action' in response to events. We generate notification signals, we write events - all of which can (and do) modify program behavior.

So what's your point? There's no passive observer role really - it's apparently just that you dislike this use of instrumentation while you approve of other uses.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Mon 2011-05-16 10:36:05, James Morris wrote:

> On Fri, 13 May 2011, Ingo Molnar wrote:
>
>>> How do you reason about the behavior of the system as a whole?
>>
>> I argue that this is the LSM and audit subsystems designed right: in the long run it could allow everything that LSM does at the moment - and so much more ...
>
> Now you're proposing a redesign of the security subsystem. That's a significant undertaking.
>
> In the meantime, we have a simple, well-defined enhancement to seccomp which will be very useful to current users in reducing their kernel attack surface.

Well, you can do the same with subterfugue, even without kernel changes. But that's ptrace -- slow. (And it already shows that syscall based filters are extremely tricky to configure).

If you want speed, seccomp+server for non-permitted operations seems like a reasonable way.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Pavel Machek pa...@ucw.cz wrote:

> On Mon 2011-05-16 10:36:05, James Morris wrote:
>
>> On Fri, 13 May 2011, Ingo Molnar wrote:
>>
>>>> How do you reason about the behavior of the system as a whole?
>>>
>>> I argue that this is the LSM and audit subsystems designed right: in the long run it could allow everything that LSM does at the moment - and so much more ...
>>
>> Now you're proposing a redesign of the security subsystem. That's a significant undertaking.
>>
>> In the meantime, we have a simple, well-defined enhancement to seccomp which will be very useful to current users in reducing their kernel attack surface.
>
> Well, you can do the same with subterfugue, even without kernel changes. But that's ptrace -- slow. (And it already shows that syscall based filters are extremely tricky to configure).

Yes, if you use syscall based filters to implement access to underlying objects where the access methods do not capture essential lifetime events properly (such as files) then you'll quickly run into trouble achieving a secure solution.

But you can robustly use syscall filters to control the underlying primary *resource*: various pieces of kernel code with *negative* utility to the current app - which have no use to the app but pose risks in terms of potential exploits in them.

But you can use event filters to implement arbitrary security policies robustly.

For example file objects: if you generate the right events for a class of objects then you can control access to them very robustly.

It's not a surprise that this is what SELinux does primarily: it has lifetime event hooks at the inode object (and socket, packet, etc.) level and captures those access attempts and validates them against the permissions of that object, in light of the accessing task's credentials.

Exactly that can be done with Will's patch as well, if its potential scope of event-checking points is not stupidly limited to the syscall boundary alone ...

Thanks,

Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Thomas Gleixner t...@linutronix.de wrote:

>>> We do _NOT_ make any decision based on the trace point so what's the pre-existing active role in the syscall entry code?
>>
>> The seccomp code we are discussing in this thread.
>
> That's proposed code and has absolutely nothing to do with the existing trace point semantics.

So because it's proposed code it does not exist?

If the feature is accepted (and given Linus's opinion it's not clear at all it's accepted in any form) then it's obviously a very legitimate technical concern whether we do:

	ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
	if (ret)
		return -EACCES;

	... random code ...

	trace_syscall_event(p1, p2, p3, p4, p5);

Where seccomp_check_syscall_event() duplicates much of the machinery that is behind trace_syscall_event().

Or we do the more intelligent:

	ret = check_syscall_event(p1, p2, p3, p4, p5);
	if (ret)
		return -EACCES;

Where we have the happy side effects of:

 - less code at the call site

 - (a lot of!) shared infrastructure between the proposed seccomp code and event filters.

 - we'd also be able to trace at security check boundaries - which has obvious bug analysis advantages.

In fact i do not see *any* advantages in keeping this needlessly bloaty and needlessly inconsistently sampled form of instrumentation:

	ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
	if (ret)
		return -EACCES;

	... random code ...

	trace_syscall_event(p1, p2, p3, p4, p5);

Do you?

Thanks,

Ingo
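The combined call-site form argued for above can be sketched in user space. This is a minimal illustration, not kernel code: `struct syscall_event`, `struct task_policy`, and the bitmap policy are hypothetical stand-ins, and only the function names `check_syscall_event()` and `trace_syscall_event()` are borrowed from the mail. It shows a single entry point that records the event and returns a verdict, so the trace sample is taken exactly at the check boundary.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the data carried by one syscall event. */
struct syscall_event {
	int  nr;       /* syscall number */
	long args[5];  /* p1..p5 from the mail */
};

/* Hypothetical per-task state: an allow bitmap plus an observer counter. */
struct task_policy {
	uint64_t      allowed;  /* bit n set => syscall n is permitted */
	unsigned long traced;   /* number of events the observer has seen */
};

/* Pure observation: records the event, never influences control flow. */
void trace_syscall_event(struct task_policy *p, const struct syscall_event *ev)
{
	(void)ev;
	p->traced++;
}

/*
 * Combined form: observe first, then decide at the same point.
 * Returns 0 if the syscall may proceed, -EACCES otherwise.
 */
int check_syscall_event(struct task_policy *p, const struct syscall_event *ev)
{
	trace_syscall_event(p, ev);

	if (ev->nr < 0 || ev->nr >= 64)
		return -EACCES;  /* outside the toy bitmap: deny */
	if (!(p->allowed & (UINT64_C(1) << ev->nr)))
		return -EACCES;
	return 0;
}
```

A caller would do `ret = check_syscall_event(&policy, &ev); if (ret) return ret;` and get the trace sample as a side effect. Whether that coupling is desirable is exactly what the thread disputes; the sketch only shows what the proposal looks like.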
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Tue, 24 May 2011, Ingo Molnar wrote:

> * Peter Zijlstra pet...@infradead.org wrote:
>
>> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>>>  include/linux/ftrace_event.h  |    4 +-
>>>  include/linux/perf_event.h    |   10 +---
>>>  kernel/perf_event.c           |   49 +---
>>>  kernel/seccomp.c              |    8 ++
>>>  kernel/trace/trace_syscalls.c |   27 +-
>>>  5 files changed, 82 insertions(+), 16 deletions(-)
>>
>> I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).
>
> I'd object to invisible side-effects as well, and vehemently so. But note how intelligently it's used here: it's explicit in the code, it's used explicitly in kernel/seccomp.c and the event generation place in kernel/trace/trace_syscalls.c.
>
> So this is a really flexible solution IMO and does not extend events with some invisible 'active' role. It extends the *call site* with an open-coded active role - which active role btw. already pre-existed.

We do _NOT_ make any decision based on the trace point so what's the pre-existing active role in the syscall entry code?

I'm all for code reuse and reuse of interfaces, but this is completely wrong. Instrumentation and security decisions are two fundamentally different things and we want them kept separate. Instrumentation is not meant to make decisions. Just because we can does not mean that it is a good idea.

So what the current approach does is:

 - abuse the existing ftrace syscall hook by adding a return value to the tracepoint.

   So we need to propagate that for every tracepoint just because we have a single user.

 - abuse the perf per task mechanism

   Just because we have per task context in perf does not mean that we pull everything and the world which requires per task context into perf. The security folks have per task context already so security related stuff wants to go there.

 - abuse the perf/ftrace interfaces

   One of the arguments was that perf and ftrace have permissions which are not available from the existing security interfaces. That's not at all a good reason to abuse these interfaces. Let the security folks sort out the problem on their end and do not impose any expectations on perf/ftrace which we have to carry around forever.

Yes, it can be made working with a relatively small patch, but it has a very nasty side effect:

You add another user space visible ABI to the existing perf/ftrace mess which needs to be supported forever.

Brilliant, we have already two ABIs (perf/ftrace) to support and at the same time we urgently need to solve the problem of better integration of those two. So adding a third completely unrelated component with a guaranteed ABI is just making this even more complex.

We can factor out the filtering code and let the security dudes reuse it for their own purposes. That makes them to have their own interfaces and does not impose any restrictions upon the tracing/perf ones. And really security stuff wants to be integrated into the existing security frameworks and not duct taped into perf/trace just because it's a convenient hack around limitations of the existing security stuff.

You really should stop seeing everything as a nail just because the only tool you have handy is the perf hammer. perf is about instrumentation and we don't want to violate the oldest principle of unix to have simple tools which do one thing and do it well.

Even swiss army knives have the restriction that you can use only one tool at a time unless you want to stick the corkscrew through your palm when you try to cut bread.

Thanks,

	tglx
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Thomas Gleixner t...@linutronix.de wrote:

> On Tue, 24 May 2011, Ingo Molnar wrote:
>
>> * Peter Zijlstra pet...@infradead.org wrote:
>>
>>> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>>>>  include/linux/ftrace_event.h  |    4 +-
>>>>  include/linux/perf_event.h    |   10 +---
>>>>  kernel/perf_event.c           |   49 +---
>>>>  kernel/seccomp.c              |    8 ++
>>>>  kernel/trace/trace_syscalls.c |   27 +-
>>>>  5 files changed, 82 insertions(+), 16 deletions(-)
>>>
>>> I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).
>>
>> I'd object to invisible side-effects as well, and vehemently so. But note how intelligently it's used here: it's explicit in the code, it's used explicitly in kernel/seccomp.c and the event generation place in kernel/trace/trace_syscalls.c.
>>
>> So this is a really flexible solution IMO and does not extend events with some invisible 'active' role. It extends the *call site* with an open-coded active role - which active role btw. already pre-existed.
>
> We do _NOT_ make any decision based on the trace point so what's the pre-existing active role in the syscall entry code?

The seccomp code we are discussing in this thread.

> I'm all for code reuse and reuse of interfaces, but this is completely wrong. Instrumentation and security decisions are two fundamentally different things and we want them kept separate. Instrumentation is not meant to make decisions. Just because we can does not mean that it is a good idea.

Instrumentation does not 'make decisions': the calling site, which is already emitting both the event and wants to do decisions based on the data that also generates the event wants to do decisions.

Those decisions *will be made* and you cannot prevent that, the only open question is can it reuse code intelligently, which code it is btw. already calling for observation reasons?

( Note that pure observers won't be affected and note that pure observation call sites are not affected either. )

> So what the current approach does is:
>
>  - abuse the existing ftrace syscall hook by adding a return value to the tracepoint.
>
>    So we need to propagate that for every tracepoint just because we have a single user.

This is a technical criticism i share with you and i think it can be fixed - i outlined it to Will yesterday.

And no, if done cleanly it's not 'abuse' to reuse code. Could we wait for the first clean iteration of this patch instead of rushing judgement prematurely?

>  - abuse the perf per task mechanism
>
>    Just because we have per task context in perf does not mean that we pull everything and the world which requires per task context into perf. The security folks have per task context already so security related stuff wants to go there.

We do not pull 'everything and the world' in, but code that wants to process events in places that already emit events surely sounds related to me :-)

>  - abuse the perf/ftrace interfaces
>
>    One of the arguments was that perf and ftrace have permissions which are not available from the existing security interfaces. That's not at all a good reason to abuse these interfaces. Let the security folks sort out the problem on their end and do not impose any expectations on perf/ftrace which we have to carry around forever.
>
> Yes, it can be made working with a relatively small patch, but it has a very nasty side effect:
>
> You add another user space visible ABI to the existing perf/ftrace mess which needs to be supported forever.

What mess? I'm not aware of a mess - other than the ftrace API which is not used by this patch.

> Brilliant, we have already two ABIs (perf/ftrace) to support and at the same time we urgently need to solve the problem of better integration of those two. So adding a third completely unrelated component with a guaranteed ABI is just making this even more complex.

So your solution is to add yet another ABI for seccomp and to keep seccomp a limited hack forever, just because you are not interested in security?

I think we want fewer ABIs and more flexible/reusable facilities.

> We can factor out the filtering code and let the security dudes reuse it for their own purposes. That makes them to have their own interfaces and does not impose any restrictions upon the tracing/perf ones. And really security stuff wants to be integrated into the existing security frameworks and not duct taped into perf/trace just because it's a convenient hack around limitations of the existing security stuff.

You are missing what i tried to point out in earlier discussions: from a security design POV this isn't just about the system call boundary. If this seccomp variant is based on events then it could grow proper security checks in other places as well, in places where we have some
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Wed, 2011-05-25 at 17:01 +0200, Ingo Molnar wrote:

>> We do _NOT_ make any decision based on the trace point so what's the pre-existing active role in the syscall entry code?
>
> The seccomp code we are discussing in this thread.

That isn't pre-existing, that's proposed.

But face it, you can argue until you're blue in the face, but both tglx and I will NAK any and all patches that extend perf/ftrace beyond the passive observing role.

Your arguments appear to be as non-persuasive to us as ours are to you, so please drop this endeavor and let the security folks sort it on their own and let's get back to doing useful work.
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Wed, 25 May 2011, Ingo Molnar wrote:

> * Thomas Gleixner t...@linutronix.de wrote:
>
>> On Tue, 24 May 2011, Ingo Molnar wrote:
>>
>>> * Peter Zijlstra pet...@infradead.org wrote:
>>>
>>>> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>>>>>  include/linux/ftrace_event.h  |    4 +-
>>>>>  include/linux/perf_event.h    |   10 +---
>>>>>  kernel/perf_event.c           |   49 +---
>>>>>  kernel/seccomp.c              |    8 ++
>>>>>  kernel/trace/trace_syscalls.c |   27 +-
>>>>>  5 files changed, 82 insertions(+), 16 deletions(-)
>>>>
>>>> I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).
>>>
>>> I'd object to invisible side-effects as well, and vehemently so. But note how intelligently it's used here: it's explicit in the code, it's used explicitly in kernel/seccomp.c and the event generation place in kernel/trace/trace_syscalls.c.
>>>
>>> So this is a really flexible solution IMO and does not extend events with some invisible 'active' role. It extends the *call site* with an open-coded active role - which active role btw. already pre-existed.
>>
>> We do _NOT_ make any decision based on the trace point so what's the pre-existing active role in the syscall entry code?
>
> The seccomp code we are discussing in this thread.

That's proposed code and has absolutely nothing to do with the existing trace point semantics.

>> I'm all for code reuse and reuse of interfaces, but this is completely wrong. Instrumentation and security decisions are two fundamentally different things and we want them kept separate. Instrumentation is not meant to make decisions. Just because we can does not mean that it is a good idea.
>
> Instrumentation does not 'make decisions': the calling site, which is already emitting both the event and wants to do decisions based on the data that also generates the event wants to do decisions.

You can repeat that as often as you want, it does not make it more true. Fact is that the decision is made in the middle of the perf code.

+	/* Transition the task if required. */
+	if (ctx->type == task_context && event->attr.require_secure) {
+#ifdef CONFIG_SECCOMP
+		/* Don't allow perf events to escape mode = 1. */
+		if (!current->seccomp.mode)
+			current->seccomp.mode = 2;
+#endif
+	}

and further down

+	if (event->attr.err_on_discard)
+		ok = -EACCES;

> Those decisions *will be made* and you cannot prevent that, the only open question is can it reuse code intelligently, which code it is btw. already calling for observation reasons?

The tracepoint is called for observation reasons and now you make it a decision function. That's what I call abuse.

> ( Note that pure observers won't be affected and note that pure observation call sites are not affected either. )

Hahaha, they still have to run through the additional code when seccomp is enabled and we still have to propagate the return value down to the point where the tracepoint itself is. You call that not affected?

>> So what the current approach does is:
>>
>>  - abuse the existing ftrace syscall hook by adding a return value to the tracepoint.
>>
>>    So we need to propagate that for every tracepoint just because we have a single user.
>
> This is a technical criticism i share with you and i think it can be fixed - i outlined it to Will yesterday.
>
> And no, if done cleanly it's not 'abuse' to reuse code. Could we wait for the first clean iteration of this patch instead of rushing judgement prematurely?

There is no way to do it cleanly. It always comes for the price that you add additional code into the tracing code path. And there are other people who try hard to remove stuff to reduce the overhead which is caused by instrumentation.

>>  - abuse the perf per task mechanism
>>
>>    Just because we have per task context in perf does not mean that we pull everything and the world which requires per task context into perf. The security folks have per task context already so security related stuff wants to go there.
>
> We do not pull 'everything and the world' in, but code that wants to process events in places that already emit events surely sounds related to me :-)

We have enough places where different independent parts of the kernel want to hook into for obvious reasons. We have notifiers for those where performance does not matter much and we have separate calls into the independent functions where it matters or where we need to evaluate the results in specific ways.

So now you turn instrumentation into a security mechanism, which works nicely for a particular purpose, i.e. decision on a particular syscall number.

Now, how do you make that work when a decision has to be made on more than a simple match, e.g. syscall number + arguments ? Not
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Thu, May 19, 2011 at 4:05 PM, Will Drewry w...@chromium.org wrote:

> On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt rost...@goodmis.org wrote:
>> On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
>>
>>> Do event_* that return non-void exist in the tree at all now? I've looked at the various tracepoint macros as well as some of the other handlers (trace_function, perf_tp_event, etc) and I'm not seeing any places where a return value is honored nor could be. At best, the perf_tp_event can be short-circuited in the hlist_for_each, but it'd still need a way to bubble up a failure and result in not calling the trace/event that the hook precedes.
>>
>> No, none of the current trace hooks have return values. That was what I was talking about how to implement in my previous emails.

Led on by complete ignorance, I think I'm finally beginning to untwist the different pieces of the tracing infrastructure. Unfortunately, that means I took a few wrong turns along the way...

I think function tracing looks something like this:

ftrace_call has been injected at a specific callsite. Upon hit:

1. ftrace_call triggers
2. does some checks then calls ftrace_trace_function (via mcount instrs)
3. ftrace_trace_function may be a single func or a list. For a list it will be: ftrace_list_func
4. ftrace_list_func calls each registered hook for that function in a while loop ignoring return values
5. registered hook funcs may then track the call, farm it out to specific sub-handlers, etc.

This seems to be a red herring for my use case :/ though this helped me understand your back and forth (Ingo and Steve) regarding dynamic versus explicit events.

System call tracing is done via kernel/tracepoint.c events fed in via arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. This yields direct sys_enter and sys_exit event sources (and an event class to hook up per-system call events). This means that ftrace_syscall_enter() does the event prep prior to doing a filter check comparing the ftrace_event object for the given syscall_nr to the event data. perf_sysenter_enter() is similar but it pushes the info over to perf_tp_event to be matched not against the global syscall event entry, but against any sub-event in the linked list on that syscall's event.

Is that roughly an accurate representation of the two? I wish I hadn't gotten distracted along the function path, but at least I learned something (and it is relevant to the larger scope of this thread's discussion).

After doing that digging, it looks like providing hook call pre-emption and return value propagation will be a unique and fun task for each tracer and event subsystem. If I look solely at tracepoints, a generic change would be to make the trace_##subsys function return an int (which I think was the event_vfs_getname proposal?). The other option is to change the trace_sys_enter proto to include a 'int *retcode'.

That change would allow the propagation of some sort of policy information. To put it to use, seccomp mode 1 could be implemented on top of trace_syscalls. The following changes would need to happen:

1. dummy metadata should be inserted for all unhooked system calls
2. perf_trace_buf_submit would need to return an int or a new TRACE_REG_SECCOMP_REGISTER handler would need to be setup in syscall_enter_register.
3. If perf is abused, a kill/fail_on_discard bit would be added to event->attrs.
4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches, 'n' for the number of event matches, and -EACCES/? if a 'fail_on_discard' event is seen.
5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
6. trace_sys_enter() would need to be moved to be the first entry in arch/../kernel/ptrace.c for incoming syscalls
7. if trace_sys_enter() yields a negative return code, then do_exit(SIGKILL) the process and return.

Entering into seccomp mode 1 would require adding a 0 filter for every system call number (which is why we need a dummy event call for them since failing to check the bitmask can't be flagged fail_on_discard) with the fail_on_discard bit. For the three calls that are allowed, a '1' filter would be set.

That would roughly implement seccomp mode 1. It's pretty ugly and the fact that every system call that's disallowed has to be blacklisted is not ideal. An alternate model would be to just use the seccomp mode as we do today and let secure_computing() handle the return code of # of matches. If the # of matches is 0, it terminates. A 'fail_on_discard' bit then would only be good to stop further tracepoint callback evaluation. This approach would also *almost* nix the need to provide dummy syscall hooks. (Since sigreturn isn't hooked on x86 because it uses ptregs fixup, a dummy would still be needed to apply a 1 filter to.)

Even with that tweak to move to a whitelist model, the perf event evaluation and tracepoint callback ordering is
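The return-value convention from step 4 and the whitelist variant described above can be sketched in plain C. This is a user-space toy under stated assumptions: `struct filter`, `submit_event()`, and `whitelist_permits()` are hypothetical names, and the '0' and '1' filters are modeled as predicates that always reject or always accept. None of this is the perf or ftrace API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical filter attached to one syscall number. */
struct filter {
	int  syscall_nr;
	bool (*match)(const long *args);  /* predicate over the arguments */
	bool fail_on_discard;             /* a rejected event becomes an error */
};

/* The "1" and "0" filters from the mail: always accept / always reject. */
bool filter_one(const long *args)  { (void)args; return true; }
bool filter_zero(const long *args) { (void)args; return false; }

/*
 * Returns the number of filters that matched the event (0 if none),
 * or -EACCES as soon as a fail_on_discard filter rejects it.
 */
int submit_event(const struct filter *f, size_t n, int nr, const long *args)
{
	int matches = 0;

	for (size_t i = 0; i < n; i++) {
		if (f[i].syscall_nr != nr)
			continue;
		if (f[i].match(args))
			matches++;
		else if (f[i].fail_on_discard)
			return -EACCES;
	}
	return matches;
}

/* Whitelist model: only a positive match count lets the call through. */
bool whitelist_permits(int rc)
{
	return rc > 0;
}
```

In the blacklist reading, every disallowed syscall needs its own `filter_zero` entry with `fail_on_discard` set. In the whitelist reading, a syscall with no filter at all already yields 0 and is refused, so only the permitted calls need entries.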
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>  include/linux/ftrace_event.h  |    4 +-
>  include/linux/perf_event.h    |   10 +---
>  kernel/perf_event.c           |   49 +---
>  kernel/seccomp.c              |    8 ++
>  kernel/trace/trace_syscalls.c |   27 +-
>  5 files changed, 82 insertions(+), 16 deletions(-)

I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Tue, 24 May 2011, Peter Zijlstra wrote:

> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>>  include/linux/ftrace_event.h  |    4 +-
>>  include/linux/perf_event.h    |   10 +---
>>  kernel/perf_event.c           |   49 +---
>>  kernel/seccomp.c              |    8 ++
>>  kernel/trace/trace_syscalls.c |   27 +-
>>  5 files changed, 82 insertions(+), 16 deletions(-)
>
> I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).

Amen. We have enough crap to cleanup in perf/ftrace already, so we really do not need security magic added to it.

Thanks,

	tglx
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Tue, May 24, 2011 at 11:25 AM, Thomas Gleixner t...@linutronix.de wrote:

> On Tue, 24 May 2011, Peter Zijlstra wrote:
>
>> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>>>  include/linux/ftrace_event.h  |    4 +-
>>>  include/linux/perf_event.h    |   10 +---
>>>  kernel/perf_event.c           |   49 +---
>>>  kernel/seccomp.c              |    8 ++
>>>  kernel/trace/trace_syscalls.c |   27 +-
>>>  5 files changed, 82 insertions(+), 16 deletions(-)
>>
>> I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).
>
> Amen. We have enough crap to cleanup in perf/ftrace already, so we really do not need security magic added to it.

Thanks for the quick responses!

I agree, but I'm left a little bit lost now w.r.t. the comments around reusing the ABI. If perf doesn't make sense (which certainly seems wrong from a security interface perspective), then the preexisting ABIs I know of for this case are as follows:

- /sys/kernel/debug/tracing/*
- prctl(PR_SET_SECCOMP* (or /proc/...)

Both would require expansion. The latter was reused by the original patch series. The former doesn't expose much in the way of per-task event filtering -- ftrace_pids doesn't translate well to ftrace_syscall_enter-based enforcement. I'd expect we'd need ftrace_event_call->task_events (like ->perf_events), and either explore them in ftrace_syscall_enter or add a new tracepoint handler, ftrace_task_syscall_enter, via something like TRACE_REG_TASK_REGISTER. It could then do whatever it wanted with the successful or unsuccessful matching against predicates, stacking or not, which could be used for a seccomp-like mechanism. However, bubbling that change up to the non-existent interfaces in debug/tracing could be a challenge too (Registration would require an alternate flow like perf to call TRACE_REG_*? Do they become tracing/events/subsystem/event/task/tid/filter_string_N? ...?).

This is all just a matter of programming... but at this point, I'm not seeing the clear shared path forward. Even with per-task ftrace access in debug/tracing, that would introduce a reasonably large change to the system and add a new ABI, albeit in debug/tracing. If the above (or whatever the right approach is) comes into existence, then any prctl(PR_SET_SECCOMP) ABI could have the backend implementation to modify the same data. I'm not putting it like this to say that I'm designing to be obsolete, but to show that the defined interface wouldn't conflict if ftrace does overlap more in the future. Given the importance of a clearly defined interface for security functionality, I'd be surprised to see all the pieces come together in the near future in such a way that a transition would be immediately possible -- I'm not even sure what the ftrace roadmap really is!

Would it be more desirable to put a system call filtering interface on a miscdev (like /dev/syscall_filter) instead of in /proc or prctl (and not reuse seccomp at all)? I'm not clear what the onus is to justify a change in the different ABI areas, but I see system call filtering as an important piece of system security and would like to determine if there is a viable path forward, or if this will need to be revisited in another 2 years.

thanks again!
will
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote:

> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>>  include/linux/ftrace_event.h  |    4 +-
>>  include/linux/perf_event.h    |   10 +---
>>  kernel/perf_event.c           |   49 +---
>>  kernel/seccomp.c              |    8 ++
>>  kernel/trace/trace_syscalls.c |   27 +-
>>  5 files changed, 82 insertions(+), 16 deletions(-)
>
> I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).

I'd object to invisible side-effects as well, and vehemently so. But note how intelligently it's used here: it's explicit in the code, it's used explicitly in kernel/seccomp.c and the event generation place in kernel/trace/trace_syscalls.c.

So this is a really flexible solution IMO and does not extend events with some invisible 'active' role. It extends the *call site* with an open-coded active role - which active role btw. already pre-existed.

Thanks,

Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Will Drewry w...@chromium.org wrote: The change avoids defining a new trace call type or a huge number of internal changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack surface is non-trivial to verify, and I'm not sure if this ABI change makes sense. It amounts to: include/linux/ftrace_event.h |4 +- include/linux/perf_event.h| 10 +--- kernel/perf_event.c | 49 +--- kernel/seccomp.c |8 ++ kernel/trace/trace_syscalls.c | 27 +- 5 files changed, 82 insertions(+), 16 deletions(-) And can be found here: http://static.dataspill.org/perf_secure/v1/ Wow, i'm very impressed how few changes you needed to do to support this! So, firstly, i don't think we should change perf_tp_event() at all - the 'observer' codepaths should be unaffected. But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that code like seccomp which wants to use event results can use. Also, i'm not sure about the seccomp details and assumptions that were moved into the perf core. How about passing in a helper function to perf_tp_event_check(), where seccomp would define its seccomp specific helper function? That looks sufficiently flexible. That helper function could be an 'extra filter' kind of thing, right? Also, regarding the ABI and the attr.err_on_discard and attr.require_secure bits, they look a bit too specific as well. attr.err_on_discard: with the filter helper function passed in this is probably not needed anymore, right? attr.require_secure: this is basically used to *force* the creation of security-controlling filters, right? It seems to me that this could be done via a seccomp ABI extension as well, without adding this to the perf ABI. That seccomp call could check whether the right events are created and move the task to mode 2 only if that prereq is met - or something like that. If there is any interest at all, I can post it properly to this giant CC list. [...] 
I'd suggest to trim the Cc: list aggressively - anyone interested in the discussion can pick it up on lkml - and i strongly suspect that most of the Cc: participants would want to be off the Cc: :-) Thanks, Ingo
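A minimal userspace sketch of the suggestion in this message, assuming nothing about the real perf internals: the observer path stays void and untouched, while a separate checking entry point takes a caller-supplied helper (the 'extra filter') and hands its verdict back to the call site. Every name here (model_tp_event, model_tp_event_check, seccomp_check, struct allow_list) is hypothetical and only stands in for the perf_tp_event() / perf_tp_event_check() pair discussed above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical record describing one syscall-entry event. */
struct event_record {
    int syscall_nr;
};

/* Caller-supplied 'extra filter': 0 allows, a negative errno denies. */
typedef int (*event_check_fn)(const struct event_record *rec, void *ctx);

/* Counts events seen by the passive observer path. */
static unsigned long observed_events;

/* Observer path: records the event and returns nothing to the caller. */
static void model_tp_event(const struct event_record *rec)
{
    (void)rec;
    observed_events++;
}

/*
 * Checking path: observers still run unchanged, then the helper passed
 * in by the call site decides the result. No seccomp knowledge lives here.
 */
static int model_tp_event_check(const struct event_record *rec,
                                event_check_fn check, void *ctx)
{
    assert(rec != NULL);
    model_tp_event(rec);
    if (check == NULL)
        return 0;
    return check(rec, ctx);
}

/* A seccomp-style helper defined by the caller: allow listed syscalls only. */
struct allow_list {
    const int *nrs;
    size_t count;
};

static int seccomp_check(const struct event_record *rec, void *ctx)
{
    const struct allow_list *al = ctx;

    for (size_t i = 0; i < al->count; i++)
        if (al->nrs[i] == rec->syscall_nr)
            return 0;
    return -EACCES;
}
```

A call site would use it as `if (model_tp_event_check(&rec, seccomp_check, &al) < 0)` followed by its own policy action. The point of the split is that the policy lives with the caller, so the core only needs to know how to run a helper and return its result.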
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Ingo Molnar mi...@elte.hu wrote: * Peter Zijlstra pet...@infradead.org wrote: On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote: include/linux/ftrace_event.h |4 +- include/linux/perf_event.h| 10 +--- kernel/perf_event.c | 49 +--- kernel/seccomp.c |8 ++ kernel/trace/trace_syscalls.c | 27 +- 5 files changed, 82 insertions(+), 16 deletions(-) I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter). I'd object to invisible side-effects as well, and vehemently so. But note how intelligently it's used here: it's explicit in the code, it's used explicitly in kernel/seccomp.c and the event generation place in kernel/trace/trace_syscalls.c. So this is a really flexible solution IMO and does not extend events with some invisible 'active' role. It extends the *call site* with an open-coded active role - which active role btw. already pre-existed. Also see my other mail - i think this seccomp code is too tied in to the perf core and ABI - but this is fixable IMO. The fundamental notion that a generator subsystem of events (such as kernel/trace/trace_syscalls.c) can use filter results as well for its own purposes is pretty robust though. Thanks, Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Tue, 2011-05-24 at 22:08 +0200, Ingo Molnar wrote: * Will Drewry w...@chromium.org wrote: But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that code like seccomp which wants to use event results can use. We should name it something else. The perf_tp.. is a misnomer as it has nothing to do with performance monitoring. dynamic_event_.. maybe, as it is dynamic to the effect that we can use jump labels to enable or disable it. -- Steve
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote: Do event_* that return non-void exist in the tree at all now? I've looked at the various tracepoint macros as well as some of the other handlers (trace_function, perf_tp_event, etc) and I'm not seeing any places where a return value is honored nor could be. At best, the perf_tp_event can be short-circuited in the hlist_for_each, but it'd still need a way to bubble up a failure and result in not calling the trace/event that the hook precedes. No, none of the current trace hooks have return values. That was what I was talking about how to implement in my previous emails. -- Steve
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt rost...@goodmis.org wrote: On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote: Do event_* that return non-void exist in the tree at all now? I've looked at the various tracepoint macros as well as some of the other handlers (trace_function, perf_tp_event, etc) and I'm not seeing any places where a return value is honored nor could be. At best, the perf_tp_event can be short-circuited in the hlist_for_each, but it'd still need a way to bubble up a failure and result in not calling the trace/event that the hook precedes. No, none of the current trace hooks have return values. That was what I was talking about how to implement in my previous emails. Led on by complete ignorance, I think I'm finally beginning to untwist the different pieces of the tracing infrastructure. Unfortunately, that means I took a few wrong turns along the way... I think function tracing looks something like this: ftrace_call has been injected at a specific callsite. Upon hit:
1. ftrace_call triggers
2. does some checks then calls ftrace_trace_function (via mcount instrs)
3. ftrace_trace_function may be a single func or a list. For a list it will be: ftrace_list_func
4. ftrace_list_func calls each registered hook for that function in a while loop ignoring return values
5. registered hook funcs may then track the call, farm it out to specific sub-handlers, etc.
This seems to be a red herring for my use case :/ though this helped me understand your back and forth (Ingo & Steve) regarding dynamic versus explicit events. System call tracing is done via kernel/tracepoint.c events fed in via arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. This yields direct sys_enter and sys_exit event sources (and an event class to hook up per-system call events). This means that ftrace_syscall_enter() does the event prep prior to doing a filter check comparing the ftrace_event object for the given syscall_nr to the event data.
perf_sysenter_enter() is similar but it pushes the info over to perf_tp_event to be matched not against the global syscall event entry, but against any sub-event in the linked list on that syscall's event. Is that roughly an accurate representation of the two? I wish I hadn't gotten distracted along the function path, but at least I learned something (and it is relevant to the larger scope of this thread's discussion). After doing that digging, it looks like providing hook call pre-emption and return value propagation will be a unique and fun task for each tracer and event subsystem. If I look solely at tracepoints, a generic change would be to make the trace_##subsys function return an int (which I think was the event_vfs_getname proposal?). The other option is to change the trace_sys_enter proto to include a 'int *retcode'. That change would allow the propagation of some sort of policy information. To put it to use, seccomp mode 1 could be implemented on top of trace_syscalls. The following changes would need to happen: 1. dummy metadata should be inserted for all unhooked system calls 2. perf_trace_buf_submit would need to return an int or a new TRACE_REG_SECCOMP_REGISTER handler would need to be setup in syscall_enter_register. 3. If perf is abused, a kill/fail_on_discard bit would be added to event-attrs. 4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches, 'n' for the number of event matches, and -EACCES/? if a 'fail_on_discard' event is seen. 5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode 6. trace_sys_enter() would need to be moved to be the first entry arch/../kernel/ptrace.c for incoming syscalls 7. if trace_sys_enter() yields a negative return code, then do_exit(SIGKILL) the process and return. 
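A rough userspace model of steps 3-7 above, offered only as a sketch: each event is tied to one syscall number, carries a constant filter result in place of a real predicate, and may carry the proposed fail_on_discard bit. The names model_buf_submit, model_sys_enter_kills and struct model_event are hypothetical stand-ins for perf_trace_buf_submit(), the trace_sys_enter() call site and the perf event, not the kernel's own code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-syscall event with a constant filter. */
struct model_event {
    int syscall_nr;        /* syscall whose event list this entry sits on */
    bool filter_result;    /* true models a matching filter, false a discard */
    bool fail_on_discard;  /* the attr bit proposed in step 3 */
};

/*
 * Step 4: return 0 for no matches, n for n matches, and -EACCES as soon
 * as an event flagged fail_on_discard discards the call.
 */
static int model_buf_submit(const struct model_event *events, size_t count,
                            int syscall_nr)
{
    int matches = 0;

    assert(events != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (events[i].syscall_nr != syscall_nr)
            continue;               /* not attached to this syscall */
        if (events[i].filter_result)
            matches++;
        else if (events[i].fail_on_discard)
            return -EACCES;
    }
    return matches;
}

/*
 * Steps 5-7: the entry hook takes the submit result as its retcode and
 * the task is killed when that retcode is negative.
 */
static bool model_sys_enter_kills(const struct model_event *events,
                                  size_t count, int syscall_nr)
{
    int retcode = model_buf_submit(events, count, syscall_nr);

    return retcode < 0;
}
```

Note what happens for a syscall with no event attached at all: the result is 0, which is not negative, so the call is not killed. That gap is why step 1 asks for dummy metadata on every unhooked system call in this model.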
Entering into seccomp mode 1 would require adding a 0 filter for every system call number (which is why we need a dummy event call for them since failing to check the bitmask can't be flagged fail_on_discard) with the fail_on_discard bit. For the three calls that are allowed, a '1' filter would be set. That would roughly implement seccomp mode 1. It's pretty ugly and the fact that every system call that's disallowed has to be blacklisted is not ideal. An alternate model would be to just use the seccomp mode as we do today and let secure_computing() handle the return code of # of matches. If the # of matches is 0, it terminates. A 'fail_on_discard' bit then would only be good to stop further tracepoint callback evaluation. This approach would also *almost* nix the need to provide dummy syscall hooks. (Since sigreturn isn't hooked on x86 because it uses ptregs fixup, a dummy would still be needed to apply a 1 filter to.) Even with that tweak to move to a whitelist model, the perf event evaluation and tracepoint callback ordering is still not guaranteed. Without changing tracepoint itself, all other TPs will still execute. And for perf events, it'll be first-come-first-serve
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Tue, May 17, 2011 at 6:19 AM, Ingo Molnar mi...@elte.hu wrote: * Steven Rostedt rost...@goodmis.org wrote: On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote: * Steven Rostedt rost...@goodmis.org wrote: On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote: * Steven Rostedt rost...@goodmis.org wrote: I'm a bit nervous about the 'active' role of (trace_)events, because of the way multiple callbacks can be registered. How would: err = event_x(); if (err == -EACCESS) { be handled? [...] The default behavior would be something obvious: to trigger all callbacks and use the first non-zero return value. But how do we know which callback that was from? There's no ordering of what callbacks are called first. We do not have to know that - nor do the calling sites care in general. Do you have some specific usecase in mind where the identity of the callback that generates a match matters? Maybe I'm confused. I was thinking that these event_*() are what we currently call trace_*(), but the event_*(), I assume, can return a value if a call back returns one. Yeah - and the call site can treat it as: - Ugh, if i get an error i need to abort whatever i was about to do or (more advanced future use): - If i get a positive value i need to re-evaluate the parameters that were passed in, they were changed Do event_* that return non-void exist in the tree at all now? I've looked at the various tracepoint macros as well as some of the other handlers (trace_function, perf_tp_event, etc) and I'm not seeing any places where a return value is honored nor could be. At best, the perf_tp_event can be short-circuited it in the hlist_for_each, but it'd still need a way to bubble up a failure and result in not calling the trace/event that the hook precedes. Am I missing something really obvious? I don't feel I've gotten a good handle on exactly how all the tracing code gets triggered, so perhaps I'm still a level (or three) too shallow. 
(I can see the asm hooks for trace functions and I can see where that translates to registered calls - like trace_function - but I don't see how the hooked calls can be trivially aborted). As is, I'm not sure how the perf and ftrace infrastructure could be reused cleanly without a fair number of hacks to the interface and a good bit of reworking. I can already see a number of challenges around reusing the sys_perf_event_open interface and the fact that reimplementing something even as simple as seccomp mode=1 seems to require a fair amount of tweaking to avoid from being leaky. (E.g., enabling all TRACE_EVENT()s for syscalls will miss unhooked syscalls so either acceptance matching needs to be propagated up the stack along with some seccomp-like task modality or seccomp-on-perf would have to depend on sys_enter events with syscall number predicate matching and fail when a filter discard applies to all active events.) At present, I'm leaning back towards the v2 series (plus the requested minor changes) for the benefit of code clarity and its fail-secure behavior. Even just considering the reduced case of seccomp mode 1 being implemented on the shared infrastructure, I feel like I missing something that makes it viable. Any clues? If not, I don't think a seccomp mode 2 interface via prctl would be intractable if the long term movement is to a ftrace/perf backend - it just means that the in-kernel code would change to wrap whatever the final design ended up being. Thanks and sorry if I'm being dense! Thus, we now have the ability to dynamically attach function calls to arbitrary points in the kernel that can have an affect on the code that called it. Right now, we only have the ability to attach function calls to these locations that have passive affects (tracing/profiling). Well, they can only have the effect that the calling site accepts and handles. So the 'effect' is not arbitrary and not defined by the callbacks, it is controlled and handled by the calling code. 
We do not want invisible side-effects, opaque hooks, etc. Instead of that we want (this is the getname() example i cited in the thread) explicit effects, like: if (event_vfs_getname(result)) return ERR_PTR(-EPERM); But you say, nor do the calling sites care in general. Then what do these calling sites do with the return code? Are we limiting these actions to security only? Or can we have some other feature. [...] Yeah, not just security. One other example that came up recently is whether to panic the box on certain (bad) events such as NMI errors. This too could be made flexible via the event filter code: we already capture many events, so places that might conceivably do some policy could do so based on a filter condition. This sounds great - I just wish I could figure out how it'd work :) [...] I can envision that we can make the Linux kernel quite dynamic here
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Steven Rostedt rost...@goodmis.org wrote: On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote: * Steven Rostedt rost...@goodmis.org wrote: I'm a bit nervous about the 'active' role of (trace_)events, because of the way multiple callbacks can be registered. How would: err = event_x(); if (err == -EACCES) { be handled? [...] The default behavior would be something obvious: to trigger all callbacks and use the first non-zero return value. But how do we know which callback that was from? There's no ordering of what callbacks are called first. We do not have to know that - nor do the calling sites care in general. Do you have some specific usecase in mind where the identity of the callback that generates a match matters? Thanks, Ingo
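The default behavior described here (trigger all callbacks, use the first non-zero return value) can be sketched in plain C as follows. This is an illustration under assumptions, not tracepoint code: struct cb_event, cb_event_register and cb_event_fire are hypothetical names, and the fixed-size array stands in for whatever list the real infrastructure would use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* A callback returns 0 to stay passive or a non-zero value to object. */
typedef int (*event_cb)(void *data);

#define MAX_CALLBACKS 8

/* Hypothetical event: a flat list of callbacks in registration order. */
struct cb_event {
    event_cb callbacks[MAX_CALLBACKS];
    size_t count;
};

static void cb_event_register(struct cb_event *ev, event_cb cb)
{
    assert(ev->count < MAX_CALLBACKS);
    ev->callbacks[ev->count++] = cb;
}

/*
 * Run every callback, even after one has objected, and report the first
 * non-zero result seen. The caller learns the value, not which callback
 * produced it.
 */
static int cb_event_fire(struct cb_event *ev, void *data)
{
    int first = 0;

    for (size_t i = 0; i < ev->count; i++) {
        int ret = ev->callbacks[i](data);

        if (first == 0)
            first = ret;
    }
    return first;
}

/* Example callbacks; each bumps a call counter passed in via data. */
static int cb_observe(void *data)   { (*(int *)data)++; return 0; }
static int cb_deny(void *data)      { (*(int *)data)++; return -EACCES; }
static int cb_deny_perm(void *data) { (*(int *)data)++; return -EPERM; }
```

With cb_observe, cb_deny and cb_deny_perm registered in that order, cb_event_fire() runs all three and returns -EACCES. Register cb_deny_perm first and the same call returns -EPERM instead, which is the ordering question raised in the thread: "first" is only as well defined as the registration order.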
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Will Drewry w...@chromium.org wrote: This is *far* more generic still yields the same short-term end result as far as your sandboxing is concerned. Almost :/ [...] Hey that's a pretty good result from a subsystem that was not written with your usecase in mind *at all* ;-) [...] I still need to review the code you've pointed out, but, at present, the ftrace hooks occur after the seccomp and syscall auditing hooks. This means that that code is exposed no matter what in this model. To trim the exposed surface to userspace, we really need those early hooks. While I can see both hacky and less hacky approaches around this, it still strikes me that the seccomp thread flag and early interception are good to reuse. One option might be to allow seccomp to be a secure-syscall event source, but I suspect that lands more on the hack-y side of the fence :) Agreed, there should be no security compromise imposed on your usecase, at all. You could move the event callback sooner into the syscall-entry sequence to make sure it's the highest priority thing to process? There's no semantic dependency on its current location so this can be changed AFAICS. Thanks, Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote: * Steven Rostedt rost...@goodmis.org wrote: On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote: * Steven Rostedt rost...@goodmis.org wrote: I'm a bit nervous about the 'active' role of (trace_)events, because of the way multiple callbacks can be registered. How would: err = event_x(); if (err == -EACCES) { be handled? [...] The default behavior would be something obvious: to trigger all callbacks and use the first non-zero return value. But how do we know which callback that was from? There's no ordering of what callbacks are called first. We do not have to know that - nor do the calling sites care in general. Do you have some specific usecase in mind where the identity of the callback that generates a match matters? Maybe I'm confused. I was thinking that these event_*() are what we currently call trace_*(), but the event_*(), I assume, can return a value if a callback returns one. Thus, we now have the ability to dynamically attach function calls to arbitrary points in the kernel that can have an effect on the code that called it. Right now, we only have the ability to attach function calls to these locations that have passive effects (tracing/profiling). But you say, "nor do the calling sites care in general". Then what do these calling sites do with the return code? Are we limiting these actions to security only? Or can we have some other feature? I can envision that we can make the Linux kernel quite dynamic here with self modifying code. That is, anywhere we have hooks, perhaps we could replace them with dynamic switches (jump labels). Maybe events would not be the best use, but they could be a generic one. Knowing what callback returned the result would be beneficial. Right now, you are saying if the callback returns anything, just abort the call, not knowing what callback was called. -- Steve
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* James Morris jmor...@namei.org wrote: On Tue, 17 May 2011, Ingo Molnar wrote: I'm not sure i get your point. Your example was not complete as described. After an apparently simple specification, you've since added several qualifiers and assumptions, [...] I havent added any qualifiers really (i added examples/description), the opt-in method i mentioned in my first mail should be pretty robust: | Firstly, using the filter code i deny the various link creation syscalls so | that sandboxed code cannot escape for example by creating a symlink to | outside the permitted VFS namespace. (Note: we opt-in to syscalls, that way | new syscalls added by new kernels are denied by defalt. The current symlink | creation syscalls are not opted in to.) [...] and I still doubt that it's complete. I could too claim that i doubt that the SELinux kernel implementation is secure! So how about we both come up with specific examples about how it's not secure, instead of going down the fear-uncertainty-and-doubt road? ;-) A higher level goal would look like Allow a sandbox app access only to approved resources, to contain the effects of flaws in the app, or similar. I see what you mean. I really think that restricting sandboxed code to only open files within a given VFS namespace boundary is the most useful highlevel description here - which is really a subset of a allow a sandbox app access only to an easily approved set of files highlevel concept. There's no to contain ... bit here: *all* of the sandboxed app code is untrusted, so there's no 'remote attacker' and we do not limit our threat to flaws in the app. We want to contain apps to within a small subset of Linux functionality, and we want to do that within regular apps (without having to be superuser), full stop. Note that this includes a threat model (remote attacker taking control of the app) and a general and fully stated strategy for dealing with it. 
Attacker does not have to be remote - most sandboxing concepts protect against locally installed plugins/apps/applets. In sandboxing the whole app is considered untrusted - not just some flaw in it, abused remotely. From there, you can start to analyze how to implement the goal, at which point you'd start thinking about configuration, assumptions, filesystem access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared memory, invocation). Sandboxed code generally does not have access to anything fancy like that - if it is added then all possible side effects have to be examined. Thanks, Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Arnd Bergmann a...@arndb.de wrote: On Saturday 14 May 2011, Will Drewry wrote: Depending on integration, it could even be limited to ioctl commands that are appropriate to a known fd if the fd is opened prior to entering seccomp mode 2. Alternatively, __NR_ioctl could be allowed with a filter of 1 then narrowed through a later addition of something like (fd == %u && (cmd == %u || cmd == %u)) or something along those lines. Does that make sense? Thanks for the explanation. This sounds like it's already doing all we need. One thing we could do more clearly here is to help keep the filter expressions symbolic - i.e. help resolve the various ioctl variants as well, not just the raw syscall parameter numbers. But yes, access to the raw syscall parameters and the ability to filter them already gives us the ability to exclude/include specific ioctls in a rather flexible way. Thanks, Ingo
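As a sketch of what an ioctl filter of the shape fd == A && (cmd == B || cmd == C) evaluates, here is the same predicate written out in plain C. It does not use the ftrace filter engine; struct ioctl_rule and ioctl_allowed are hypothetical names, and the fd and command numbers in any use of it are placeholders rather than values taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical rule: one known fd plus the ioctl commands allowed on it. */
struct ioctl_rule {
    unsigned int fd;
    const unsigned int *cmds;
    size_t cmd_count;
};

/*
 * Evaluates: fd == rule->fd && (cmd == cmds[0] || cmd == cmds[1] || ...).
 * Anything that does not match is treated as discarded.
 */
static bool ioctl_allowed(const struct ioctl_rule *rule,
                          unsigned int fd, unsigned int cmd)
{
    assert(rule != NULL);
    if (fd != rule->fd)
        return false;
    for (size_t i = 0; i < rule->cmd_count; i++)
        if (rule->cmds[i] == cmd)
            return true;
    return false;
}
```

Keeping the command list symbolic, as suggested above, would amount to filling cmds from named constants in the relevant headers rather than from raw numbers.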
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Will Drewry w...@chromium.org wrote: Note, i'm not actually asking for the moon, a pony and more. I fully submit that we are yet far away from being able to do a full LSM via this mechanism. What i'm asking for is that because the syscall point steps taken by Will look very promising so it would be nice to do *that* in a slightly more flexible scheme that does not condemn it to be limited to the syscall boundary and such ... What do you suggest here? From my brief exploration of the ftrace/perf (and seccomp) code, I don't see a clean way of integrating over the existing interfaces to the ftrace framework (e.g., the global perf event pump seems to be a mismatch), but I may be missing something obvious. [...] Well, there's no global perf event pump. Here we'd obviously want to use buffer-less events that do no output (pumping) whatsoever - i.e. just counting events with filters attached. What i suggest is to: - share syscall events# you are fine with that as your patch makes use of them - share the scripting engine # you are fine with that as your patch makes use of it - share *any* other event to do_exit() a process at syscall exit time - share any other active event that kernel developers specifically enable for active use to impact security-relevant execution even sooner than syscall exit time - not just system calls - share the actual facility that manages (sets/gets) filters So right now you have this structure for your new feature: Documentation/trace/seccomp_filter.txt | 75 + arch/x86/kernel/ldt.c |5 arch/x86/kernel/tls.c |7 fs/proc/array.c| 21 + fs/proc/base.c | 25 + include/linux/ftrace_event.h |9 include/linux/seccomp.h| 98 +++ include/linux/syscalls.h | 54 ++-- include/trace/syscall.h|6 kernel/Makefile|1 kernel/fork.c |8 kernel/perf_event.c|7 kernel/seccomp.c | 144 ++- kernel/seccomp_filter.c| 428 + kernel/sys.c | 11 kernel/trace/Kconfig | 10 kernel/trace/trace_events_filter.c | 60 ++-- kernel/trace/trace_syscalls.c | 96 ++- 18 files changed, 986 
insertions(+), 79 deletions(-) Firstly, one specific problem i can see is that kernel/seccomp_filter.c hardcodes to the system call boundary. Which is fine to a prototype implementation (and obviously fine for something that builds upon seccomp) but not fine in terms of a flexible Linux security feature :-) You have hardcoded these syscall assumptions via: struct seccomp_filter { struct list_head list; struct rcu_head rcu; int syscall_nr; struct syscall_metadata *data; struct event_filter *event_filter; }; Which comes just because you chose to enumerate only syscall events - instead of enumerating all events. Instead of that please bind to all events instead - syscalls are just one of the many events we have. Type 'perf list' and see how many event types we have, and quite a few could be enabled for 'active feedback' sandboxing as well. Secondly, and this is really a variant of the first problem you have, the way you process event filter 'failures' is pretty limited. You utilize the regular seccomp method which works by calling into __secure_computing() and silently accepting syscalls or generating a hard do_exit() on even the slightest of filter failures. Instead of that what we'd want to have is to have regular syscall events registered, *and* enabled for such active filtering purposes. The moment the filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and some other attribute in the task and the kernel would do a do_exit() at the earliest of opportunities before calling the syscall or at the return-from-syscall point latest. Note that no seccomp specific code would have to execute here, we can already generate events both at syscall entry and at syscall exit points, the only new bit we'd need is for the 'kill the task' [or abort the syscall] policy action. This is *far* more generic still yields the same short-term end result as far as your sandboxing is concerned. What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'. 
We'd mark all syscall events as 'active' straight away. [ Detail: these trace events return a return code, which the calling code can use, that way event return values could be used sooner than syscall exit points. IRQ code could make use of it as well, so for example this way we could filter based early packet inspection, still in the IRQ code. ] Then what your feature would do is to simply open up the events you are
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Will Drewry w...@chromium.org wrote: I agree with you on many of these points! However, I don't think that the views around LSMs, perf/ftrace infrastructure, or the current seccomp filtering implementation are necessarily in conflict. Here is my understanding of how the different worlds fit together and where I see this patchset living, along with where I could see future work going. Perhaps I'm being a trifle naive, but here goes anyway: 1. LSMs provide a global mechanism for hooking security relevant events at a point where all the incoming user-sourced data has been preprocessed and moved into userspace. The hooks are called every time one of those boundaries are crossed. 2. Perf and the ftrace infrastructure provide global function tracing and system call hooks with direct access to the caller's registers (and memory). No, perf events are not just global but per task as well. Nor are events limited to 'tracing' (generating a flow of events into a trace buffer) - they can just be themselves as well and count and generate callbacks. The generic NMI watchdog uses that kind of event model for example, see kernel/watchdog.c and how it makes use of struct perf_event abstractions to do per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per task events and integrates it with the ptrace hw-breakpoints code. Ideally Peter's one particular suggestion is right IMO and we'd want to be able for a perf_event to just be a list of callbacks, attached to a task and barely more than a discoverable, named notifier chain in its slimmest form. In practice it's fatter than that right now, but we should definitely factor out that aspect of it more clearly, both code-wise and API-wise. kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is possible and desirable. 3. seccomp (as it exists today) provides a global system call entry hook point with a binary per-process decision about whether to provide secure computing behavior. 
When I boil that down to abstractions, I see: A. Globally scoped: LSMs, ftrace/perf B. Locally/process scoped: seccomp Ok, i see where you got the idea that you needed to cut your surface of abstraction at the filter engine / syscall enumeration level - i think you were thinking of it in the ftrace model of tracepoints, not in the perf model of events. No, events are generic and as such per task as well, not just global. I've replied to your other mail with more specific suggestions of how we could provide your feature using abstractions that share code more widely. Talking specifics will i hope help move the discussion forward! :-) Thanks, Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Mon, May 16, 2011 at 7:55 AM, Ingo Molnar mi...@elte.hu wrote: * Will Drewry w...@chromium.org wrote: I agree with you on many of these points! However, I don't think that the views around LSMs, perf/ftrace infrastructure, or the current seccomp filtering implementation are necessarily in conflict. Here is my understanding of how the different worlds fit together and where I see this patchset living, along with where I could see future work going. Perhaps I'm being a trifle naive, but here goes anyway: 1. LSMs provide a global mechanism for hooking security relevant events at a point where all the incoming user-sourced data has been preprocessed and moved into userspace. The hooks are called every time one of those boundaries are crossed. 2. Perf and the ftrace infrastructure provide global function tracing and system call hooks with direct access to the caller's registers (and memory). No, perf events are not just global but per task as well. Nor are events limited to 'tracing' (generating a flow of events into a trace buffer) - they can just be themselves as well and count and generate callbacks. I was looking at the perf_sysenter_enable() call, but clearly there is more going on :) The generic NMI watchdog uses that kind of event model for example, see kernel/watchdog.c and how it makes use of struct perf_event abstractions to do per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per task events and integrates it with the ptrace hw-breakpoints code. Ideally Peter's one particular suggestion is right IMO and we'd want to be able for a perf_event to just be a list of callbacks, attached to a task and barely more than a discoverable, named notifier chain in its slimmest form. In practice it's fatter than that right now, but we should definitely factor out that aspect of it more clearly, both code-wise and API-wise. kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is possible and desirable. 3. 
seccomp (as it exists today) provides a global system call entry hook point with a binary per-process decision about whether to provide secure computing behavior. When I boil that down to abstractions, I see: A. Globally scoped: LSMs, ftrace/perf B. Locally/process scoped: seccomp Ok, i see where you got the idea that you needed to cut your surface of abstraction at the filter engine / syscall enumeration level - i think you were thinking of it in the ftrace model of tracepoints, not in the perf model of events. No, events are generic and as such per task as well, not just global. I've replied to your other mail with more specific suggestions of how we could provide your feature using abstractions that share code more widely. Talking specifics will i hope help move the discussion forward! :-) Agreed. I'll digest both the watchdog code as well as your other comments and follow up when I have a fuller picture in my head. (I have a few initial comments I'll post in response to your other mail.) Thanks! will ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* James Morris jmor...@namei.org wrote:

> On Fri, 13 May 2011, Ingo Molnar wrote:
>
> > Say i'm a user-space sandbox developer who wants to enforce that sandboxed code should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/. It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.
>
> Not really. Firstly, what is the security goal of these restrictions? [...]

To do what i described above? Namely: Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/

> [...] Then, are the restrictions complete and unbypassable?

If only the system calls i mentioned are allowed, and if the sandboxed VFS namespace itself is isolated from the rest of the system (no bind mounts, no hard links outside the sandbox, etc.) then its goal is to not be bypassable - what use is a sandbox if the sandbox can be bypassed by the sandboxed code?

There are a few ways to alter (and thus bypass) VFS namespace lookups: symlinks, chdir, chroot, rename, etc., which (as i mentioned) have to be excluded by default or filtered as well.

> How do you reason about the behavior of the system as a whole?

For some usecases i mainly want to reason about what the sandboxed code can do and can not do, within a fairly static and limited VFS namespace environment.

I might not want to have a full-blown 'physical barrier' for all objects labeled as inaccessible to sandboxed code (or labeled as accessible to sandboxed code).

Especially as manipulating file labels is not only slow (affects all files) but is also often an exclusively privileged operation even for owned files, for no good reason. For things like /lib/ and /usr/lib/ it also *has* to be a privileged operation.

> > I argue that this is the LSM and audit subsystems designed right: in the long run it could allow everything that LSM does at the moment - and so much more ...
>
> Now you're proposing a redesign of the security subsystem. That's a significant undertaking.

It certainly is.

> In the meantime, we have a simple, well-defined enhancement to seccomp which will be very useful to current users in reducing their kernel attack surface. We should merge that, and the security subsystem discussion can carry on separately.

Is that the development and merge process along which the LSM subsystem got into its current state?

Thanks,

	Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Mon, May 16, 2011 at 10:26 AM, Steven Rostedt rost...@goodmis.org wrote:

> Sorry to be absent from this thread so far, I just got back from my travels and I'm now catching up on email.
>
> On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:
>
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index 377a7a5..22e1668 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -1664,6 +1664,16 @@ config SECCOMP
> >  	  and the task is only allowed to execute a few safe syscalls
> >  	  defined by each seccomp mode.
> >
> > +config SECCOMP_FILTER
> > +	bool "Enable seccomp-based system call filtering"
> > +	depends on SECCOMP && EXPERIMENTAL
> > +	help
> > +	  Per-process, inherited system call filtering using shared code
> > +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> > +	  is not available, enhanced filters will not be available.
> > +
> > +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> > +
> >  config CC_STACKPROTECTOR
> >  	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
> >  	depends on EXPERIMENTAL
> > diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
> > index eccdefe..7641ee9 100644
> > --- a/arch/microblaze/Kconfig
> > +++ b/arch/microblaze/Kconfig
> > @@ -129,6 +129,16 @@ config SECCOMP
> >
> >  	  If unsure, say Y. Only embedded should say N here.
> >
> > +config SECCOMP_FILTER
> > +	bool "Enable seccomp-based system call filtering"
> > +	depends on SECCOMP && EXPERIMENTAL
> > +	help
> > +	  Per-process, inherited system call filtering using shared code
> > +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> > +	  is not available, enhanced filters will not be available.
> > +
> > +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> > +
> >  endmenu
> >
> >  menu "Advanced setup"
> > diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
> > index 8e256cc..fe4cbda 100644
> > --- a/arch/mips/Kconfig
> > +++ b/arch/mips/Kconfig
> > @@ -2245,6 +2245,16 @@ config SECCOMP
> >
> >  	  If unsure, say Y. Only embedded should say N here.
> >
> > +config SECCOMP_FILTER
> > +	bool "Enable seccomp-based system call filtering"
> > +	depends on SECCOMP && EXPERIMENTAL
> > +	help
> > +	  Per-process, inherited system call filtering using shared code
> > +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> > +	  is not available, enhanced filters will not be available.
> > +
> > +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> > +
> >  config USE_OF
> >  	bool "Flattened Device Tree support"
> >  	select OF
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 8f4d50b..83499e4 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -605,6 +605,16 @@ config SECCOMP
> >
> >  	  If unsure, say Y. Only embedded should say N here.
> >
> > +config SECCOMP_FILTER
> > +	bool "Enable seccomp-based system call filtering"
> > +	depends on SECCOMP && EXPERIMENTAL
> > +	help
> > +	  Per-process, inherited system call filtering using shared code
> > +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> > +	  is not available, enhanced filters will not be available.
> > +
> > +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> > +
> >  endmenu
> >
> >  config ISA_DMA_API
> > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> > index 2508a6f..2777515 100644
> > --- a/arch/s390/Kconfig
> > +++ b/arch/s390/Kconfig
> > @@ -614,6 +614,16 @@ config SECCOMP
> >
> >  	  If unsure, say Y.
> >
> > +config SECCOMP_FILTER
> > +	bool "Enable seccomp-based system call filtering"
> > +	depends on SECCOMP && EXPERIMENTAL
> > +	help
> > +	  Per-process, inherited system call filtering using shared code
> > +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> > +	  is not available, enhanced filters will not be available.
> > +
> > +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> > +
> >  endmenu
> >
> >  menu "Power Management"
> > diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
> > index 4b89da2..00c1521 100644
> > --- a/arch/sh/Kconfig
> > +++ b/arch/sh/Kconfig
> > @@ -676,6 +676,16 @@ config SECCOMP
> >
> >  	  If unsure, say N.
> >
> > +config SECCOMP_FILTER
> > +	bool "Enable seccomp-based system call filtering"
> > +	depends on SECCOMP && EXPERIMENTAL
> > +	help
> > +	  Per-process, inherited system call filtering using shared code
> > +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> > +	  is not available, enhanced filters will not be available.
> > +
> > +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> > +
> >  config SMP
> >  	bool "Symmetric multi-processing support"
> >  	depends on SYS_SUPPORTS_SMP
> > diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> > index e560d10..5b42255 100644
> > --- a/arch/sparc/Kconfig
> > +++ b/arch/sparc/Kconfig
> > @@ -270,6 +270,16 @@ config SECCOMP
> >
> >  	  If unsure, say Y. Only embedded should say N here.
> >
> > +config SECCOMP_FILTER
> > +	bool "Enable seccomp-based system call filtering"
> > +	depends on SECCOMP && EXPERIMENTAL
> > +	help
> > +	  Per-process, inherited system call filtering using shared code
> > +
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Mon, May 16, 2011 at 7:43 AM, Ingo Molnar mi...@elte.hu wrote:

> * Will Drewry w...@chromium.org wrote:
>
> > > Note, i'm not actually asking for the moon, a pony and more.
> > >
> > > I fully submit that we are yet far away from being able to do a full LSM via this mechanism.
> > >
> > > What i'm asking for is that, because the syscall point steps taken by Will look very promising, it would be nice to do *that* in a slightly more flexible scheme that does not condemn it to be limited to the syscall boundary and such ...
> >
> > What do you suggest here?
> >
> > From my brief exploration of the ftrace/perf (and seccomp) code, I don't see a clean way of integrating over the existing interfaces to the ftrace framework (e.g., the global perf event pump seems to be a mismatch), but I may be missing something obvious. [...]
>
> Well, there's no global perf event pump. Here we'd obviously want to use buffer-less events that do no output (pumping) whatsoever - i.e. just counting events with filters attached.

Cool - I missed these entirely. I'll get reading :)

> What i suggest is to:
>
>  - share syscall events          # you are fine with that as your patch makes use of them
>  - share the scripting engine    # you are fine with that as your patch makes use of it
>
>  - share *any* other event to do_exit() a process at syscall exit time
>
>  - share any other active event that kernel developers specifically enable for active use to impact security-relevant execution even sooner than syscall exit time - not just system calls
>
>  - share the actual facility that manages (sets/gets) filters

These make sense to me at a high level. I'll throw in a few initial comments, but I'll be back for a round-two once I catch up on the rest of the perf code.

> So right now you have this structure for your new feature:
>
>  Documentation/trace/seccomp_filter.txt |   75 +
>  arch/x86/kernel/ldt.c                  |    5
>  arch/x86/kernel/tls.c                  |    7
>  fs/proc/array.c                        |   21 +
>  fs/proc/base.c                         |   25 +
>  include/linux/ftrace_event.h           |    9
>  include/linux/seccomp.h                |   98 +++
>  include/linux/syscalls.h               |   54 ++--
>  include/trace/syscall.h                |    6
>  kernel/Makefile                        |    1
>  kernel/fork.c                          |    8
>  kernel/perf_event.c                    |    7
>  kernel/seccomp.c                       |  144 ++-
>  kernel/seccomp_filter.c                |  428 +
>  kernel/sys.c                           |   11
>  kernel/trace/Kconfig                   |   10
>  kernel/trace/trace_events_filter.c     |   60 ++--
>  kernel/trace/trace_syscalls.c          |   96 ++-
>  18 files changed, 986 insertions(+), 79 deletions(-)
>
> Firstly, one specific problem i can see is that kernel/seccomp_filter.c hardcodes to the system call boundary. Which is fine for a prototype implementation (and obviously fine for something that builds upon seccomp) but not fine in terms of a flexible Linux security feature :-)
>
> You have hardcoded these syscall assumptions via:
>
> 	struct seccomp_filter {
> 		struct list_head list;
> 		struct rcu_head rcu;
> 		int syscall_nr;
> 		struct syscall_metadata *data;
> 		struct event_filter *event_filter;
> 	};

(This structure is a bit different in the new rev of the patch, but nothing relevant to this specific part of the discussion :)

> Which comes just because you chose to enumerate only syscall events - instead of enumerating all events.

While I'm willing to live with the tradeoff, using ftrace event numbers from FTRACE_SYSCALLS means that the filter will be unable to hook a fair number of syscalls: execve, clone, etc (all the ptregs fixup syscalls on x86) and anything that returns int instead of long (on x86). Though the last two patches in the initial patch series provided a proposed clean up for the latter :/

The current revision of the seccomp filter patch can function in a bitmask-like state when CONFIG_FTRACE is unset or CONFIG_FTRACE_SYSCALLS is unset. This also means any platform with CONFIG_SECCOMP support can start using this right away, but I realize that is more of a short-term gain rather than a long-term one.

> Instead of that please bind to all events - syscalls are just one of the many events we have. Type 'perf list' and see how many event types we have, and quite a few could be enabled for 'active feedback' sandboxing as well.
>
> Secondly, and this is really a variant of the first problem you have, the way you process event filter 'failures' is pretty limited. You utilize the regular seccomp method which works by calling into __secure_computing() and silently accepting syscalls or generating a hard do_exit() on even the slightest of filter failures.
>
> Instead of that, what we'd want is to have regular syscall events registered, *and* enabled for such active filtering purposes.
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
Sorry to be absent from this thread so far, I just got back from my travels and I'm now catching up on email.

On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:

> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 377a7a5..22e1668 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -1664,6 +1664,16 @@ config SECCOMP
>  	  and the task is only allowed to execute a few safe syscalls
>  	  defined by each seccomp mode.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config CC_STACKPROTECTOR
>  	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>  	depends on EXPERIMENTAL
> diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
> index eccdefe..7641ee9 100644
> --- a/arch/microblaze/Kconfig
> +++ b/arch/microblaze/Kconfig
> @@ -129,6 +129,16 @@ config SECCOMP
>
>  	  If unsure, say Y. Only embedded should say N here.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>
>  menu "Advanced setup"
> diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
> index 8e256cc..fe4cbda 100644
> --- a/arch/mips/Kconfig
> +++ b/arch/mips/Kconfig
> @@ -2245,6 +2245,16 @@ config SECCOMP
>
>  	  If unsure, say Y. Only embedded should say N here.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config USE_OF
>  	bool "Flattened Device Tree support"
>  	select OF
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 8f4d50b..83499e4 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -605,6 +605,16 @@ config SECCOMP
>
>  	  If unsure, say Y. Only embedded should say N here.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>
>  config ISA_DMA_API
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 2508a6f..2777515 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -614,6 +614,16 @@ config SECCOMP
>
>  	  If unsure, say Y.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>
>  menu "Power Management"
> diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
> index 4b89da2..00c1521 100644
> --- a/arch/sh/Kconfig
> +++ b/arch/sh/Kconfig
> @@ -676,6 +676,16 @@ config SECCOMP
>
>  	  If unsure, say N.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SMP
>  	bool "Symmetric multi-processing support"
>  	depends on SYS_SUPPORTS_SMP
> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> index e560d10..5b42255 100644
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -270,6 +270,16 @@ config SECCOMP
>
>  	  If unsure, say Y. Only embedded should say N here.
>
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls. If
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Steven Rostedt rost...@goodmis.org wrote:

> I'm a bit nervous about the 'active' role of (trace_)events, because of the way multiple callbacks can be registered. How would:
>
> 	err = event_x();
> 	if (err == -EACCES) {
>
> be handled? [...]

The default behavior would be something obvious: to trigger all callbacks and use the first non-zero return value.

> [...] Would we need a way to prioritize which call back gets the return value? One way I guess would be to add a check_event option, where you pass in an ENUM of the event you want:
>
> 	event_x();
> 	err = check_event_x(MYEVENT);
>
> If something registered itself as MYEVENT to event_x, then you get the return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or something could be returned. I'm sure we could even optimize it such a way if no active events have been registered to event_x, that check_event_x() will return -ENODEV without any branches.

I would keep it simple and extensible - that way we can complicate it when the need arises! :)

Thanks,

	Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:

> * Steven Rostedt rost...@goodmis.org wrote:
>
> > I'm a bit nervous about the 'active' role of (trace_)events, because of the way multiple callbacks can be registered. How would:
> >
> > 	err = event_x();
> > 	if (err == -EACCES) {
> >
> > be handled? [...]
>
> The default behavior would be something obvious: to trigger all callbacks and use the first non-zero return value.

But how do we know which callback that was from? There's no ordering of what callbacks are called first.

> > [...] Would we need a way to prioritize which call back gets the return value? One way I guess would be to add a check_event option, where you pass in an ENUM of the event you want:
> >
> > 	event_x();
> > 	err = check_event_x(MYEVENT);
> >
> > If something registered itself as MYEVENT to event_x, then you get the return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or something could be returned. I'm sure we could even optimize it such a way if no active events have been registered to event_x, that check_event_x() will return -ENODEV without any branches.
>
> I would keep it simple and extensible - that way we can complicate it when the need arises! :)

The above is rather trivial to implement. I don't think it complicates anything.

-- Steve
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Mon, 16 May 2011, Ingo Molnar wrote:

> > Not really. Firstly, what is the security goal of these restrictions? [...]
>
> To do what i described above? Namely: Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/

These are access rules, they don't really describe a high-level security goal. How do you know it's ok to open everything in these directories?

- James
--
James Morris jmor...@namei.org
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Saturday 14 May 2011, Will Drewry wrote:

> Depending on integration, it could even be limited to ioctl commands that are appropriate to a known fd if the fd is opened prior to entering seccomp mode 2. Alternatively, __NR_ioctl could be allowed with a filter of 1 then narrowed through a later addition of something like (fd == %u && (cmd == %u || cmd == %u)) or something along those lines.
>
> Does that make sense?

Thanks for the explanation. This sounds like it's already doing all we need.

	Arnd
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 13 May 2011, Ingo Molnar wrote:

> Say i'm a user-space sandbox developer who wants to enforce that sandboxed code should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/. It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.

Not really.

Firstly, what is the security goal of these restrictions? Then, are the restrictions complete and unbypassable? How do you reason about the behavior of the system as a whole?

> I argue that this is the LSM and audit subsystems designed right: in the long run it could allow everything that LSM does at the moment - and so much more ...

Now you're proposing a redesign of the security subsystem. That's a significant undertaking.

In the meantime, we have a simple, well-defined enhancement to seccomp which will be very useful to current users in reducing their kernel attack surface. We should merge that, and the security subsystem discussion can carry on separately.

- James
--
James Morris jmor...@namei.org
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote:

> On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
>
> > this is a security mechanism
>
> Who says? [...]

Kernel developers/maintainers of the affected code.

We have security hooks all around the kernel, which can deny/accept execution at various key points, but we do not have 'execute arbitrary user-space defined (safe) scripts' callbacks in general.

But yes, if a particular callback point is defined widely enough to allow much bigger intervention into the flow of execution, then more is possible as well.

> [...] and why would you want to unify two separate concepts only to then limit it to security that just doesn't make sense.

I don't limit them to security - the callbacks themselves are either for passive observation or, at most, for security accept/deny callbacks.

It's decided by the subsystem maintainers what kind of user-space control power (or observation power) they want to allow, not me.

I would just like to not stop the facility itself at the 'observe only' level, like you suggest.

Thanks,

	Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Sat, May 14, 2011 at 2:30 AM, Ingo Molnar mi...@elte.hu wrote:

> * Eric Paris epa...@redhat.com wrote:
>
> > [dropping microblaze and roland]
> >
> > On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> > > * James Morris jmor...@namei.org wrote:
> > >
> > > > > It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.
> >
> > It's simple enough and sounds reasonable, but you can read all the discussion about AppArmor why many people don't really think it's the best. [...]
>
> I have to say most of the participants of the AppArmor flamefests were dead wrong, and it wasn't the AppArmor folks who were wrong ...
>
> The straight ASCII VFS namespace *makes sense*, and yes, the raw physical objects space that SELinux uses makes sense as well.
>
> And no, i do not subscribe to the dogma that it is not possible to secure the ASCII VFS namespace: it evidently is possible, if you know and handle the ambiguities. It is also obviously true that the ASCII VFS namespaces we use every day are a *lot* more intuitive than the labeled physical objects space ...
>
> What all the security flamewars missed is the simple fact that being intuitive matters a *lot* not just to not annoy users, but also to broaden the effective number of security-conscious developers ...
>
> > > Unfortunately this audit callback cannot be used for my purposes, because the event is single-purpose for auditd and because it allows no feedback (no deny/accept discretion for the security policy).
> > >
> > > But if i had this simple event there:
> > >
> > > 	err = event_vfs_getname(result);
> >
> > Wow it sounds so easy. Now let's keep extending your train of thought until we can actually provide the security provided by SELinux. What do we end up with? We end up with an event hook right next to every LSM hook. You know, the LSM hooks were placed where they are for a reason. Because those were the locations inside the kernel where you actually have information about the task doing an operation and the objects (files, sockets, directories, other tasks, etc) they are doing an operation on.
> >
> > Honestly all you are talking about is remaking the LSM with 2 sets of hooks instead of 1. Why? [...]
>
> Not at all. I am talking about using *one* set of events, to keep the intrusion at the lowest possible level.
>
> LSM could make use of them as well.
>
> Obviously for pragmatic reasons that might not be feasible initially.
>
> > [...] It seems much easier that if you want the language of the filter engine you would just make a new LSM that uses the filter engine for its policy language rather than the language created by SELinux or SMACK or name your LSM implementation.
>
> Correct, that is what i suggested.
>
> Note that performance is a primary concern, so if certain filters are very popular then in practice this would come with support for a couple of 'built in' (pre-optimized) filters that the kernel can accelerate directly, so that we do not incur the cost of executing the filter preds for really common-sense security policies that almost everyone is using.
>
> I.e. in the end we'd *roughly* end up with the same performance and security as we have today (i mean, SELinux and the other LSMs did a nice job of collecting the things that apps should be careful about), but the crucial difference isn't just the advantages i mentioned, but the fact that the *development model* of security modules would be a *lot* more extensible.
>
> So security efforts could move to a whole different level: they could move into key apps and they could integrate with the general mind-set of developers.
>
> At least Will as an application framework developer cares, so that hope is justified i think.
>
> > > - unprivileged: application-definable, allowing the embedding of security policy in *apps* as well, not just the system
> > >
> > > - flexible: can be added/removed runtime unprivileged, and cheaply so
> > >
> > > - transparent: does not impact executing code that meets the policy
> > >
> > > - nestable: it is inherited by child tasks and is fundamentally stackable, multiple policies will have the combined effect and they are transparent to each other. So if a child task within a sandbox adds *more* checks then those add to the already existing set of checks. We only narrow permissions, never extend them.
> > >
> > > - generic: allowing observation and (safe) control of security relevant parameters not just at the system call boundary but at other relevant places of kernel execution as well: which points/callbacks could also be used for other types of event extraction such as perf. It could even be shared with audit ...
> >
> > I'm not arguing that any of these things are bad things. What you describe is a
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Thursday 12 May 2011, Will Drewry wrote:

> This change adds a new seccomp mode based on the work by a...@chromium.org in [1]. This new mode, filter mode, provides a hash table of seccomp_filter objects. When in the new mode (2), all system calls are checked against the filters - first by system call number, then by a filter string. If an entry exists for a given system call and all filter predicates evaluate to true, then the task may proceed. Otherwise, the task is killed (as per seccomp_mode == 1).

I've got a question about this: Do you expect the typical usage to disallow ioctl()? Given that ioctl alone is responsible for a huge number of exploits in various drivers, while certain ioctls are immensely useful (FIONREAD, FIOASYNC, ...), do you expect to extend the mechanism to filter specific ioctl commands in the future?

	Arnd
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, May 13, 2011 at 2:35 PM, Arnd Bergmann a...@arndb.de wrote:

> On Thursday 12 May 2011, Will Drewry wrote:
>
> > This change adds a new seccomp mode based on the work by a...@chromium.org in [1]. This new mode, filter mode, provides a hash table of seccomp_filter objects. When in the new mode (2), all system calls are checked against the filters - first by system call number, then by a filter string. If an entry exists for a given system call and all filter predicates evaluate to true, then the task may proceed. Otherwise, the task is killed (as per seccomp_mode == 1).
>
> I've got a question about this: Do you expect the typical usage to disallow ioctl()? Given that ioctl alone is responsible for a huge number of exploits in various drivers, while certain ioctls are immensely useful (FIONREAD, FIOASYNC, ...), do you expect to extend the mechanism to filter specific ioctl commands in the future?

In many cases, I do expect ioctls to be dropped, but it's totally up to whoever is setting the filters.

As is, it can already help out: [even though an LSM, if available, would be appropriate to define a fine-grained policy]

ioctl() is hooked by the ftrace syscalls infrastructure (via SYSCALL_DEFINE3):

	SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)

This means you can do:

	sprintf(filter, "cmd == %u || cmd == %u", FIOASYNC, FIONREAD);
	prctl(PR_SET_SECCOMP_FILTER, __NR_ioctl, filter);
	...
	prctl(PR_SET_SECCOMP, 2, 0);

and then you'll be able to call ioctl on any fd with any argument but limited to only the FIOASYNC and FIONREAD commands.

Depending on integration, it could even be limited to ioctl commands that are appropriate to a known fd if the fd is opened prior to entering seccomp mode 2. Alternatively, __NR_ioctl could be allowed with a filter of 1 then narrowed through a later addition of something like (fd == %u && (cmd == %u || cmd == %u)) or something along those lines.

Does that make sense?

In general, this interface won't need specific extensions for most system call oriented filtering events. ftrace events may be expanded (to include more system calls), but that's behind the scenes. Only arguments subject to time-of-check-time-of-use attacks (data living in userspace passed in by pointer) are not safe to use via this interface. In theory, that limitation could also be lifted in the implementation without changing the ABI.

Thanks!
will
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote:

> On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
>
> > 	err = event_vfs_getname(result);
>
> I really think we should not do this. Events like we have them should be inactive, totally passive entities, only observe but not affect execution (other than the bare minimal time delay introduced by observance).

Well, this patchset already demonstrates that we can use a single event callback for a rather useful purpose.

Either it makes sense to do, in which case we should share facilities as much as possible, or it makes no sense, in which case we should not merge it at all.

> If you want another entity that is more active, please invent a new name for it and create a new subsystem for them, now you could have these active entities also have an (automatic) passive event side, but that's some detail.

Why should we have two callbacks next to each other:

	event_vfs_getname(result);
	result = check_event_vfs_getname(result);

if one could do it all?

Thanks,

	Ingo
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 2011-05-13 at 14:26 +0200, Ingo Molnar wrote:

> * Peter Zijlstra pet...@infradead.org wrote:
>
> > On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> >
> > > 	err = event_vfs_getname(result);
> >
> > I really think we should not do this. Events like we have them should be inactive, totally passive entities, only observe but not affect execution (other than the bare minimal time delay introduced by observance).
>
> Well, this patchset already demonstrates that we can use a single event callback for a rather useful purpose.

Can and should are two distinct things.

> Either it makes sense to do, in which case we should share facilities as much as possible, or it makes no sense, in which case we should not merge it at all.

And I'm arguing we should _not_. Observing is radically different from Affecting, at the very least the two things should have different permission schemes. We should not confuse these two matters.

> > If you want another entity that is more active, please invent a new name for it and create a new subsystem for them, now you could have these active entities also have an (automatic) passive event side, but that's some detail.
>
> Why should we have two callbacks next to each other:
>
> 	event_vfs_getname(result);
> 	result = check_event_vfs_getname(result);
>
> if one could do it all?

Did you actually read the bit where I said that check_event_* (although I still think that name sucks) could imply a matching event_*?
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote:

> 	event_vfs_getname(result);
> 	result = check_event_vfs_getname(result);

Another fundamental difference is how to treat the callback chains for these two.

Observers won't have a return value and are assumed to never fail, therefore we can always call every entry on the callback list.

Active things otoh do have a return value, and thus we need to have semantics that define what to do with that during callback iteration, when to continue and when to break. Thus for active elements it's impossible to guarantee all entries will indeed be called.
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote: Why should we have two callbacks next to each other: event_vfs_getname(result); result = check_event_vfs_getname(result); if one could do it all? Did you actually read the bit where I said that check_event_* (although I still think that name sucks) could imply a matching event_*? No, did not notice that - and yes that solves this particular problem. So given that by your own admission it makes sense to share the facilities at the low level, i also argue that it makes sense to share as high up as possible. Are you perhaps arguing for a -observe flag that would make 100% sure that the default behavior for events is observe-only? That would make sense indeed. Otherwise both cases really want to use all the same facilities for event discovery, setup, control and potential extraction of events. Thanks, Ingo ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote: On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote: event_vfs_getname(result); result = check_event_vfs_getname(result); Another fundamental difference is how to treat the callback chains for these two. Observers won't have a return value and are assumed to never fail, therefore we can always call every entry on the callback list. Active things otoh do have a return value, and thus we need to have semantics that define what to do with that during callback iteration, when to continue and when to break. Thus for active elements its impossible to guarantee all entries will indeed be called. I think the sanest semantics is to run all active callbacks as well. For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure. Even if the first one failed already we'd still want to trigger *all* the failures, because security policies like to know when they have triggered a failure (regardless of other active policies) and want to see that failure event (if they are logging such events). So to me this looks pretty similar to observer callbacks as well, it's the natural extension to an observer callback chain. Observer callbacks are simply constant functions (to the caller), those which never return failure and which never modify any of the parameters. It's as if you argued that there should be separate syscalls/facilities for handling readonly files versus handling read/write files. Thanks, Ingo ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote: I think the sanest semantics is to run all active callbacks as well. For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure. But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code. Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it. Then there's the whole indirection argument, if you don't need indirection, its often better to not use it, I myself much prefer code to look like: foo1(bar); foo2(bar); foo3(bar); Than: foo_notifier(bar); Simply because its much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas its nearly impossible to find new notifier users. Its also much faster, no extra memory accesses, no indirect function calls, no other muck. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote: On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote: I think the sanest semantics is to run all active callbacks as well. For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure. But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code. Do you mean that one filter returns -EINVAL while the other -EACCES? Seems like a non-problem to me, we'd return the first nonzero value. Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it. Do you mean it would be easy to abuse it? What kind of abuse are you most worried about? Then there's the whole indirection argument, if you don't need indirection, its often better to not use it, I myself much prefer code to look like: foo1(bar); foo2(bar); foo3(bar); Than: foo_notifier(bar); Simply because its much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas its nearly impossible to find new notifier users. Its also much faster, no extra memory accesses, no indirect function calls, no other muck. But i suspect this question has been settled, given the fact that even pure observer events need and already process a chain of events? Am i missing something about your argument? Thanks, Ingo ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
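The "run every callback, return the first nonzero value" semantics proposed here might look roughly like the following user-space sketch. The names `policy_fn`, `run_all_policies` and the three sample policies are hypothetical; nothing below comes from the patch set itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical policy callback: 0 = allow, negative errno = deny. */
typedef int (*policy_fn)(const char *name);

/*
 * Call every policy so that each one sees (and can log) the event, but
 * report only the first nonzero result to the caller.
 */
int run_all_policies(const policy_fn *chain, size_t n, const char *name)
{
	int first_err = 0;

	assert(chain != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int err = chain[i](name);
		if (err && !first_err)
			first_err = err;
	}
	return first_err;
}

/* Sample policies that count invocations. */
int policy_calls;

int policy_allow(const char *name) { (void)name; policy_calls++; return 0; }
int policy_einval(const char *name) { (void)name; policy_calls++; return -EINVAL; }
int policy_eacces(const char *name) { (void)name; policy_calls++; return -EACCES; }
```

For a chain of `policy_einval`, `policy_eacces`, `policy_allow`, all three run and the caller gets `-EINVAL`; the `-EACCES` result is computed but discarded, which is the cost questioned later in the thread.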
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote:
> So given that by your own admission it makes sense to share the facilities at the low level, i also argue that it makes sense to share as high up as possible.

I'm not saying any such thing. I'm saying that it might make sense to observe active objects and auto-create these observation points. That doesn't make them similar or make them share anything.
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
Cut the microblaze list since its bouncy. On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote: * Peter Zijlstra pet...@infradead.org wrote: On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote: I think the sanest semantics is to run all active callbacks as well. For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure. But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code. Do you mean that one filter returns -EINVAL while the other -EACCES? Seems like a non-problem to me, we'd return the first nonzero value. Assuming the first is -EINVAL, what then is the value in computing the -EACCESS? Sounds like a massive waste of time to me. Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it. Do you mean it would be easy to abuse it? What kind of abuse are you most worried about? I'm not worried about abuse, I'm saying that going by the existing notifier pattern always visiting all entries on the callback list is undesired. Then there's the whole indirection argument, if you don't need indirection, its often better to not use it, I myself much prefer code to look like: foo1(bar); foo2(bar); foo3(bar); Than: foo_notifier(bar); Simply because its much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas its nearly impossible to find new notifier users. Its also much faster, no extra memory accesses, no indirect function calls, no other muck. 
But i suspect this question has been settled, given the fact that even pure observer events need and already process a chain of events? Am i missing something about your argument? I'm saying that there's reasons to not use notifiers passive or active. Mostly the whole notifier/indirection muck comes up once you want modules to make use of the thing, because then you need dynamic management of the callback list. (Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c for all the cpu-hotplug callbacks :-) Anyway, I oppose for the existing events to gain an active role. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote: Cut the microblaze list since its bouncy. On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote: * Peter Zijlstra pet...@infradead.org wrote: On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote: I think the sanest semantics is to run all active callbacks as well. For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure. But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code. Do you mean that one filter returns -EINVAL while the other -EACCES? Seems like a non-problem to me, we'd return the first nonzero value. Assuming the first is -EINVAL, what then is the value in computing the -EACCESS? Sounds like a massive waste of time to me. No, because the common case is no rejection - this is a security mechanism. So in the normal case we would execute all 3 anyway, just to determine that all return 0. Are you really worried about the abnormal case of one of them returning an error and us calculating all 3 return values? Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it. Do you mean it would be easy to abuse it? What kind of abuse are you most worried about? I'm not worried about abuse, I'm saying that going by the existing notifier pattern always visiting all entries on the callback list is undesired. That is because many notifier chains are used in an 'event consuming' manner - they are responding to things like hardware events and are called in an interrupt-handler alike fashion most of the time. 
Then there's the whole indirection argument, if you don't need indirection, its often better to not use it, I myself much prefer code to look like: foo1(bar); foo2(bar); foo3(bar); Than: foo_notifier(bar); Simply because its much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas its nearly impossible to find new notifier users. Its also much faster, no extra memory accesses, no indirect function calls, no other muck. But i suspect this question has been settled, given the fact that even pure observer events need and already process a chain of events? Am i missing something about your argument? I'm saying that there's reasons to not use notifiers passive or active. Mostly the whole notifier/indirection muck comes up once you want modules to make use of the thing, because then you need dynamic management of the callback list. But your argument assumes that we'd have a chain of functions to call, like regular notifiers. While the natural model here would be to have a list of registered event structs for that point, with different filters but basically the same callback mechanism (a call into the filter engine in essence). Also note that the common case would be no event registered - and we'd automatically optimize that case via the existing jump labels optimization. (Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c for all the cpu-hotplug callbacks :-) Anyway, I oppose for the existing events to gain an active role. Why if 'being active' is optional and useful? Thanks, Ingo ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Peter Zijlstra pet...@infradead.org wrote: On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote: So given that by your own admission it makes sense to share the facilities at the low level, i also argue that it makes sense to share as high up as possible. I'm not saying any such thing, I'm saying that it might make sense to observe active objects and auto-create these observation points. That doesn't make them similar or make them share anything. Well, they would share the lowest level call site: result = check_event_vfs_getname(result); You call it 'auto-generated call site', i call it a shared (single line) call site. The same thing as far as the lowest level goes. Now (the way i understood it) you'd want to stop the sharing right after that. I argue that it should go all the way up. Note: i fully agree that there should be events where filters can have no effect whatsoever. For example if this was written as: check_event_vfs_getname(result); Then it would have no effect. This is decided by the subsystem developers, obviously. So whether an event is 'active' or 'passive' can be enforced at the subsystem level as well. As far as the event facilities go, 'no effect observation' is a special-case of 'active observation' - just like read-only files are a special case of read-write files. Thanks, Ingo ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
[dropping microblaze and roland]

On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> * James Morris jmor...@namei.org wrote:
>
> It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.

It's simple enough and sounds reasonable, but you can read all the discussion about AppArmor to see why many people don't really think it's the best. Still, I'll agree it's a lot better than nothing.

> But if i had a VFS event at the fs/namei.c::getname() level, i would have access to a central point where the VFS string becomes stable to the kernel and can be checked (and denied if necessary).
>
> A sidenote, and not surprisingly, the audit subsystem already has an event callback there:
>
>	audit_getname(result);
>
> Unfortunately this audit callback cannot be used for my purposes, because the event is single-purpose for auditd and because it allows no feedback (no deny/accept discretion for the security policy).
>
> But if i had this simple event there:
>
>	err = event_vfs_getname(result);

Wow it sounds so easy. Now let's keep extending your train of thought until we can actually provide the security provided by SELinux. What do we end up with? We end up with an event hook right next to every LSM hook. You know, the LSM hooks were placed where they are for a reason. Because those were the locations inside the kernel where you actually have information about the task doing an operation and the objects (files, sockets, directories, other tasks, etc) they are doing an operation on.

Honestly all you are talking about is remaking the LSM with 2 sets of hooks instead of 1. Why? It seems much easier that if you want the language of the filter engine you would just make a new LSM that uses the filter engine for its policy language rather than the language created by SELinux or SMACK or name your LSM implementation.
> - unprivileged: application-definable, allowing the embedding of security policy in *apps* as well, not just the system
> - flexible: can be added/removed runtime unprivileged, and cheaply so
> - transparent: does not impact executing code that meets the policy
> - nestable: it is inherited by child tasks and is fundamentally stackable, multiple policies will have the combined effect and they are transparent to each other. So if a child task within a sandbox adds *more* checks then those add to the already existing set of checks. We only narrow permissions, never extend them.
> - generic: allowing observation and (safe) control of security relevant parameters not just at the system call boundary but at other relevant places of kernel execution as well: which points/callbacks could also be used for other types of event extraction such as perf. It could even be shared with audit ...

I'm not arguing that any of these things are bad things. What you describe is a new LSM that uses a discretionary access control model but with the granularity and flexibility that has traditionally only existed in the mandatory access control security modules previously implemented in the kernel.

I won't argue that's a bad idea; there's no reason in my mind that a process shouldn't be allowed to control its own access decisions in a more flexible way than rwx bits. Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...

-Eric
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
[dropping microblaze and roland]

On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
> * Peter Zijlstra pet...@infradead.org wrote:
> > On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > > I think the sanest semantics is to run all active callbacks as well. For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure.
> >
> > But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code.
>
> Do you mean that one filter returns -EINVAL while the other -EACCES? Seems like a non-problem to me, we'd return the first nonzero value.

Sounds so easy! Why haven't LSMs stacked already? Because what happens if one of these hooks did something stateful? Let's say on open, hook #1 returns EPERM. Hook #2 allocates memory. The open is going to fail and hook #2 is never going to get the close() which should have freed the allocation. If you can be completely stateless it's easier, but there's a reason that stacking security modules is hard. Serge has tried in the past and both dhowells and casey schaufler are working on it right now. Stacking is never as easy as it sounds :)

-Eric
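The stateful-hook problem described above can be made concrete with a toy user-space model. Everything here is hypothetical (`hook1_open`, `hook2_open`, `hook2_close`, `stacked_open`, and the `live_allocations` counter); it only mimics the open/close pairing from the message and is not how any real LSM is written.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy state for hook #2; a real module would hang this off the object. */
int live_allocations;
void *hook2_state;

/* Hook #1 is stateless and simply denies the open. */
int hook1_open(void)
{
	return -EPERM;
}

/* Hook #2 allocates per-open state and expects a matching close. */
int hook2_open(void)
{
	hook2_state = malloc(16);
	if (!hook2_state)
		return -ENOMEM;
	live_allocations++;
	return 0;
}

void hook2_close(void)
{
	assert(hook2_state != NULL);
	free(hook2_state);
	hook2_state = NULL;
	live_allocations--;
}

/*
 * "Call every hook, return the first nonzero value": hook #2 still runs
 * and allocates even though hook #1 has already failed the open.
 */
int stacked_open(void)
{
	int err1 = hook1_open();
	int err2 = hook2_open();

	return err1 ? err1 : err2;
}
```

After `stacked_open()` returns `-EPERM` the caller never issues a close, so `live_allocations` stays at 1 unless the stacking layer adds its own unwind step that calls `hook2_close()` on failure.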
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
> Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...

Which makes me wonder why this syscall hardening stuff is done outside of LSM? Why isn't it part of the LSM, so that, say, SELinux can have a syscall bitmask per security context?

Making it part of the LSM also avoids having to add this prctl().
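For readers unfamiliar with the idea, a "syscall bitmask per security context" could be as simple as the sketch below. The struct name, the helper names and the table size `SKETCH_NR_SYSCALLS` are assumptions made for illustration; no existing LSM is claimed to implement it this way.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_SYSCALLS 512u	/* assumed table size, not a real kernel constant */
#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* One bit per syscall number; a set bit means the syscall is permitted. */
struct syscall_mask {
	unsigned long bits[(SKETCH_NR_SYSCALLS + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS];
};

/* Opt a syscall number in; out-of-range numbers are ignored. */
void syscall_mask_allow(struct syscall_mask *m, unsigned int nr)
{
	assert(m != NULL);
	if (nr < SKETCH_NR_SYSCALLS)
		m->bits[nr / SKETCH_WORD_BITS] |= 1UL << (nr % SKETCH_WORD_BITS);
}

/* Default deny: anything not explicitly opted in is refused. */
bool syscall_mask_permits(const struct syscall_mask *m, unsigned int nr)
{
	assert(m != NULL);
	if (nr >= SKETCH_NR_SYSCALLS)
		return false;
	return (m->bits[nr / SKETCH_WORD_BITS] >> (nr % SKETCH_WORD_BITS)) & 1UL;
}
```

A mask like this can only answer "is syscall N allowed at all"; it carries no argument-level predicates, which is one difference from the filter-string approach in the patch.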
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
> this is a security mechanism

Who says? And why would you want to unify two separate concepts only to then limit it to security? That just doesn't make sense.

Either you provide a full-on replacement for notifier-chain-like things or you don't; only extending trace events in this fashion for security is, like, way weird.

Plus, see the arguments Eric made about stacking stuff: not only security schemes will have those problems.
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* James Morris jmor...@namei.org wrote:

> On Thu, 12 May 2011, Ingo Molnar wrote:
> > Funnily enough, back then you wrote this:
> >
> >   I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes.
> >
> > so when and how did your opinion of this scheme turn from it being an endless parade of failed schemes to it being a well-defined and readily understandable feature? :-)
>
> When it was defined in a way which limited its purpose to reducing the attack surface of the syscall interface.

Let me outline a simple example of a new filter expression based security feature that could be implemented outside the narrow system call boundary you find acceptable, and please tell what is bad about it.

Say i'm a user-space sandbox developer who wants to enforce that sandboxed code should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.

It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.

I would also like my sandbox app to be able to install this policy without having to be root. I do not want the sandbox app to have permission to create labels on /lib and /usr/lib and what not.

Firstly, using the filter code i deny the various link creation syscalls so that sandboxed code cannot escape for example by creating a symlink to outside the permitted VFS namespace. (Note: we opt-in to syscalls, that way new syscalls added by new kernels are denied by default. The current symlink creation syscalls are not opted in to.)
But the next step, actually checking filenames, poses a big hurdle: i cannot implement the filename checking at the sys_open() syscall level in a secure way: because the pathname is passed to sys_open() by pointer, and if i check it at the generic sys_open() syscall level, another thread in the sandbox might modify the underlying filename *after* i've checked it.

But if i had a VFS event at the fs/namei.c::getname() level, i would have access to a central point where the VFS string becomes stable to the kernel and can be checked (and denied if necessary).

A sidenote, and not surprisingly, the audit subsystem already has an event callback there:

	audit_getname(result);

Unfortunately this audit callback cannot be used for my purposes, because the event is single-purpose for auditd and because it allows no feedback (no deny/accept discretion for the security policy).

But if i had this simple event there:

	err = event_vfs_getname(result);

I could implement this new filename based sandboxing policy, using a filter like this installed on the vfs::getname event and inherited by all sandboxed tasks (which cannot uninstall the filter, obviously):

	if (strstr(name, ".."))
		return -EACCES;

	if (strncmp(name, "/home/sandbox/", 14) &&
	    strncmp(name, "/lib/", 5) &&
	    strncmp(name, "/usr/lib/", 9))
		return -EACCES;

	#
	# Note1: Obviously the filter engine would be extended to allow such simple string
	#        match functions.
	#
	# Note2: ".." is disallowed so that sandboxed code cannot escape the restrictions
	#        using "/..".
	#

This kind of flexible and dynamic sandboxing would allow a wide range of file ops within the sandbox, while still isolating it from files not included in the specified VFS namespace.

( Note that there are tons of other examples as well, for useful security features that are best done using events outside the syscall boundary. )

The security event filters code tied to seccomp and syscalls at the moment is useful, but limited in its future potential.
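As a plain C illustration of the filename policy described above, here is a minimal user-space sketch. It reads the policy as "deny any name that contains `..` or that does not start with one of the three permitted prefixes". The function name `sandbox_check_path` is hypothetical, and the standard errno `EACCES` is used for the denial.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Returns 0 if the (already kernel-stable) name is inside the permitted
 * namespace, -EACCES otherwise.
 */
int sandbox_check_path(const char *name)
{
	assert(name != NULL);

	/* Refuse any ".." so the prefix test cannot be bypassed. */
	if (strstr(name, ".."))
		return -EACCES;

	/* strncmp() returns nonzero on mismatch: deny if no prefix matches. */
	if (strncmp(name, "/home/sandbox/", 14) != 0 &&
	    strncmp(name, "/lib/", 5) != 0 &&
	    strncmp(name, "/usr/lib/", 9) != 0)
		return -EACCES;

	return 0;
}
```

The `..` test is deliberately coarse: it also rejects harmless names such as `/lib/foo..bar`, trading a few false denials for a check that is easy to reason about. The sketch also says nothing about relative paths or symlinks, which the message handles separately by denying the link-creation syscalls.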
So i argue that it should go slightly further and should become: - unprivileged: application-definable, allowing the embedding of security policy in *apps* as well, not just the system - flexible: can be added/removed runtime unprivileged, and cheaply so - transparent: does not impact executing code that meets the policy - nestable: it is inherited by child tasks and is fundamentally stackable, multiple policies will have the combined effect and they are transparent to each other. So if a child task within a sandbox adds *more* checks then those add to the already existing set of checks. We only narrow permissions, never extend them. - generic: allowing observation and (safe) control of security relevant parameters not just at the system call boundary but at other relevant places of kernel execution as well: which points/callbacks could also be used for other types of event extraction such
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote:
> On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
> > Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...
>
> Which makes me wonder why this syscall hardening stuff is done outside of LSM? Why isn't it part of the LSM so that say SELinux can have a syscall bitmask per security context?

I could do that, but I like Will's approach better. From the PoV of meeting security goals of information flow, data confidentiality, integrity, least priv, etc, limiting on the syscall boundary doesn't make a lot of sense. You just don't know enough there to enforce these things. These are the types of goals that SELinux and other LSMs have previously tried to enforce. From the PoV of making the kernel more resistant to attacks and making a process more resistant to misbehavior I think that the syscall boundary is appropriate. Although I could do it in SELinux, I don't really want to do it there.

In case people are interested or confused let me give my definition of two words I've used a bit in these conversations: discretionary and mandatory. Any time I talk about a 'discretionary' security decision it is a security decision that a process imposed upon itself. Aka the choice to use seccomp is discretionary. The choice to mark our own file u-wx is discretionary. This isn't the best definition but it's one that works well in this discussion.

Mandatory security is one enforced by a global policy. It's what SELinux is all about. SELinux doesn't give a hoot what a process wants to do, it enforces a global policy from the top down. You take over a process, well, too bad, you still have no choice but to follow the mandatory policy. The LSM does NOT enforce a mandatory access control model, it's just how it's been used in the past.
Ingo appears to me (please correct me if I'm wrong) to really be a fan of exposing the flexibility of the LSM to a discretionary access control model. That doesn't seem like a bad idea. And maybe using the filter engine to define the language to do this isn't a bad idea either. But I think that's a 'down the road' project, not something to hold up a better seccomp. Making it part of the LSM also avoids having to add this prctl(). Well, it would mean exposing some new language construct to every LSM (instead of a single prctl construct) and it would mean anyone wanting to use the interface would have to rely on the LSM implementing those hooks the way they need it. Honestly chrome can already get all of the benefits of this patch (given a perfectly coded kernel) and a whole lot more using SELinux, but (surprise surprise) not everyone uses SELinux. I think it's a good idea to expose a simple interface which will be widely enough adopted that many userspace applications can rely on it for hardening. The existence of the LSM and the fact that there exists multiple security modules that may or may not be enabled really leads application developers to be unable to rely on LSM for security. If linux had a single security model which everyone could rely on we wouldn't really have as big of an issue but that's not possible. So I'm advocating for this series which will provide a single useful change which applications can rely upon across distros and platforms to enhance the properties and abilities of the linux kernel. -Eric ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Fri, May 13, 2011 at 10:55 AM, Eric Paris epa...@redhat.com wrote: On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote: On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote: Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated... Which makes me wonder why this syscall hardening stuff is done outside of LSM? Why isn't is part of the LSM so that say SELinux can have a syscall bitmask per security context? I could do that, but I like Will's approach better. From the PoV of meeting security goals of information flow, data confidentiality, integrity, least priv, etc limiting on the syscall boundary doesn't make a lot of sense. You just don't know enough there to enforce these things. These are the types of goals that SELinux and other LSMs have previously tried to enforce. From the PoV of making the kernel more resistant to attacks and making a process more resistant to misbehavior I think that the syscall boundary is appropriate. Although I could do it in SELinux it don't really want to do it there. There's also the problem that there are no hooks per-system call for LSMs, only logical hooks that sometimes mirror system call names and are called after user data has been parsed. If system call enter hooks, like seccomp's, were added for LSMs, it would allow the lsm bitmask approach, but it still wouldn't satisfy the issues you raise below (and I wholeheartedly agree with). In case people are interested or confused let me give my definition of two words I've used a bit in these conversations: discretionary and mandatory. Any time I talk about a 'discretionary' security decision it is a security decisions that a process imposed upon itself. Aka the choice to use seccomp is discretionary. The choice to mark our own file u-wx is discretionary. This isn't the best definition but it's one that works well in this discussion. 
Mandatory security is one enforce by a global policy. It's what selinux is all about. SELinux doesn't give hoot what a process wants to do, it enforces a global policy from the top down. You take over a process, well, too bad, you still have no choice but to follow the mandatory policy. The LSM does NOT enforce a mandatory access control model, it's just how it's been used in the past. Ingo appears to me (please correct me if I'm wrong) to really be a fan of exposing the flexibility of the LSM to a discretionary access control model. That doesn't seem like a bad idea. And maybe using the filter engine to define the language to do this isn't a bad idea either. But I think that's a 'down the road' project, not something to hold up a better seccomp. Making it part of the LSM also avoids having to add this prctl(). Well, it would mean exposing some new language construct to every LSM (instead of a single prctl construct) and it would mean anyone wanting to use the interface would have to rely on the LSM implementing those hooks the way they need it. Honestly chrome can already get all of the benefits of this patch (given a perfectly coded kernel) and a whole lot more using SELinux, but (surprise surprise) not everyone uses SELinux. I think it's a good idea to expose a simple interface which will be widely enough adopted that many userspace applications can rely on it for hardening. The existence of the LSM and the fact that there exists multiple security modules that may or may not be enabled really leads application developers to be unable to rely on LSM for security. If linux had a single security model which everyone could rely on we wouldn't really have as big of an issue but that's not possible. So I'm advocating for this series which will provide a single useful change which applications can rely upon across distros and platforms to enhance the properties and abilities of the linux kernel. 
-Eric
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
Ok, i like the direction here, but i think the ABI should be done differently. In this patch the ftrace event filter mechanism is used:

* Will Drewry w...@chromium.org wrote:

+static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
+                                                   const char *filter_string)
+{
+        int err = -ENOMEM;
+        struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
+                                                GFP_KERNEL);
+        if (!filter)
+                goto fail;
+
+        INIT_HLIST_NODE(&filter->node);
+        filter->syscall_nr = syscall_nr;
+        filter->data = syscall_nr_to_meta(syscall_nr);
+
+        /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
+         * using a predicate at all.
+         */
+        if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
+                goto out;
+
+        /* Argument-based filtering only works on ftrace-hooked syscalls. */
+        if (!filter->data) {
+                err = -ENOSYS;
+                goto fail;
+        }
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+        err = ftrace_parse_filter(&filter->event_filter,
+                                  filter->data->enter_event->event.type,
+                                  filter_string);
+        if (err)
+                goto fail;
+#endif
+
+out:
+        return filter;
+
+fail:
+        kfree(filter);
+        return ERR_PTR(err);
+}

Via a prctl() ABI:

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
                 case PR_SET_ENDIAN:
                         error = SET_ENDIAN(me, arg2);
                         break;
-
                 case PR_GET_SECCOMP:
                         error = prctl_get_seccomp();
                         break;
                 case PR_SET_SECCOMP:
-                        error = prctl_set_seccomp(arg2);
+                        error = prctl_set_seccomp(arg2, arg3);
+                        break;
+                case PR_SET_SECCOMP_FILTER:
+                        error = prctl_set_seccomp_filter(arg2,
+                                                         (char __user *) arg3);
+                        break;
+                case PR_CLEAR_SECCOMP_FILTER:
+                        error = prctl_clear_seccomp_filter(arg2);
+                        break;
+                case PR_GET_SECCOMP_FILTER:
+                        error = prctl_get_seccomp_filter(arg2,
+                                                         (char __user *) arg3,
+                                                         arg4);

To restrict execution to system calls. Two observations:

1) We already have a specific ABI for this: you can set filters for events via an event fd. Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code?
This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ... kernel/trace/trace_event_filter.c is 2000 lines of tricky code that interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of mostly trivial code.

2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well? This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.

Note, the end result will be the same - just using a different ABI. So i really think the ABI itself should be more closely related to the event code. What this seccomp code does is that it uses specific syscall events to restrict execution of certain event generating codepaths, such as system calls.

Thanks, Ingo
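The decisions made by the quoted alloc_seccomp_filter() can be modelled in user space as a rough sketch. This is not code from the patch set: `model_classify_filter()`, `enum model_filter_kind` and the `has_metadata` flag are hypothetical names, the wildcard string "1" is an assumption (the quoted hunk only names the SECCOMP_WILDCARD_FILTER macro), and parsing of the filter expression itself is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Assumed wildcard string; the quoted patch only shows the macro name. */
#define MODEL_WILDCARD_FILTER "1"

enum model_filter_kind {
	MODEL_FILTER_WILDCARD,  /* syscall allowed regardless of its arguments */
	MODEL_FILTER_PREDICATE  /* syscall allowed only if a predicate matches */
};

/*
 * Mirrors the branches of the quoted function:
 *  - a wildcard string is accepted even without syscall metadata;
 *  - any other string needs metadata (an ftrace-hooked syscall),
 *    otherwise the result is -ENOSYS.
 * Returns 0 and sets *kind on success, or a negative errno value.
 */
int model_classify_filter(const char *filter_string, int has_metadata,
			  enum model_filter_kind *kind)
{
	assert(filter_string != NULL && kind != NULL);

	if (strcmp(MODEL_WILDCARD_FILTER, filter_string) == 0) {
		*kind = MODEL_FILTER_WILDCARD;
		return 0;
	}
	if (!has_metadata)
		return -ENOSYS;

	*kind = MODEL_FILTER_PREDICATE;
	return 0;
}
```

The point of the model is that only the predicate path depends on the trace event machinery; the wildcard path behaves like a plain per-syscall allow bit.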
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
Hi,

On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:

1) We already have a specific ABI for this: you can set filters for events via an event fd. Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code? This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ...

Would this require privileges to get the event fd to start with? If so, I would prefer to avoid that, since using prctl() as shown in the patch set won't require any privs.

-Kees
--
Kees Cook
Ubuntu Security Team
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* Kees Cook kees.c...@canonical.com wrote:

Hi, On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:

1) We already have a specific ABI for this: you can set filters for events via an event fd. Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code? This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ...

Would this require privileges to get the event fd to start with? [...]

No special privileges with the default perf_event_paranoid value.

[...] If so, I would prefer to avoid that, since using prctl() as shown in the patch set won't require any privs.

and we could also explicitly allow syscall events without any privileges, regardless of the setting of the 'perf_event_paranoid' config value. Obviously a sandboxing host process wants to run with as low privileges as it can.

Thanks, Ingo
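As a small aside for readers who want to check the setting being discussed: on Linux it is exposed as /proc/sys/kernel/perf_event_paranoid. The sketch below only reads and parses the value; `parse_paranoid_level()` and `read_paranoid_level()` are hypothetical helper names, and what each level permits depends on the kernel version and is not modelled here.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the textual sysctl value (for example "1\n"); returns 0 on success. */
int parse_paranoid_level(const char *text, int *level)
{
	char *end;
	long value;

	assert(text != NULL && level != NULL);

	errno = 0;
	value = strtol(text, &end, 10);
	if (errno != 0 || end == text || value < INT_MIN || value > INT_MAX)
		return -1;

	*level = (int)value;
	return 0;
}

/* Read the current setting; returns -1 if the file cannot be read or parsed. */
int read_paranoid_level(int *level)
{
	char buf[32];
	FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_paranoid_level(buf, level);
}
```

A sandboxing host could call read_paranoid_level() at startup to decide whether an event-fd based approach is even available before falling back to something else.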
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Wed, 11 May 2011, Will Drewry wrote:

+void seccomp_filter_log_failure(int syscall)
+{
+        printk(KERN_INFO
+                "%s[%d]: system call %d (%s) blocked at ip:%lx\n",
+                current->comm, task_pid_nr(current), syscall,
+                syscall_nr_to_name(syscall), KSTK_EIP(current));
+}

I think it'd be a good idea to utilize the audit facility here.

- James
--
James Morris jmor...@namei.org
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Thu, 12 May 2011, Ingo Molnar wrote:

2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?

Because the aim of this is to reduce the attack surface of the syscall interface. LSM is the correct level of abstraction for general security mediation, because it allows you to take into account all relevant security information in a race-free context.

This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.

I'd say it's a well-defined and readily understandable feature.

- James
--
James Morris jmor...@namei.org
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:

To restrict execution to system calls. Two observations:

1) We already have a specific ABI for this: you can set filters for events via an event fd. Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code? This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ... kernel/trace/trace_event_filter.c is 2000 lines of tricky code that interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of mostly trivial code.

2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well? This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.

Note, the end result will be the same - just using a different ABI. So i really think the ABI itself should be more closely related to the event code. What this seccomp code does is that it uses specific syscall events to restrict execution of certain event generating codepaths, such as system calls.

Thanks, Ingo

What's positive with that approach is that the code is all there already. Create a perf event for a given trace event, attach a filter to it. What needs to be added is an override of the effect of the filter. By default it's dropping the event, but there may be different flavours, including sending a signal. All in all, extending the current code to allow that looks trivial.

The negative points are that:

* trace events are supposed to stay passive and not act on the system, except doing some endpoint things like writing to a buffer. We can't call do_exit() from a tracepoint for example, preemption is disabled there.
* Also, is it actually relevant to extend that seccomp filtering to other events than syscalls?
Exposing kernel events to filtering actually sounds to me like it brings a new potential security issue. But with fine-grained restrictions this can probably be dealt with, especially if by default only syscalls can be filtered.
* I think Peter did not want to give such an active role to perf in the system.
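The existing mechanism referred to above ("create a perf event for a given trace event, attach a filter to it") looks roughly like this from user space. It is a sketch under assumptions: `tracepoint_attr_init()` and `open_filtered_event()` are hypothetical helper names, the tracepoint id is assumed to have been read by the caller from the tracing filesystem, and the filter here only decides which events are recorded. It does not deny or alter the system call, which is exactly the override that would have to be added.

```c
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill in an attr describing a tracepoint event with the given id. */
void tracepoint_attr_init(struct perf_event_attr *attr,
			  unsigned long long tp_id)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_TRACEPOINT;
	attr->config = tp_id;
	attr->disabled = 1; /* created disabled; enabling is left to the caller */
}

/*
 * Open the event on the calling task and attach a filter string.
 * Returns an fd, or -1 with errno set (for example when the caller
 * lacks the privileges required by the running kernel).
 */
int open_filtered_event(unsigned long long tp_id, const char *filter)
{
	struct perf_event_attr attr;
	int fd, saved;

	tracepoint_attr_init(&attr, tp_id);

	/* There is no libc wrapper for perf_event_open(2); use syscall(2). */
	fd = (int)syscall(SYS_perf_event_open, &attr,
			  0 /* calling task */, -1 /* any cpu */,
			  -1 /* no group */, 0UL /* flags */);
	if (fd < 0)
		return -1;

	if (ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

Whether open_filtered_event() succeeds for an unprivileged caller depends on the kernel's perf settings, which is the privilege question raised earlier in the thread.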
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
* James Morris jmor...@namei.org wrote:

On Thu, 12 May 2011, Ingo Molnar wrote:

2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?

Because the aim of this is to reduce the attack surface of the syscall interface.

What i suggest achieves the same, my argument is that we could aim it to be even more flexible and even more useful.

LSM is the correct level of abstraction for general security mediation, because it allows you to take into account all relevant security information in a race-free context.

I don't care about LSM though, i find it poorly designed. The approach implemented here, the ability for *unprivileged code* to define (the seeds of ...) flexible security policies, in a proper Linuxish way, which is inherited along the task parent/child hierarchy and which allows nesting etc. is a *lot* more flexible.

What Will implemented here is pretty huge in my opinion: it turns security from a root-only kind of weird hack into an essential component of its APIs, available to *any* app not just the select security policy/mechanism chosen by the distributor ... If implemented properly this could replace LSM in the long run.

As a prctl() hack bound to seccomp (which, by all means, is a natural extension to the current seccomp ABI, so perfectly fine if we only want that scope), that is much less likely to happen. And if we merge the seccomp interface prematurely then interest towards a more flexible approach will disappear, so either we do it properly now or it will take some time for someone to come around and do it ...

Also note that i do not consider the perf events ABI itself cast into stone - and we could very well add a new system call for this, independent of perf events. I just think that the seccomp scope itself is exciting but looks limited to what the real potential of this could be.
This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.

I'd say it's a well-defined and readily understandable feature.

Note, it was me who suggested this very event-filter-engine design a year ago, when the first submission still used a crude bitmap of allowed seccomp syscalls: http://lwn.net/Articles/332974/

Funnily enough, back then you wrote this:

I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes.

so when and how did your opinion of this scheme turn from it being an "endless parade of failed schemes" to it being a "well-defined and readily understandable feature"? :-)

The idea itself has not changed since last year, what happened is that the filter engine got a couple of new features and Will has separated it out and has implemented a working prototype for sandboxing. What i do here is to suggest *further* steps down the same road, now that we see that this scheme can indeed be used to implement sandboxing ... I think it's a valid line of inquiry.

Thanks, Ingo
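For contrast with the filter-string design, the "crude bitmap of allowed seccomp syscalls" mentioned above can be sketched in user space as follows. This is a minimal illustration, not code from either submission: `struct syscall_bitmap`, `bitmap_init()`, `bitmap_allow()`, `bitmap_permits()` and the `MODEL_NR_SYSCALLS` bound are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound on syscall numbers, for illustration only. */
#define MODEL_NR_SYSCALLS 512
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((MODEL_NR_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per syscall number: 1 = allowed, 0 = denied. */
struct syscall_bitmap {
	unsigned long bits[BITMAP_WORDS];
};

/* Start from "deny everything". */
void bitmap_init(struct syscall_bitmap *bm)
{
	assert(bm != NULL);
	memset(bm->bits, 0, sizeof(bm->bits));
}

/* Mark one syscall number as allowed; returns -1 if it is out of range. */
int bitmap_allow(struct syscall_bitmap *bm, int nr)
{
	assert(bm != NULL);
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return -1;
	bm->bits[(size_t)nr / BITS_PER_WORD] |= 1UL << ((size_t)nr % BITS_PER_WORD);
	return 0;
}

/* Returns 1 if the syscall number is allowed, 0 otherwise. */
int bitmap_permits(const struct syscall_bitmap *bm, int nr)
{
	assert(bm != NULL);
	if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return 0;
	return (int)((bm->bits[(size_t)nr / BITS_PER_WORD] >>
		      ((size_t)nr % BITS_PER_WORD)) & 1UL);
}
```

A bitmap like this can only answer yes or no per syscall number. It has no way to express a condition on the arguments, which is the gap the filter engine is meant to fill.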
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
[Thanks to everyone for the continued feedback and insights - I appreciate it!]

On Thu, May 12, 2011 at 8:01 AM, Ingo Molnar mi...@elte.hu wrote:

* James Morris jmor...@namei.org wrote:

On Thu, 12 May 2011, Ingo Molnar wrote:

2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?

Because the aim of this is to reduce the attack surface of the syscall interface.

What i suggest achieves the same, my argument is that we could aim it to be even more flexible and even more useful.

LSM is the correct level of abstraction for general security mediation, because it allows you to take into account all relevant security information in a race-free context.

I don't care about LSM though, i find it poorly designed. The approach implemented here, the ability for *unprivileged code* to define (the seeds of ...) flexible security policies, in a proper Linuxish way, which is inherited along the task parent/child hierarchy and which allows nesting etc. is a *lot* more flexible.

What Will implemented here is pretty huge in my opinion: it turns security from a root-only kind of weird hack into an essential component of its APIs, available to *any* app not just the select security policy/mechanism chosen by the distributor ... If implemented properly this could replace LSM in the long run.

As a prctl() hack bound to seccomp (which, by all means, is a natural extension to the current seccomp ABI, so perfectly fine if we only want that scope), that is much less likely to happen. And if we merge the seccomp interface prematurely then interest towards a more flexible approach will disappear, so either we do it properly now or it will take some time for someone to come around and do it ...

Also note that i do not consider the perf events ABI itself cast into stone - and we could very well add a new system call for this, independent of perf events.
I just think that the seccomp scope itself is exciting but looks limited to what the real potential of this could be.

I agree with you on many of these points! However, I don't think that the views around LSMs, perf/ftrace infrastructure, or the current seccomp filtering implementation are necessarily in conflict. Here is my understanding of how the different worlds fit together and where I see this patchset living, along with where I could see future work going. Perhaps I'm being a trifle naive, but here goes anyway:

1. LSMs provide a global mechanism for hooking security relevant events at a point where all the incoming user-sourced data has been preprocessed and moved into userspace. The hooks are called every time one of those boundaries is crossed.
2. Perf and the ftrace infrastructure provide global function tracing and system call hooks with direct access to the caller's registers (and memory).
3. seccomp (as it exists today) provides a global system call entry hook point with a binary per-process decision about whether to provide secure computing behavior.

When I boil that down to abstractions, I see:

A. Globally scoped: LSMs, ftrace/perf
B. Locally/process scoped: seccomp

The result of that logical equivalence is that I see room for:

I. A per-process, locally scoped security event hooking interface (the proposed changes in this patchset)
II. A globally scoped security event hooking interface _prior_ to argument processing
III. A globally scoped security event hooking interface _post_ argument processing

II and III could be reduced further if I assume that ftrace/perf provides (II) and a simple intermediary layer (hook entry/exit) provides the argument processing steps that then call out a global security policy system.

The driving motivation for this patchset is kernel attack surface reduction, but that need arises because we lack a process-scoped mechanism for making security decisions -- everything is global: creds/DAC, containers, LSM, etc.
Adding ftrace filtering to agl's original bitmask-seccomp proposal opens up the process-local security world. At present, it can limit the attack surface with simple binary filters or apply limited security policy through the use of filter strings.

Based on your mails, I see two main deficiencies in my proposed patchset:

a. Deep argument analysis: Any arguments that live in user memory need to be copied into the kernel, then checked, and substituted for the actual system call, then have the original pointers restored (when applicable) on system call exit. There is a large overhead here and the LSM hooks provide much of this support on a global level.
b. Lack of support for non-system call events.

For (a), if the long term view of ftrace/perf LSMs is that LSM-like functionality will live on top of the ftrace/perf infrastructure, then adding support for the intermediary layer to analyze arguments will come with time. It's also likely that for
Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
On Thu, 12 May 2011, Ingo Molnar wrote:

Funnily enough, back then you wrote this:

I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes.

so when and how did your opinion of this scheme turn from it being an "endless parade of failed schemes" to it being a "well-defined and readily understandable feature"? :-)

When it was defined in a way which limited its purpose to reducing the attack surface of the syscall interface.

- James
--
James Morris jmor...@namei.org