Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-29 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 But face it, you can argue until you're blue in the face,

That is not a technical argument though - and i considered and 
answered every valid technical argument made by you and Thomas.
You were either not able to or not willing to counter them.

 [...] but both tglx and I will NAK any and all patches that extend 
 perf/ftrace beyond the passive observing role.

The thing is, perf is *already* well beyond the 'passive observer' 
role: we already generate lots of 'action' in response to events. We 
generate notification signals, we write events - all of which can 
(and does) modify program behavior.

So what's your point? There's no passive observer role really - 
it's apparently just that you dislike this use of instrumentation 
while you approve of other uses.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-26 Thread Pavel Machek
  On Mon 2011-05-16 10:36:05, James Morris wrote:
 On Fri, 13 May 2011, Ingo Molnar wrote:
 How do you reason about the behavior of the system as a whole?
 
 
  I argue that this is the LSM and audit subsystems designed right: in the 
  long 
  run it could allow everything that LSM does at the moment - and so much 
  more 
  ...
 
 Now you're proposing a redesign of the security subsystem.  That's a 
 significant undertaking.
 
 In the meantime, we have a simple, well-defined enhancement to seccomp 
 which will be very useful to current users in reducing their kernel attack 
 surface.

Well, you can do the same with subterfugue, even without kernel
changes. But that's ptrace -- slow. (And it already shows that syscall
based filters are extremely tricky to configure).

If yu want speed, seccomp+server for non-permitted operations seems like 
reasonable way.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-26 Thread Ingo Molnar

* Pavel Machek pa...@ucw.cz wrote:

   On Mon 2011-05-16 10:36:05, James Morris wrote:
  On Fri, 13 May 2011, Ingo Molnar wrote:
  How do you reason about the behavior of the system as a whole?
  
  
   I argue that this is the LSM and audit subsystems designed right: in the 
   long 
   run it could allow everything that LSM does at the moment - and so much 
   more 
   ...
  
  Now you're proposing a redesign of the security subsystem.  That's a 
  significant undertaking.
  
  In the meantime, we have a simple, well-defined enhancement to seccomp 
  which will be very useful to current users in reducing their kernel attack 
  surface.
 
 Well, you can do the same with subterfugue, even without kernel 
 changes. But that's ptrace -- slow. (And it already shows that 
 syscall based filters are extremely tricky to configure).

Yes, if you use syscall based filters to implement access to 
underlying objects where the access methods do not capture essential 
lifetime events properly (such as files) they you'll quickly run into 
trouble achieving a secure solution.

But you can robustly use syscall filters to control the underlying 
primary *resource*: various pieces of kernel code with *negative* 
utility to the current app - which have no use to the app but pose 
risks in terms of potential exploits in them.

But you can use event filters to implement arbitrary security 
policies robustly.

For example file objects: if you generate the right events for a 
class of objects then you can control access to them very robustly.

It's not a surprise that this is what SELinux does primarily: it has 
lifetime event hooks at the inode object (and socket, packet, etc.) 
level and captures those access attempts and validates them against 
the permissions of that object, in light of the accessing task's 
credentials.

Exactly that can be done with Will's patch as well, if its potential 
scope of event-checking points is not stupidly limited to the syscall 
boundary alone ...

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-26 Thread Ingo Molnar

* Thomas Gleixner t...@linutronix.de wrote:

   We do _NOT_ make any decision based on the trace point so 
   what's the pre-existing active role in the syscall entry 
   code?
  
  The seccomp code we are discussing in this thread.
 
 That's proposed code and has absolutely nothing to do with the 
 existing trace point semantics.

So because it's proposed code it does not exist?

If the feature is accepted (and given Linus's opinion it's not clear 
at all it's accepted in any form) then it's obviously a very 
legitimate technical concern whether we do:

ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
if (ret)
return -EACCES;

... random code ...

trace_syscall_event(p1, p2, p3, p4, p5);

Where seccomp_check_syscall_event() duplicates much of the machinery 
that is behind trace_syscall_event().

Or we do the more intelligent:

ret = check_syscall_event(p1, p2, p3, p4, p5);
if (ret)
return -EACCES;

Where we have the happy side effects of:

  - less code at the call site

  - (a lot of!) shared infrastructure between the proposed seccomp 
code and event filters.

  - we'd also be able to trace at security check boundaries - which
has obvious bug analysis advantages.

In fact i do not see *any* advantages in keeping this needlessly 
bloaty and needlessly inconsistently sampled form of instrumentation:

ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5);
if (ret)
return -EACCES;

... random code ...

trace_syscall_event(p1, p2, p3, p4, p5);

Do you?

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-25 Thread Thomas Gleixner
On Tue, 24 May 2011, Ingo Molnar wrote:
 * Peter Zijlstra pet...@infradead.org wrote:
 
  On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
include/linux/ftrace_event.h  |4 +-
include/linux/perf_event.h|   10 +---
kernel/perf_event.c   |   49 
   +---
kernel/seccomp.c  |8 ++
kernel/trace/trace_syscalls.c |   27 +-
5 files changed, 82 insertions(+), 16 deletions(-) 
  
  I strongly oppose to the perf core being mixed with any sekurity voodoo
  (or any other active role for that matter).
 
 I'd object to invisible side-effects as well, and vehemently so. But note how 
 intelligently it's used here: it's explicit in the code, it's used explicitly 
 in kernel/seccomp.c and the event generation place in 
 kernel/trace/trace_syscalls.c.
 
 So this is a really flexible solution IMO and does not extend events with 
 some 
 invisible 'active' role. It extends the *call site* with an open-coded active 
 role - which active role btw. already pre-existed.

We do _NOT_ make any decision based on the trace point so what's the
pre-existing active role in the syscall entry code?

I'm all for code reuse and reuse of interfaces, but this is completely
wrong. Instrumentation and security decisions are two fundamentally
different things and we want them kept separate. Instrumentation is
not meant to make decisions. Just because we can does not mean that it
is a good idea.

So what the current approach does is:

 - abuse the existing ftrace syscall hook by adding a return value to
   the tracepoint.

   So we need to propagate that for every tracepoint just because we
   have a single user.

 - abuse the perf per task mechanism

   Just because we have per task context in perf does not mean that we
   pull everything and the world which requires per task context into
   perf. The security folks have per task context already so security
   related stuff wants to go there.

 - abuse the perf/ftrace interfaces

   One of the arguments was that perf and ftrace have permission which
   are not available from the existing security interfaces. That's not
   at all a good reason to abuse these interfaces. Let the security
   folks sort out the problem on their end and do not impose any
   expectations on perf/ftrace which we have to carry around forever.

Yes, it can be made working with a relatively small patch, but it has
a very nasty side effect: 

  You add another user space visible ABI to the existing perf/ftrace
  mess which needs to be supported forever.

Brilliant, we have already two ABIs (perf/ftrace) to support and at
the same time we urgently need to solve the problem of better
integration of those two. So adding a third completely unrelated
component with a guaranteed ABI is just making this even more complex.

We can factor out the filtering code and let the security dudes reuse
it for their own purposes. That makes them to have their own
interfaces and does not impose any restrictions upon the tracing/perf
ones. And really security stuff wants to be integrated into the
existing security frameworks and not duct taped into perf/trace just
because it's a conveniant hack around limitiations of the existing
security stuff.

You really should stop to see everything as a nail just because the
only tool you have handy is the perf hammer. perf is about
instrumentation and we don't want to violate the oldest principle of
unix to have simple tools which do one thing and do it good.

Even swiss army knifes have the restriction that you can use only one
tool at a time unless you want to stick the corkscrew through your
palm when you try to cut bread.

Thanks,

tglx
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-25 Thread Ingo Molnar

* Thomas Gleixner t...@linutronix.de wrote:

 On Tue, 24 May 2011, Ingo Molnar wrote:
  * Peter Zijlstra pet...@infradead.org wrote:
  
   On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
 include/linux/ftrace_event.h  |4 +-
 include/linux/perf_event.h|   10 +---
 kernel/perf_event.c   |   49 
+---
 kernel/seccomp.c  |8 ++
 kernel/trace/trace_syscalls.c |   27 +-
 5 files changed, 82 insertions(+), 16 deletions(-) 
   
   I strongly oppose to the perf core being mixed with any sekurity voodoo
   (or any other active role for that matter).
  
  I'd object to invisible side-effects as well, and vehemently so. But note 
  how 
  intelligently it's used here: it's explicit in the code, it's used 
  explicitly 
  in kernel/seccomp.c and the event generation place in 
  kernel/trace/trace_syscalls.c.
  
  So this is a really flexible solution IMO and does not extend events with 
  some 
  invisible 'active' role. It extends the *call site* with an open-coded 
  active 
  role - which active role btw. already pre-existed.
 
 We do _NOT_ make any decision based on the trace point so what's the
 pre-existing active role in the syscall entry code?

The seccomp code we are discussing in this thread.

 I'm all for code reuse and reuse of interfaces, but this is completely
 wrong. Instrumentation and security decisions are two fundamentally
 different things and we want them kept separate. Instrumentation is
 not meant to make decisions. Just because we can does not mean that it
 is a good idea.

Instrumentation does not 'make decisions': the calling site, which is 
already emitting both the event and wants to do decisions based on 
the data that also generates the event wants to do decisions.

Those decisions *will be made* and you cannot prevent that, the only 
open question is can it reuse code intelligently, which code it is 
btw. already calling for observation reasons?

( Note that pure observers wont be affected and note that pure 
  observation call sites are not affected either. )

 So what the current approach does is:
 
  - abuse the existing ftrace syscall hook by adding a return value to
the tracepoint.
 
So we need to propagate that for every tracepoint just because we
have a single user.

This is a technical criticism i share with you and i think it can be 
fixed - i outlined it to Will yesterday.

And no, if done cleanly it's not 'abuse' to reuse code. Could we wait 
for the first clean iteration of this patch instead of rushing 
judgement prematurely?

  - abuse the perf per task mechanism
 
Just because we have per task context in perf does not mean that we
pull everything and the world which requires per task context into
perf. The security folks have per task context already so security
related stuff wants to go there.

We do not pull 'everything and the world' in, but code that wants to 
process events in places that already emit events surely sounds 
related to me :-)

  - abuse the perf/ftrace interfaces
 
One of the arguments was that perf and ftrace have permission which
are not available from the existing security interfaces. That's not
at all a good reason to abuse these interfaces. Let the security
folks sort out the problem on their end and do not impose any
expectations on perf/ftrace which we have to carry around forever.
 
 Yes, it can be made working with a relatively small patch, but it has
 a very nasty side effect: 
 
   You add another user space visible ABI to the existing perf/ftrace
   mess which needs to be supported forever.

What mess? I'm not aware of a mess - other than the ftrace API which 
is not used by this patch.

 Brilliant, we have already two ABIs (perf/ftrace) to support and at
 the same time we urgently need to solve the problem of better
 integration of those two. So adding a third completely unrelated
 component with a guaranteed ABI is just making this even more complex.

So your solution is to add yet another ABI for seccomp and to keep 
seccomp a limited hack forever, just because you are not interested 
in security?

I think we want fewer ABIs and more flexible/reusable facilities.

 We can factor out the filtering code and let the security dudes 
 reuse it for their own purposes. That makes them to have their own 
 interfaces and does not impose any restrictions upon the 
 tracing/perf ones. And really security stuff wants to be integrated 
 into the existing security frameworks and not duct taped into 
 perf/trace just because it's a conveniant hack around limitiations 
 of the existing security stuff.

You are missing what i tried to point out in earlier discussions: 
from a security design POV this isnt just about the system call 
boundary. If this seccomp variant is based on events then it could 
grow proper security checks in other places as well, in places where 
we have some 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-25 Thread Peter Zijlstra
On Wed, 2011-05-25 at 17:01 +0200, Ingo Molnar wrote:
  We do _NOT_ make any decision based on the trace point so what's the
  pre-existing active role in the syscall entry code?
 
 The seccomp code we are discussing in this thread. 

That isn't pre-existing, that's proposed.

But face it, you can argue until you're blue in the face, but both tglx
and I will NAK any and all patches that extend perf/ftrace beyond the
passive observing role.

Your arguments appear to be as non-persuasive to us as ours are to you,
so please drop this endeavor and let the security folks sort it on their
own and let's get back to doing useful work. 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-25 Thread Thomas Gleixner
On Wed, 25 May 2011, Ingo Molnar wrote:
 * Thomas Gleixner t...@linutronix.de wrote:
  On Tue, 24 May 2011, Ingo Molnar wrote:
   * Peter Zijlstra pet...@infradead.org wrote:
   
On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
  include/linux/ftrace_event.h  |4 +-
  include/linux/perf_event.h|   10 +---
  kernel/perf_event.c   |   49 
 +---
  kernel/seccomp.c  |8 ++
  kernel/trace/trace_syscalls.c |   27 +-
  5 files changed, 82 insertions(+), 16 deletions(-) 

I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).
   
   I'd object to invisible side-effects as well, and vehemently so. But note 
   how 
   intelligently it's used here: it's explicit in the code, it's used 
   explicitly 
   in kernel/seccomp.c and the event generation place in 
   kernel/trace/trace_syscalls.c.
   
   So this is a really flexible solution IMO and does not extend events with 
   some 
   invisible 'active' role. It extends the *call site* with an open-coded 
   active 
   role - which active role btw. already pre-existed.
  
  We do _NOT_ make any decision based on the trace point so what's the
  pre-existing active role in the syscall entry code?
 
 The seccomp code we are discussing in this thread.

That's proposed code and has absolutely nothing to do with the
existing trace point semantics.
 
  I'm all for code reuse and reuse of interfaces, but this is completely
  wrong. Instrumentation and security decisions are two fundamentally
  different things and we want them kept separate. Instrumentation is
  not meant to make decisions. Just because we can does not mean that it
  is a good idea.
 
 Instrumentation does not 'make decisions': the calling site, which is 
 already emitting both the event and wants to do decisions based on 
 the data that also generates the event wants to do decisions.

You can repeat that as often as you want, it does not make it more
true. Fact is that the decision is made in the middle of the perf code.

 + /* Transition the task if required. */
 + if (ctx-type == task_context  event-attr.require_secure) {
 +#ifdef CONFIG_SECCOMP
 + /* Don't allow perf events to escape mode = 1. */
 +if (!current-seccomp.mode)
 + current-seccomp.mode = 2;
 +#endif
 + }
 
and further down

 + if (event-attr.err_on_discard)
 + ok = -EACCES;
 
 Those decisions *will be made* and you cannot prevent that, the only 
 open question is can it reuse code intelligently, which code it is 
 btw. already calling for observation reasons?

The tracepoint is called for observation reasons and now you make it a
decision function. That's what I call abuse.

 ( Note that pure observers wont be affected and note that pure 
   observation call sites are not affected either. )

Hahaha, they still have to run through the additional code when
seccomp is enabled and we still have to propagate the return value
down to the point where the tracepoint itself is. You call that not
affected?
 
  So what the current approach does is:
  
   - abuse the existing ftrace syscall hook by adding a return value to
 the tracepoint.
  
 So we need to propagate that for every tracepoint just because we
 have a single user.
 
 This is a technical criticism i share with you and i think it can be 
 fixed - i outlined it to Will yesterday.
 
 And no, if done cleanly it's not 'abuse' to reuse code. Could we wait 
 for the first clean iteration of this patch instead of rushing 
 judgement prematurely?

There is no way to do it cleanly. It always comes for the price that
you add additional code into the tracing code path. And there are
other people who try hard to remove stuff to recude the overhead which
is caused by instrumentation.

   - abuse the perf per task mechanism
  
 Just because we have per task context in perf does not mean that we
 pull everything and the world which requires per task context into
 perf. The security folks have per task context already so security
 related stuff wants to go there.
 
 We do not pull 'everything and the world' in, but code that wants to 
 process events in places that already emit events surely sounds 
 related to me :-)

We have enough places where different independent parts of the kernel
want to hook into for obvious reasons.

We have notifiers for those where performance does not matter much and
we have separate calls into the independent functions where it matters
or where we need to evaluate the results in specific ways.

So now you turn instrumentation into a security mechanism, which
works nicely for a particular purpose, i.e. decision on a particular
syscall number. Now, how do you make that work when a decision has to
be made on more than a simple match, e.g. syscall number + arguments ?

Not 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Will Drewry
On Thu, May 19, 2011 at 4:05 PM, Will Drewry w...@chromium.org wrote:
 On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt rost...@goodmis.org wrote:
 On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:

 Do event_* that return non-void exist in the tree at all now?  I've
 looked at the various tracepoint macros as well as some of the other
 handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
 places where a return value is honored nor could be.  At best, the
 perf_tp_event can be short-circuited it in the hlist_for_each, but
 it'd still need a way to bubble up a failure and result in not calling
 the trace/event that the hook precedes.

 No, none of the current trace hooks have return values. That was what I
 was talking about how to implement in my previous emails.

 Led on by complete ignorance, I think I'm finally beginning to untwist
 the different pieces of the tracing infrastructure.  Unfortunately,
 that means I took a few wrong turns along the way...

 I think function tracing looks something like this:

 ftrace_call has been injected into at a specific callsite.  Upon hit:
 1. ftrace_call triggers
 2. does some checks then calls ftrace_trace_function (via mcount instrs)
 3. ftrace_trace_function may be a single func or a list. For a list it
 will be: ftrace_list_func
 4. ftrace_list_func calls each registered hook for that function in a
 while loop ignoring return values
 5. registered hook funcs may then track the call, farm it out to
 specific sub-handlers, etc.

 This seems to be a red herring for my use case :/ though this helped
 me understand your back and forth (Ingo  Steve) regarding dynamic
 versus explicit events.


 System call tracing is done via kernel/tracepoint.c events fed in via
 arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter.  This
 yields direct sys_enter and sys_exit event sources (and an event class
 to hook up per-system call events).  This means that
 ftrace_syscall_enter() does the event prep prior to doing a filter
 check comparing the ftrace_event object for the given syscall_nr to
 the event data.  perf_sysenter_enter() is similar but it pushes the
 info over to perf_tp_event to be matched not against the global
 syscall event entry, but against any sub-event in the linked list on
 that syscall's event.

 Is that roughly an accurate representation of the two?  I wish I
 hadn't gotten distracted along the function path, but at least I
 learned something (and it is relevant to the larger scope of this
 thread's discussion).


 After doing that digging, it looks like providing hook call
 pre-emption and return value propagation will be a unique and fun task
 for each tracer and event subsystem.  If I look solely at tracepoints,
 a generic change would be to make the trace_##subsys function return
 an int (which I think was the event_vfs_getname proposal?).  The other
 option is to change the trace_sys_enter proto to include a 'int
 *retcode'.

 That change would allow the propagation of some sort of policy
 information.  To put it to use, seccomp mode 1 could be implemented on
 top of trace_syscalls.  The following changes would need to happen:
 1. dummy metadata should be inserted for all unhooked system calls
 2. perf_trace_buf_submit would need to return an int or a new
 TRACE_REG_SECCOMP_REGISTER handler would need to be setup in
 syscall_enter_register.
 3. If perf is abused, a kill/fail_on_discard bit would be added to
 event-attrs.
 4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches,
 'n' for the number of event matches, and -EACCES/? if a
 'fail_on_discard' event is seen.
 5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
 6. trace_sys_enter() would need to be moved to be the first entry
 arch/../kernel/ptrace.c for incoming syscalls
 7. if trace_sys_enter() yields a negative return code, then
 do_exit(SIGKILL) the process and return.

 Entering into seccomp mode 1 would require adding a  0 filter for
 every system call number (which is why we need a dummy event call for
 them since failing to check the bitmask can't be flagged
 fail_on_discard) with the fail_on_discard bit.  For the three calls
 that are allowed, a '1' filter would be set.

 That would roughly implement seccomp mode 1.  It's pretty ugly and the
 fact that every system call that's disallowed has to be blacklisted is
 not ideal.  An alternate model would be to just use the seccomp mode
 as we do today and let secure_computing() handle the return code of #
 of matches.  If it the # of matches is 0, it terminates. A
 'fail_on_discard' bit then would only be good to stop further
 tracepoint callback evaluation.  This approach would also *almost* nix
 the need to provide dummy syscall hooks.  (Since sigreturn isn't
 hooked on x86 because it uses ptregs fixup, a dummy would still be
 needed to apply a 1 filter to.)

 Even with that tweak to move to a whitelist model, the perf event
 evaluation and tracepoint callback ordering is 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Peter Zijlstra
On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
  include/linux/ftrace_event.h  |4 +-
  include/linux/perf_event.h|   10 +---
  kernel/perf_event.c   |   49 +---
  kernel/seccomp.c  |8 ++
  kernel/trace/trace_syscalls.c |   27 +-
  5 files changed, 82 insertions(+), 16 deletions(-) 

I strongly oppose to the perf core being mixed with any sekurity voodoo
(or any other active role for that matter).


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Thomas Gleixner
On Tue, 24 May 2011, Peter Zijlstra wrote:

 On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
   include/linux/ftrace_event.h  |4 +-
   include/linux/perf_event.h|   10 +---
   kernel/perf_event.c   |   49 
  +---
   kernel/seccomp.c  |8 ++
   kernel/trace/trace_syscalls.c |   27 +-
   5 files changed, 82 insertions(+), 16 deletions(-) 
 
 I strongly oppose to the perf core being mixed with any sekurity voodoo
 (or any other active role for that matter).

Amen. We have enough crap to cleanup in perf/ftrace already, so we
really do not need security magic added to it.

Thanks,

tglx
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Will Drewry
On Tue, May 24, 2011 at 11:25 AM, Thomas Gleixner t...@linutronix.de wrote:
 On Tue, 24 May 2011, Peter Zijlstra wrote:

 On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
   include/linux/ftrace_event.h  |    4 +-
   include/linux/perf_event.h    |   10 +---
   kernel/perf_event.c           |   49 
  +---
   kernel/seccomp.c              |    8 ++
   kernel/trace/trace_syscalls.c |   27 +-
   5 files changed, 82 insertions(+), 16 deletions(-)

 I strongly oppose to the perf core being mixed with any sekurity voodoo
 (or any other active role for that matter).

 Amen. We have enough crap to cleanup in perf/ftrace already, so we
 really do not need security magic added to it.

Thanks for the quick responses!

I agree, but I'm left a little bit lost now w.r.t. the comments around
reusing the ABI.  If perf doesn't make sense (which certainly seems
wrong from a security interface perspective), then the preexisting
ABIs I know of for this case are as follows:
- /sys/kernel/debug/tracing/*
- prctl(PR_SET_SECCOMP* (or /proc/...)

Both would require expansion.  The latter was reused by the original
patch series.  The former doesn't expose much in the way of per-task
event filtering -- ftrace_pids doesn't translate well to
ftrace_syscall_enter-based enforcement.  I'd expect we'd need
ftrace_event_call-task_events (like -perf_events), and either
explore them in ftrace_syscall_enter or add a new tracepoint handler,
ftrace_task_syscall_enter, via something like TRACE_REG_TASK_REGISTER.
 It could then do whatever it wanted with the successful or
unsuccessful matching against predicates, stacking or not, which could
be used for a seccomp-like mechanism.  However, bubbling that change
up to the non-existent interfaces in debug/tracing could be a
challenge too (Registration would require an alternate flow like perf
to call TRACE_REG_*? Do they become
tracing/events/subsystem/event/task/tid/filter_string_N? ...?).

This is all just a matter of programming... but at this point, I'm not
seeing the clear shared path forward.  Even with per-task ftrace
access in debug/tracing, that would introduce a reasonably large
change to the system and add a new ABI, albeit in debug/tracing.  If
the above (or whatever the right approach is) comes into existence,
then any prctl(PR_SET_SECCOMP) ABI could have the backend
implementation to modify the same data.  I'm not putting it like this
to say that I'm designing to be obsolete, but to show that the defined
interface wouldn't conflict if ftrace does overlap more in the future.
 Given the importance of a clearly defined interface for security
functionality, I'd be surprised to see all the pieces come together in
the near future in such a way that a transition would be immediately
possible -- I'm not even sure what the ftrace roadmap really is!

Would it be more desirable to put a system call filtering interface on
a miscdev (like /dev/syscall_filter) instead of in /proc or prctl (and
not reuse seccomp at all)?  I'm not clear what the onus is to justify
a change in the different ABI areas, but I see system call filtering
as an important piece of system security and would like to determine
if there is a viable path forward, or if this will need to be
revisited in another 2 years.

thanks again!
will
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
   include/linux/ftrace_event.h  |4 +-
   include/linux/perf_event.h|   10 +---
   kernel/perf_event.c   |   49 
  +---
   kernel/seccomp.c  |8 ++
   kernel/trace/trace_syscalls.c |   27 +-
   5 files changed, 82 insertions(+), 16 deletions(-) 
 
 I strongly oppose to the perf core being mixed with any sekurity voodoo
 (or any other active role for that matter).

I'd object to invisible side-effects as well, and vehemently so. But note how 
intelligently it's used here: it's explicit in the code, it's used explicitly 
in kernel/seccomp.c and the event generation place in 
kernel/trace/trace_syscalls.c.

So this is a really flexible solution IMO and does not extend events with some 
invisible 'active' role. It extends the *call site* with an open-coded active 
role - which active role btw. already pre-existed.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Ingo Molnar

* Will Drewry w...@chromium.org wrote:

 The change avoids defining a new trace call type or a huge number of internal 
 changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack 
 surface is non-trivial to verify, and I'm not sure if this ABI change makes 
 sense. It amounts to:
 
  include/linux/ftrace_event.h  |4 +-
  include/linux/perf_event.h|   10 +---
  kernel/perf_event.c   |   49 +---
  kernel/seccomp.c  |8 ++
  kernel/trace/trace_syscalls.c |   27 +-
  5 files changed, 82 insertions(+), 16 deletions(-)
 
 And can be found here: http://static.dataspill.org/perf_secure/v1/

Wow, i'm very impressed how few changes you needed to do to support this!

So, firstly, i don't think we should change perf_tp_event() at all - the 
'observer' codepaths should be unaffected.

But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that 
code like seccomp which wants to use event results can use.

Also, i'm not sure about the seccomp details and assumptions that were moved 
into the perf core. How about passing in a helper function to 
perf_tp_event_check(), where seccomp would define its seccomp specific helper 
function?

That looks sufficiently flexible. That helper function could be an 'extra 
filter' kind of thing, right?

Also, regarding the ABI and the attr.err_on_discard and attr.require_secure 
bits, they look a bit too specific as well.

attr.err_on_discard: with the filter helper function passed in this is probably 
not needed anymore, right?

attr.require_secure: this is basically used to *force* the creation of 
security-controlling filters, right? It seems to me that this could be done via 
a seccomp ABI extension as well, without adding this to the perf ABI. That 
seccomp call could check whether the right events are created and move the task 
to mode 2 only if that prereq is met - or something like that.

 If there is any interest at all, I can post it properly to this giant
 CC list. [...]

I'd suggest to trim the Cc: list aggressively - anyone interested in the 
discussion can pick it up on lkml - and i strongly suspect that most of the Cc: 
participants would want to be off the Cc: :-)

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Ingo Molnar

* Ingo Molnar mi...@elte.hu wrote:

 
 * Peter Zijlstra pet...@infradead.org wrote:
 
  On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
include/linux/ftrace_event.h  |4 +-
include/linux/perf_event.h|   10 +---
kernel/perf_event.c   |   49 
   +---
kernel/seccomp.c  |8 ++
kernel/trace/trace_syscalls.c |   27 +-
5 files changed, 82 insertions(+), 16 deletions(-) 
  
  I strongly oppose to the perf core being mixed with any sekurity voodoo
  (or any other active role for that matter).
 
 I'd object to invisible side-effects as well, and vehemently so. But note how 
 intelligently it's used here: it's explicit in the code, it's used explicitly 
 in kernel/seccomp.c and the event generation place in 
 kernel/trace/trace_syscalls.c.
 
 So this is a really flexible solution IMO and does not extend events with 
 some invisible 'active' role. It extends the *call site* with an open-coded 
 active role - which active role btw. already pre-existed.

Also see my other mail - i think this seccomp code is too tied in to the perf 
core and ABI - but this is fixable IMO.

The fundamental notion that a generator subsystem of events can use filter 
results as well (such as kernel/trace/trace_syscalls.c.) for its own purposes 
is pretty robust though.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-24 Thread Steven Rostedt
On Tue, 2011-05-24 at 22:08 +0200, Ingo Molnar wrote:
 * Will Drewry w...@chromium.org wrote:

 But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that 
 code like seccomp which wants to use event results can use.

We should name it something else. The perf_tp.. is a misnomer as it
has nothing to do with performance monitoring. dynamic_event_.. maybe,
as it is dynamic to the affect that we can use jump labels to enable or
disable it.

-- Steve


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-19 Thread Steven Rostedt
On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:

 Do event_* that return non-void exist in the tree at all now?  I've
 looked at the various tracepoint macros as well as some of the other
 handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
 places where a return value is honored nor could be.  At best, the
 perf_tp_event can be short-circuited it in the hlist_for_each, but
 it'd still need a way to bubble up a failure and result in not calling
 the trace/event that the hook precedes.

No, none of the current trace hooks have return values. That was what I
was talking about how to implement in my previous emails.

-- Steve


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-19 Thread Will Drewry
On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt rost...@goodmis.org wrote:
 On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:

 Do event_* that return non-void exist in the tree at all now?  I've
 looked at the various tracepoint macros as well as some of the other
 handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
 places where a return value is honored nor could be.  At best, the
 perf_tp_event can be short-circuited it in the hlist_for_each, but
 it'd still need a way to bubble up a failure and result in not calling
 the trace/event that the hook precedes.

 No, none of the current trace hooks have return values. That was what I
 was talking about how to implement in my previous emails.

Led on by complete ignorance, I think I'm finally beginning to untwist
the different pieces of the tracing infrastructure.  Unfortunately,
that means I took a few wrong turns along the way...

I think function tracing looks something like this:

ftrace_call has been injected into at a specific callsite.  Upon hit:
1. ftrace_call triggers
2. does some checks then calls ftrace_trace_function (via mcount instrs)
3. ftrace_trace_function may be a single func or a list. For a list it
will be: ftrace_list_func
4. ftrace_list_func calls each registered hook for that function in a
while loop ignoring return values
5. registered hook funcs may then track the call, farm it out to
specific sub-handlers, etc.

This seems to be a red herring for my use case :/ though this helped
me understand your back and forth (Ingo  Steve) regarding dynamic
versus explicit events.


System call tracing is done via kernel/tracepoint.c events fed in via
arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter.  This
yields direct sys_enter and sys_exit event sources (and an event class
to hook up per-system call events).  This means that
ftrace_syscall_enter() does the event prep prior to doing a filter
check comparing the ftrace_event object for the given syscall_nr to
the event data.  perf_sysenter_enter() is similar but it pushes the
info over to perf_tp_event to be matched not against the global
syscall event entry, but against any sub-event in the linked list on
that syscall's event.

Is that roughly an accurate representation of the two?  I wish I
hadn't gotten distracted along the function path, but at least I
learned something (and it is relevant to the larger scope of this
thread's discussion).


After doing that digging, it looks like providing hook call
pre-emption and return value propagation will be a unique and fun task
for each tracer and event subsystem.  If I look solely at tracepoints,
a generic change would be to make the trace_##subsys function return
an int (which I think was the event_vfs_getname proposal?).  The other
option is to change the trace_sys_enter proto to include a 'int
*retcode'.

That change would allow the propagation of some sort of policy
information.  To put it to use, seccomp mode 1 could be implemented on
top of trace_syscalls.  The following changes would need to happen:
1. dummy metadata should be inserted for all unhooked system calls
2. perf_trace_buf_submit would need to return an int or a new
TRACE_REG_SECCOMP_REGISTER handler would need to be setup in
syscall_enter_register.
3. If perf is abused, a kill/fail_on_discard bit would be added to
event-attrs.
4. perf_trace_buf_submit/perf_tp_event will return 0 for no matches,
'n' for the number of event matches, and -EACCES/? if a
'fail_on_discard' event is seen.
5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
6. trace_sys_enter() would need to be moved to be the first entry
arch/../kernel/ptrace.c for incoming syscalls
7. if trace_sys_enter() yields a negative return code, then
do_exit(SIGKILL) the process and return.

Entering into seccomp mode 1 would require adding a  0 filter for
every system call number (which is why we need a dummy event call for
them since failing to check the bitmask can't be flagged
fail_on_discard) with the fail_on_discard bit.  For the three calls
that are allowed, a '1' filter would be set.

That would roughly implement seccomp mode 1.  It's pretty ugly and the
fact that every system call that's disallowed has to be blacklisted is
not ideal.  An alternate model would be to just use the seccomp mode
as we do today and let secure_computing() handle the return code of #
of matches.  If it the # of matches is 0, it terminates. A
'fail_on_discard' bit then would only be good to stop further
tracepoint callback evaluation.  This approach would also *almost* nix
the need to provide dummy syscall hooks.  (Since sigreturn isn't
hooked on x86 because it uses ptregs fixup, a dummy would still be
needed to apply a 1 filter to.)

Even with that tweak to move to a whitelist model, the perf event
evaluation and tracepoint callback ordering is still not guaranteed.
Without changing tracepoint itself, all other TPs will still execute.
And for perf events, it'll be first-come-first-serve 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-18 Thread Will Drewry
On Tue, May 17, 2011 at 6:19 AM, Ingo Molnar mi...@elte.hu wrote:

 * Steven Rostedt rost...@goodmis.org wrote:

 On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
  * Steven Rostedt rost...@goodmis.org wrote:
 
   On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
* Steven Rostedt rost...@goodmis.org wrote:
   
 I'm a bit nervous about the 'active' role of (trace_)events, because 
 of the
 way multiple callbacks can be registered. How would:

       err = event_x();
       if (err == -EACCESS) {

 be handled? [...]
   
The default behavior would be something obvious: to trigger all 
callbacks and
use the first non-zero return value.
  
   But how do we know which callback that was from? There's no ordering of 
   what
   callbacks are called first.
 
  We do not have to know that - nor do the calling sites care in general. Do 
  you
  have some specific usecase in mind where the identity of the callback that
  generates a match matters?

 Maybe I'm confused. I was thinking that these event_*() are what we
 currently call trace_*(), but the event_*(), I assume, can return a
 value if a call back returns one.

 Yeah - and the call site can treat it as:

  - Ugh, if i get an error i need to abort whatever i was about to do

 or (more advanced future use):

  - If i get a positive value i need to re-evaluate the parameters that were
   passed in, they were changed

Do event_* that return non-void exist in the tree at all now?  I've
looked at the various tracepoint macros as well as some of the other
handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
places where a return value is honored nor could be.  At best, the
perf_tp_event can be short-circuited it in the hlist_for_each, but
it'd still need a way to bubble up a failure and result in not calling
the trace/event that the hook precedes.

Am I missing something really obvious?  I don't feel I've gotten a
good handle on exactly how all the tracing code gets triggered, so
perhaps I'm still a level (or three) too shallow. (I can see the asm
hooks for trace functions and I can see where that translates to
registered calls - like trace_function - but I don't see how the
hooked calls can be trivially aborted).

As is, I'm not sure how the perf and ftrace infrastructure could be
reused cleanly without a fair number of hacks to the interface and a
good bit of reworking.  I can already see a number of challenges
around reusing the sys_perf_event_open interface and the fact that
reimplementing something even as simple as seccomp mode=1 seems to
require a fair amount of tweaking to avoid from being leaky.  (E.g.,
enabling all TRACE_EVENT()s for syscalls will miss unhooked syscalls
so either acceptance matching needs to be propagated up the stack
along with some seccomp-like task modality or seccomp-on-perf would
have to depend on sys_enter events with syscall number predicate
matching and fail when a filter discard applies to all active events.)

At present, I'm leaning back towards the v2 series (plus the requested
minor changes) for the benefit of code clarity and its fail-secure
behavior.  Even just considering the reduced case of seccomp mode 1
being implemented on the shared infrastructure, I feel like I missing
something that makes it viable.  Any clues?

If not, I don't think a seccomp mode 2 interface via prctl would be
intractable if the long term movement is to a ftrace/perf backend - it
just means that the in-kernel code would change to wrap whatever the
final design ended up being.

Thanks and sorry if I'm being dense!

 Thus, we now have the ability to dynamically attach function calls to
 arbitrary points in the kernel that can have an affect on the code that
 called it. Right now, we only have the ability to attach function calls to
 these locations that have passive affects (tracing/profiling).

 Well, they can only have the effect that the calling site accepts and handles.
 So the 'effect' is not arbitrary and not defined by the callbacks, it is
 controlled and handled by the calling code.

 We do not want invisible side-effects, opaque hooks, etc.

 Instead of that we want (this is the getname() example i cited in the thread)
 explicit effects, like:

  if (event_vfs_getname(result))
        return ERR_PTR(-EPERM);

 But you say, nor do the calling sites care in general. Then what do
 these calling sites do with the return code? Are we limiting these
 actions to security only? Or can we have some other feature. [...]

 Yeah, not just security. One other example that came up recently is whether to
 panic the box on certain (bad) events such as NMI errors. This too could be
 made flexible via the event filter code: we already capture many events, so
 places that might conceivably do some policy could do so based on a filter
 condition.

This sounds great - I just wish I could figure out how it'd work :)

 [...] I can envision that we can make the Linux kernel quite dynamic here
 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-17 Thread Ingo Molnar

* Steven Rostedt rost...@goodmis.org wrote:

 On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
  * Steven Rostedt rost...@goodmis.org wrote:
  
   I'm a bit nervous about the 'active' role of (trace_)events, because of 
   the 
   way multiple callbacks can be registered. How would:
   
 err = event_x();
 if (err == -EACCESS) {
   
   be handled? [...]
  
  The default behavior would be something obvious: to trigger all callbacks 
  and 
  use the first non-zero return value.
 
 But how do we know which callback that was from? There's no ordering of what 
 callbacks are called first.

We do not have to know that - nor do the calling sites care in general. Do you 
have some specific usecase in mind where the identity of the callback that 
generates a match matters?

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-17 Thread Ingo Molnar

* Will Drewry w...@chromium.org wrote:

  This is *far* more generic still yields the same short-term end result as 
  far as your sandboxing is concerned.
 
 Almost :/ [...]

Hey that's a pretty good result from a subsystem that was not written with your 
usecase in mind *at all* ;-)

 [...]  I still need to review the code you've pointed out, but, at present, 
 the ftrace hooks occur after the seccomp and syscall auditing hooks.  This 
 means that that code is exposed no matter what in this model.  To trim the 
 exposed surface to userspace, we really need those early hooks.  While I can 
 see both hacky and less hacky approaches around this, it stills strikes me 
 that the seccomp thread flag and early interception are good to reuse.  One 
 option might be to allow seccomp to be a secure-syscall event source, but I 
 suspect that lands more on the hack-y side of the fence :)

Agreed, there should be no security compromise imposed on your usecase, at all.

You could move the event callback sooner into the syscall-entry sequence to 
make sure it's the highest priority thing to process?

There's no semantic dependency on its current location so this can be changed 
AFAICS.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-17 Thread Steven Rostedt
On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
 * Steven Rostedt rost...@goodmis.org wrote:
 
  On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
   * Steven Rostedt rost...@goodmis.org wrote:
   
I'm a bit nervous about the 'active' role of (trace_)events, because of 
the 
way multiple callbacks can be registered. How would:

err = event_x();
if (err == -EACCESS) {

be handled? [...]
   
   The default behavior would be something obvious: to trigger all callbacks 
   and 
   use the first non-zero return value.
  
  But how do we know which callback that was from? There's no ordering of 
  what 
  callbacks are called first.
 
 We do not have to know that - nor do the calling sites care in general. Do 
 you 
 have some specific usecase in mind where the identity of the callback that 
 generates a match matters?

Maybe I'm confused. I was thinking that these event_*() are what we
currently call trace_*(), but the event_*(), I assume, can return a
value if a call back returns one.

Thus, we now have the ability to dynamically attach function calls to
arbitrary points in the kernel that can have an affect on the code that
called it. Right now, we only have the ability to attach function calls
to these locations that have passive affects (tracing/profiling).

But you say, nor do the calling sites care in general. Then what do
these calling sites do with the return code? Are we limiting these
actions to security only? Or can we have some other feature. I can
envision that we can make the Linux kernel quite dynamic here with self
modifying code. That is, anywhere we have hooks, perhaps we could
replace them with dynamic switches (jump labels). Maybe events would not
be the best use, but they could be a generic one.

Knowing what callback returned the result would be beneficial. Right
now, you are saying if the call back return anything, just abort the
call, not knowing what callback was called.

-- Steve


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-17 Thread Ingo Molnar

* James Morris jmor...@namei.org wrote:

 On Tue, 17 May 2011, Ingo Molnar wrote:
 
  I'm not sure i get your point.
 
 Your example was not complete as described.  After an apparently simple 
 specification, you've since added several qualifiers and assumptions, [...]

I havent added any qualifiers really (i added examples/description), the opt-in 
method i mentioned in my first mail should be pretty robust:

 | Firstly, using the filter code i deny the various link creation syscalls so 
 | that sandboxed code cannot escape for example by creating a symlink to 
 | outside the permitted VFS namespace. (Note: we opt-in to syscalls, that way 
 | new syscalls added by new kernels are denied by defalt. The current symlink 
 | creation syscalls are not opted in to.)

 [...] and I still doubt that it's complete.

I could too claim that i doubt that the SELinux kernel implementation is 
secure!

So how about we both come up with specific examples about how it's not secure, 
instead of going down the fear-uncertainty-and-doubt road? ;-)

 A higher level goal would look like
 
 Allow a sandbox app access only to approved resources, to contain the 
 effects of flaws in the app, or similar.

I see what you mean.

I really think that restricting sandboxed code to only open files within a 
given VFS namespace boundary is the most useful highlevel description here - 
which is really a subset of a allow a sandbox app access only to an easily 
approved set of files highlevel concept.

There's no to contain ... bit here: *all* of the sandboxed app code is 
untrusted, so there's no 'remote attacker' and we do not limit our threat to 
flaws in the app. We want to contain apps to within a small subset of Linux 
functionality, and we want to do that within regular apps (without having to be 
superuser), full stop.

 Note that this includes a threat model (remote attacker taking control of the 
 app) and a general and fully stated strategy for dealing with it.

Attacker does not have to be remote - most sandboxing concepts protect against 
locally installed plugins/apps/applets. In sandboxing the whole app is 
considered untrusted - not just some flaw in it, abused remotely.

 From there, you can start to analyze how to implement the goal, at which 
 point you'd start thinking about configuration, assumptions, filesystem 
 access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared 
 memory, invocation).

Sandboxed code generally does not have access to anything fancy like that - if 
it is added then all possible side effects have to be examined.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Ingo Molnar

* Arnd Bergmann a...@arndb.de wrote:

 On Saturday 14 May 2011, Will Drewry wrote:
  Depending on integration, it could even be limited to ioctl commands
  that are appropriate to a known fd if the fd is opened prior to
  entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
  with a filter of 1 then narrowed through a later addition of
  something like (fd == %u  (cmd == %u || cmd == %u)) or something
  along those lines.
  
  Does that make sense?
 
 Thanks for the explanation. This sounds like it's already doing all
 we need.

One thing we could do more clearly here is to help keep the filter expressions 
symbolic - i.e. help resolve the various ioctl variants as well, not just the 
raw syscall parameter numbers.

But yes, access to the raw syscall parameters and the ability to filter them 
already gives us the ability to exclude/include specific ioctls in a rather 
flexible way.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Ingo Molnar

* Will Drewry w...@chromium.org wrote:

  Note, i'm not actually asking for the moon, a pony and more.
 
  I fully submit that we are yet far away from being able to do a full LSM 
  via this mechanism.
 
  What i'm asking for is that because the syscall point steps taken by Will 
  look very promising so it would be nice to do *that* in a slightly more 
  flexible scheme that does not condemn it to be limited to the syscall 
  boundary and such ...
 
 What do you suggest here?
 
 From my brief exploration of the ftrace/perf (and seccomp) code, I don't see 
 a clean way of integrating over the existing interfaces to the ftrace 
 framework (e.g., the global perf event pump seems to be a mismatch), but I 
 may be missing something obvious. [...]

Well, there's no global perf event pump. Here we'd obviously want to use 
buffer-less events that do no output (pumping) whatsoever - i.e. just counting 
events with filters attached.

What i suggest is to:

 - share syscall events# you are fine with that as your patch makes use 
of them
 - share the scripting engine  # you are fine with that as your patch makes use 
of it

 - share *any* other event to do_exit() a process at syscall exit time

 - share any other active event that kernel developers specifically enable 
   for active use to impact security-relevant execution even sooner than 
   syscall exit time - not just system calls

 - share the actual facility that manages (sets/gets) filters

So right now you have this structure for your new feature:

 Documentation/trace/seccomp_filter.txt |   75 +
 arch/x86/kernel/ldt.c  |5 
 arch/x86/kernel/tls.c  |7 
 fs/proc/array.c|   21 +
 fs/proc/base.c |   25 +
 include/linux/ftrace_event.h   |9 
 include/linux/seccomp.h|   98 +++
 include/linux/syscalls.h   |   54 ++--
 include/trace/syscall.h|6 
 kernel/Makefile|1 
 kernel/fork.c  |8 
 kernel/perf_event.c|7 
 kernel/seccomp.c   |  144 ++-
 kernel/seccomp_filter.c|  428 +
 kernel/sys.c   |   11 
 kernel/trace/Kconfig   |   10 
 kernel/trace/trace_events_filter.c |   60 ++--
 kernel/trace/trace_syscalls.c  |   96 ++-
 18 files changed, 986 insertions(+), 79 deletions(-)

Firstly, one specific problem i can see is that kernel/seccomp_filter.c 
hardcodes to the system call boundary. Which is fine to a prototype 
implementation (and obviously fine for something that builds upon seccomp) but 
not fine in terms of a flexible Linux security feature :-)

You have hardcoded these syscall assumptions via:

 struct seccomp_filter {
struct list_head list;
struct rcu_head rcu;
int syscall_nr;
struct syscall_metadata *data;
struct event_filter *event_filter;
 };

Which comes just because you chose to enumerate only syscall events - instead 
of enumerating all events.

Instead of that please bind to all events instead - syscalls are just one of 
the many events we have. Type 'perf list' and see how many event types we have, 
and quite a few could be enabled for 'active feedback' sandboxing as well.

Secondly, and this is really a variant of the first problem you have, the way 
you process event filter 'failures' is pretty limited. You utilize the regular 
seccomp method which works by calling into __secure_computing() and silently 
accepting syscalls or generating a hard do_exit() on even the slightest of 
filter failures.

Instead of that what we'd want to have is to have regular syscall events 
registered, *and* enabled for such active filtering purposes. The moment the 
filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and 
some other attribute in the task and the kernel would do a do_exit() at the 
earliest of opportunities before calling the syscall or at the 
return-from-syscall point latest.

Note that no seccomp specific code would have to execute here, we can already 
generate events both at syscall entry and at syscall exit points, the only new 
bit we'd need is for the 'kill the task' [or abort the syscall] policy action.

This is *far* more generic still yields the same short-term end result as far 
as your sandboxing is concerned.

What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'. 
We'd mark all syscall events as 'active' straight away.

[ Detail: these trace events return a return code, which the calling code can 
  use, that way event return values could be used sooner than syscall exit 
  points. IRQ code could make use of it as well, so for example this way we 
  could filter based early packet inspection, still in the IRQ code. ]

Then what your feature would do is to simply open up the events you are 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Ingo Molnar

* Will Drewry w...@chromium.org wrote:

 I agree with you on many of these points!  However, I don't think that the 
 views around LSMs, perf/ftrace infrastructure, or the current seccomp 
 filtering implementation are necessarily in conflict.  Here is my 
 understanding of how the different worlds fit together and where I see this 
 patchset living, along with where I could see future work going.  Perhaps I'm 
 being a trifle naive, but here goes anyway:
 
 1. LSMs provide a global mechanism for hooking security relevant
 events at a point where all the incoming user-sourced data has been
 preprocessed and moved into userspace.  The hooks are called every
 time one of those boundaries are crossed.

 2. Perf and the ftrace infrastructure provide global function tracing
 and system call hooks with direct access to the caller's registers
 (and memory).

No, perf events are not just global but per task as well. Nor are events 
limited to 'tracing' (generating a flow of events into a trace buffer) - they 
can just be themselves as well and count and generate callbacks.

The generic NMI watchdog uses that kind of event model for example, see 
kernel/watchdog.c and how it makes use of struct perf_event abstractions to do 
per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per 
task events and integrates it with the ptrace hw-breakpoints code.

Ideally Peter's one particular suggestion is right IMO and we'd want to be able 
for a perf_event to just be a list of callbacks, attached to a task and barely 
more than a discoverable, named notifier chain in its slimmest form.

In practice it's fatter than that right now, but we should definitely factor 
out that aspect of it more clearly, both code-wise and API-wise. 
kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is 
possible and desirable.

 3. seccomp (as it exists today) provides a global system call entry
 hook point with a binary per-process decision about whether to provide
 secure computing behavior.
 
 When I boil that down to abstractions, I see:
 A. Globally scoped: LSMs, ftrace/perf
 B. Locally/process scoped: seccomp

Ok, i see where you got the idea that you needed to cut your surface of 
abstraction at the filter engine / syscall enumeration level - i think you were 
thinking of it in the ftrace model of tracepoints, not in the perf model of 
events.

No, events are generic and as such per task as well, not just global.

I've replied to your other mail with more specific suggestions of how we could 
provide your feature using abstractions that share code more widely. Talking 
specifics will i hope help move the discussion forward! :-)

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Will Drewry
On Mon, May 16, 2011 at 7:55 AM, Ingo Molnar mi...@elte.hu wrote:

 * Will Drewry w...@chromium.org wrote:

 I agree with you on many of these points!  However, I don't think that the
 views around LSMs, perf/ftrace infrastructure, or the current seccomp
 filtering implementation are necessarily in conflict.  Here is my
 understanding of how the different worlds fit together and where I see this
 patchset living, along with where I could see future work going.  Perhaps I'm
 being a trifle naive, but here goes anyway:

 1. LSMs provide a global mechanism for hooking security relevant
 events at a point where all the incoming user-sourced data has been
 preprocessed and moved into userspace.  The hooks are called every
 time one of those boundaries are crossed.

 2. Perf and the ftrace infrastructure provide global function tracing
 and system call hooks with direct access to the caller's registers
 (and memory).

 No, perf events are not just global but per task as well. Nor are events
 limited to 'tracing' (generating a flow of events into a trace buffer) - they
 can just be themselves as well and count and generate callbacks.

I was looking at the perf_sysenter_enable() call, but clearly there is
more going on :)

 The generic NMI watchdog uses that kind of event model for example, see
 kernel/watchdog.c and how it makes use of struct perf_event abstractions to do
 per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per
 task events and integrates it with the ptrace hw-breakpoints code.

 Ideally Peter's one particular suggestion is right IMO and we'd want to be 
 able
 for a perf_event to just be a list of callbacks, attached to a task and barely
 more than a discoverable, named notifier chain in its slimmest form.

 In practice it's fatter than that right now, but we should definitely factor
 out that aspect of it more clearly, both code-wise and API-wise.
 kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is
 possible and desirable.

 3. seccomp (as it exists today) provides a global system call entry
 hook point with a binary per-process decision about whether to provide
 secure computing behavior.

 When I boil that down to abstractions, I see:
 A. Globally scoped: LSMs, ftrace/perf
 B. Locally/process scoped: seccomp

 Ok, i see where you got the idea that you needed to cut your surface of
 abstraction at the filter engine / syscall enumeration level - i think you 
 were
 thinking of it in the ftrace model of tracepoints, not in the perf model of
 events.

 No, events are generic and as such per task as well, not just global.

 I've replied to your other mail with more specific suggestions of how we could
 provide your feature using abstractions that share code more widely. Talking
 specifics will i hope help move the discussion forward! :-)

Agreed.  I'll digest both the watchdog code as well as your other
comments and follow up when I have a fuller picture in my head.

(I have a few initial comments I'll post in response to your other mail.)

Thanks!
will
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Ingo Molnar

* James Morris jmor...@namei.org wrote:

 On Fri, 13 May 2011, Ingo Molnar wrote:
 
  Say i'm a user-space sandbox developer who wants to enforce that sandboxed 
  code should only be allowed to open files in /home/sandbox/, /lib/ and 
  /usr/lib/.
  
  It is a simple and sensible security feature, agreed? It allows most code 
  to run well and link to countless libraries - but no access to other files 
  is allowed.
 
 Not really.
 
 Firstly, what is the security goal of these restrictions? [...]

To do what i described above? Namely:

  Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
   and /usr/lib/ 

 [...]  Then, are the restrictions complete and unbypassable?

If only the system calls i mentioned are allowed, and if the sandboxed VFS 
namespace itself is isolated from the rest of the system (no bind mounts, no 
hard links outside the sandbox, etc.) then its goal is to not be bypassable - 
what use is a sandbox if the sandbox can be bypassed by the sandboxed code?

There's a few ways how to alter (and thus bypass) VFS namespace lookups: 
symlinks, chdir, chroot, rename, etc., which (as i mentioned) have to be 
excluded by default or filtered as well.

 How do you reason about the behavior of the system as a whole?

For some usecases i mainly want to reason about what the sandboxed code can do 
and can not do, within a fairly static and limited VFS namespace environment.

I might not want to have a full-blown 'physical barrier' for all objects 
labeled as inaccessible to sandboxed code (or labeled as accessible to 
sandboxed code).

Especially as manipulating file labels is not also slow (affects all files) but 
is also often an exclusively privileged operation even for owned files, for no 
good reason. For things like /lib/ and /usr/lib/ it also *has* to be a 
privileged operation.

  I argue that this is the LSM and audit subsystems designed right: in the 
  long run it could allow everything that LSM does at the moment - and so 
  much more ...
 
 Now you're proposing a redesign of the security subsystem.  That's a 
 significant undertaking.

It certainly is.

 In the meantime, we have a simple, well-defined enhancement to seccomp which 
 will be very useful to current users in reducing their kernel attack surface.
 
 We should merge that, and the security subsystem discussion can carry on 
 separately.

Is that the development and merge process along which the LSM subsystem got 
into its current state?

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Will Drewry
On Mon, May 16, 2011 at 10:26 AM, Steven Rostedt rost...@goodmis.org wrote:
 Sorry to be absent from this thread so far, I just got back from my
 travels and I'm now catching up on email.


 On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:

 diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
 index 377a7a5..22e1668 100644
 --- a/arch/arm/Kconfig
 +++ b/arch/arm/Kconfig
 @@ -1664,6 +1664,16 @@ config SECCOMP
         and the task is only allowed to execute a few safe syscalls
         defined by each seccomp mode.

 +config SECCOMP_FILTER
 +     bool Enable seccomp-based system call filtering
 +     depends on SECCOMP  EXPERIMENTAL
 +     help
 +       Per-process, inherited system call filtering using shared code
 +       across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +       is not available, enhanced filters will not be available.
 +
 +       See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  config CC_STACKPROTECTOR
       bool Enable -fstack-protector buffer overflow detection 
 (EXPERIMENTAL)
       depends on EXPERIMENTAL
 diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
 index eccdefe..7641ee9 100644
 --- a/arch/microblaze/Kconfig
 +++ b/arch/microblaze/Kconfig
 @@ -129,6 +129,16 @@ config SECCOMP

         If unsure, say Y. Only embedded should say N here.

 +config SECCOMP_FILTER
 +     bool Enable seccomp-based system call filtering
 +     depends on SECCOMP  EXPERIMENTAL
 +     help
 +       Per-process, inherited system call filtering using shared code
 +       across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +       is not available, enhanced filters will not be available.
 +
 +       See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  endmenu

  menu Advanced setup
 diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
 index 8e256cc..fe4cbda 100644
 --- a/arch/mips/Kconfig
 +++ b/arch/mips/Kconfig
 @@ -2245,6 +2245,16 @@ config SECCOMP

         If unsure, say Y. Only embedded should say N here.

 +config SECCOMP_FILTER
 +     bool Enable seccomp-based system call filtering
 +     depends on SECCOMP  EXPERIMENTAL
 +     help
 +       Per-process, inherited system call filtering using shared code
 +       across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +       is not available, enhanced filters will not be available.
 +
 +       See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  config USE_OF
       bool Flattened Device Tree support
       select OF
 diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
 index 8f4d50b..83499e4 100644
 --- a/arch/powerpc/Kconfig
 +++ b/arch/powerpc/Kconfig
 @@ -605,6 +605,16 @@ config SECCOMP

         If unsure, say Y. Only embedded should say N here.

 +config SECCOMP_FILTER
 +     bool Enable seccomp-based system call filtering
 +     depends on SECCOMP  EXPERIMENTAL
 +     help
 +       Per-process, inherited system call filtering using shared code
 +       across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +       is not available, enhanced filters will not be available.
 +
 +       See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  endmenu

  config ISA_DMA_API
 diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
 index 2508a6f..2777515 100644
 --- a/arch/s390/Kconfig
 +++ b/arch/s390/Kconfig
 @@ -614,6 +614,16 @@ config SECCOMP

         If unsure, say Y.

 +config SECCOMP_FILTER
 +     bool Enable seccomp-based system call filtering
 +     depends on SECCOMP  EXPERIMENTAL
 +     help
 +       Per-process, inherited system call filtering using shared code
 +       across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +       is not available, enhanced filters will not be available.
 +
 +       See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  endmenu

  menu Power Management
 diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
 index 4b89da2..00c1521 100644
 --- a/arch/sh/Kconfig
 +++ b/arch/sh/Kconfig
 @@ -676,6 +676,16 @@ config SECCOMP

         If unsure, say N.

 +config SECCOMP_FILTER
 +     bool Enable seccomp-based system call filtering
 +     depends on SECCOMP  EXPERIMENTAL
 +     help
 +       Per-process, inherited system call filtering using shared code
 +       across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +       is not available, enhanced filters will not be available.
 +
 +       See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  config SMP
       bool Symmetric multi-processing support
       depends on SYS_SUPPORTS_SMP
 diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
 index e560d10..5b42255 100644
 --- a/arch/sparc/Kconfig
 +++ b/arch/sparc/Kconfig
 @@ -270,6 +270,16 @@ config SECCOMP

         If unsure, say Y. Only embedded should say N here.

 +config SECCOMP_FILTER
 +     bool Enable seccomp-based system call filtering
 +     depends on SECCOMP  EXPERIMENTAL
 +     help
 +       Per-process, inherited system call filtering using shared code
 + 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Will Drewry
On Mon, May 16, 2011 at 7:43 AM, Ingo Molnar mi...@elte.hu wrote:

 * Will Drewry w...@chromium.org wrote:

  Note, i'm not actually asking for the moon, a pony and more.
 
  I fully submit that we are yet far away from being able to do a full LSM
  via this mechanism.
 
  What i'm asking for is that because the syscall point steps taken by Will
  look very promising so it would be nice to do *that* in a slightly more
  flexible scheme that does not condemn it to be limited to the syscall
  boundary and such ...

 What do you suggest here?

 From my brief exploration of the ftrace/perf (and seccomp) code, I don't see
 a clean way of integrating over the existing interfaces to the ftrace
 framework (e.g., the global perf event pump seems to be a mismatch), but I
 may be missing something obvious. [...]

 Well, there's no global perf event pump. Here we'd obviously want to use
 buffer-less events that do no output (pumping) whatsoever - i.e. just counting
 events with filters attached.

Cool - I missed these entirely.  I'll get reading :)

 What i suggest is to:

  - share syscall events        # you are fine with that as your patch makes 
 use of them
  - share the scripting engine  # you are fine with that as your patch makes 
 use of it

  - share *any* other event to do_exit() a process at syscall exit time

  - share any other active event that kernel developers specifically enable
   for active use to impact security-relevant execution even sooner than
   syscall exit time - not just system calls

  - share the actual facility that manages (sets/gets) filters

These make sense to me at a high level.  I'll throw in a few initial
comments, but I'll be back for a round-two once I catch up on the rest
of the perf code.

 So right now you have this structure for your new feature:

  Documentation/trace/seccomp_filter.txt |   75 +
  arch/x86/kernel/ldt.c                  |    5
  arch/x86/kernel/tls.c                  |    7
  fs/proc/array.c                        |   21 +
  fs/proc/base.c                         |   25 +
  include/linux/ftrace_event.h           |    9
  include/linux/seccomp.h                |   98 +++
  include/linux/syscalls.h               |   54 ++--
  include/trace/syscall.h                |    6
  kernel/Makefile                        |    1
  kernel/fork.c                          |    8
  kernel/perf_event.c                    |    7
  kernel/seccomp.c                       |  144 ++-
  kernel/seccomp_filter.c                |  428 
 +
  kernel/sys.c                           |   11
  kernel/trace/Kconfig                   |   10
  kernel/trace/trace_events_filter.c     |   60 ++--
  kernel/trace/trace_syscalls.c          |   96 ++-
  18 files changed, 986 insertions(+), 79 deletions(-)

 Firstly, one specific problem i can see is that kernel/seccomp_filter.c
 hardcodes to the system call boundary. Which is fine to a prototype
 implementation (and obviously fine for something that builds upon seccomp) but
 not fine in terms of a flexible Linux security feature :-)

 You have hardcoded these syscall assumptions via:

  struct seccomp_filter {
        struct list_head list;
        struct rcu_head rcu;
        int syscall_nr;
        struct syscall_metadata *data;
        struct event_filter *event_filter;
  };

(This structure is a bit different in the new rev of the patch, but
nothing relevant to this specific part of the discussion :)

 Which comes just because you chose to enumerate only syscall events - instead
 of enumerating all events.

While I'm willing to live with the tradeoff, using ftrace event
numbers from FTRACE_SYSCALLS means that the filter will be unable to
hook a fair number of syscalls: execve, clone, etc  (all the ptregs
fixup syscalls on x86) and anything that returns int instead of long
(on x86).  Though the last two patches in the initial patch series
provided a proposed clean up for the latter :/

The current revision of the seccomp filter patch can function in a
bitmask-like state when CONFIG_FTRACE is unset or
CONFIG_FTRACE_SYSCALLS is unset.  This also means any platform with
CONFIG_SECCOMP support can start using this right away, but I realize
that is more of a short-term gain rather than a long-term one.

 Instead of that please bind to all events instead - syscalls are just one of
 the many events we have. Type 'perf list' and see how many event types we 
 have,
 and quite a few could be enabled for 'active feedback' sandboxing as well.

 Secondly, and this is really a variant of the first problem you have, the way
 you process event filter 'failures' is pretty limited. You utilize the regular
 seccomp method which works by calling into __secure_computing() and silently
 accepting syscalls or generating a hard do_exit() on even the slightest of
 filter failures.

 Instead of that what we'd want to have is to have regular syscall events
 registered, *and* enabled for such active filtering purposes. 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Steven Rostedt
Sorry to be absent from this thread so far, I just got back from my
travels and I'm now catching up on email.


On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:

 diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
 index 377a7a5..22e1668 100644
 --- a/arch/arm/Kconfig
 +++ b/arch/arm/Kconfig
 @@ -1664,6 +1664,16 @@ config SECCOMP
 and the task is only allowed to execute a few safe syscalls
 defined by each seccomp mode.
  
 +config SECCOMP_FILTER
 + bool Enable seccomp-based system call filtering
 + depends on SECCOMP  EXPERIMENTAL
 + help
 +   Per-process, inherited system call filtering using shared code
 +   across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +   is not available, enhanced filters will not be available.
 +
 +   See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  config CC_STACKPROTECTOR
   bool Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)
   depends on EXPERIMENTAL
 diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
 index eccdefe..7641ee9 100644
 --- a/arch/microblaze/Kconfig
 +++ b/arch/microblaze/Kconfig
 @@ -129,6 +129,16 @@ config SECCOMP
  
 If unsure, say Y. Only embedded should say N here.
  
 +config SECCOMP_FILTER
 + bool Enable seccomp-based system call filtering
 + depends on SECCOMP  EXPERIMENTAL
 + help
 +   Per-process, inherited system call filtering using shared code
 +   across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +   is not available, enhanced filters will not be available.
 +
 +   See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  endmenu
  
  menu Advanced setup
 diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
 index 8e256cc..fe4cbda 100644
 --- a/arch/mips/Kconfig
 +++ b/arch/mips/Kconfig
 @@ -2245,6 +2245,16 @@ config SECCOMP
  
 If unsure, say Y. Only embedded should say N here.
  
 +config SECCOMP_FILTER
 + bool Enable seccomp-based system call filtering
 + depends on SECCOMP  EXPERIMENTAL
 + help
 +   Per-process, inherited system call filtering using shared code
 +   across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +   is not available, enhanced filters will not be available.
 +
 +   See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  config USE_OF
   bool Flattened Device Tree support
   select OF
 diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
 index 8f4d50b..83499e4 100644
 --- a/arch/powerpc/Kconfig
 +++ b/arch/powerpc/Kconfig
 @@ -605,6 +605,16 @@ config SECCOMP
  
 If unsure, say Y. Only embedded should say N here.
  
 +config SECCOMP_FILTER
 + bool Enable seccomp-based system call filtering
 + depends on SECCOMP  EXPERIMENTAL
 + help
 +   Per-process, inherited system call filtering using shared code
 +   across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +   is not available, enhanced filters will not be available.
 +
 +   See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  endmenu
  
  config ISA_DMA_API
 diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
 index 2508a6f..2777515 100644
 --- a/arch/s390/Kconfig
 +++ b/arch/s390/Kconfig
 @@ -614,6 +614,16 @@ config SECCOMP
  
 If unsure, say Y.
  
 +config SECCOMP_FILTER
 + bool Enable seccomp-based system call filtering
 + depends on SECCOMP  EXPERIMENTAL
 + help
 +   Per-process, inherited system call filtering using shared code
 +   across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +   is not available, enhanced filters will not be available.
 +
 +   See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  endmenu
  
  menu Power Management
 diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
 index 4b89da2..00c1521 100644
 --- a/arch/sh/Kconfig
 +++ b/arch/sh/Kconfig
 @@ -676,6 +676,16 @@ config SECCOMP
  
 If unsure, say N.
  
 +config SECCOMP_FILTER
 + bool Enable seccomp-based system call filtering
 + depends on SECCOMP  EXPERIMENTAL
 + help
 +   Per-process, inherited system call filtering using shared code
 +   across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
 +   is not available, enhanced filters will not be available.
 +
 +   See Documentation/prctl/seccomp_filter.txt for more detail.
 +
  config SMP
   bool Symmetric multi-processing support
   depends on SYS_SUPPORTS_SMP
 diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
 index e560d10..5b42255 100644
 --- a/arch/sparc/Kconfig
 +++ b/arch/sparc/Kconfig
 @@ -270,6 +270,16 @@ config SECCOMP
  
 If unsure, say Y. Only embedded should say N here.
  
 +config SECCOMP_FILTER
 + bool Enable seccomp-based system call filtering
 + depends on SECCOMP  EXPERIMENTAL
 + help
 +   Per-process, inherited system call filtering using shared code
 +   across seccomp and ftrace_syscalls.  If 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Ingo Molnar

* Steven Rostedt rost...@goodmis.org wrote:

 I'm a bit nervous about the 'active' role of (trace_)events, because of the 
 way multiple callbacks can be registered. How would:
 
   err = event_x();
   if (err == -EACCESS) {
 
 be handled? [...]

The default behavior would be something obvious: to trigger all callbacks and 
use the first non-zero return value.

 [...] Would we need a way to prioritize which call back gets the return 
 value? One way I guess would be to add a check_event option, where you pass 
 in an ENUM of the event you want:
 
   event_x();
   err = check_event_x(MYEVENT);
 
 If something registered itself as MYEVENT to event_x, then you get the 
 return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or 
 something could be returned. I'm sure we could even optimize it such a way if 
 no active events have been registered to event_x, that check_event_x() will 
 return -ENODEV without any branches.

I would keep it simple and extensible - that way we can complicate it when the 
need arises! :)

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread Steven Rostedt
On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
 * Steven Rostedt rost...@goodmis.org wrote:
 
  I'm a bit nervous about the 'active' role of (trace_)events, because of the 
  way multiple callbacks can be registered. How would:
  
  err = event_x();
  if (err == -EACCESS) {
  
  be handled? [...]
 
 The default behavior would be something obvious: to trigger all callbacks and 
 use the first non-zero return value.


But how do we know which callback that was from? There's no ordering of
what callbacks are called first.

 
  [...] Would we need a way to prioritize which call back gets the return 
  value? One way I guess would be to add a check_event option, where you pass 
  in an ENUM of the event you want:
  
  event_x();
  err = check_event_x(MYEVENT);
  
  If something registered itself as MYEVENT to event_x, then you get the 
  return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or 
  something could be returned. I'm sure we could even optimize it such a way 
  if 
  no active events have been registered to event_x, that check_event_x() will 
  return -ENODEV without any branches.
 
 I would keep it simple and extensible - that way we can complicate it when 
 the 
 need arises! :)

The above is rather trivial to implement. I don't think it complicates
anything.

-- Steve


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-16 Thread James Morris
On Mon, 16 May 2011, Ingo Molnar wrote:

  Not really.
  
  Firstly, what is the security goal of these restrictions? [...]
 
 To do what i described above? Namely:
 
   Sandboxed code should only be allowed to open files in /home/sandbox/, 
 /lib/
and /usr/lib/ 

These are access rules, they don't really describe a high-level security 
goal.  How do you know it's ok to open everything in these directories?


- James
-- 
James Morris
jmor...@namei.org
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-15 Thread Arnd Bergmann
On Saturday 14 May 2011, Will Drewry wrote:
 Depending on integration, it could even be limited to ioctl commands
 that are appropriate to a known fd if the fd is opened prior to
 entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
 with a filter of 1 then narrowed through a later addition of
 something like (fd == %u  (cmd == %u || cmd == %u)) or something
 along those lines.
 
 Does that make sense?

Thanks for the explanation. This sounds like it's already doing all
we need.

Arnd
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-15 Thread James Morris
On Fri, 13 May 2011, Ingo Molnar wrote:

 Say i'm a user-space sandbox developer who wants to enforce that sandboxed 
 code 
 should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.
 
 It is a simple and sensible security feature, agreed? It allows most code to 
 run well and link to countless libraries - but no access to other files is 
 allowed.

Not really.

Firstly, what is the security goal of these restrictions?  Then, are the 
restrictions complete and unbypassable?

How do you reason about the behavior of the system as a whole?


 I argue that this is the LSM and audit subsystems designed right: in the long 
 run it could allow everything that LSM does at the moment - and so much more 
 ...

Now you're proposing a redesign of the security subsystem.  That's a 
significant undertaking.

In the meantime, we have a simple, well-defined enhancement to seccomp 
which will be very useful to current users in reducing their kernel attack 
surface.

We should merge that, and the security subsystem discussion can carry on 
separately.


- James
-- 
James Morris
jmor...@namei.org
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-14 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
  this is a security mechanism
 
 Who says? [...]

Kernel developers/maintainers of the affected code.

We have security hooks all around the kernel, which can deny/accept execution 
at various key points, but we do not have 'execute arbitrary user-space defined 
(safe) scripts' callbacks in general.

But yes, if a particular callback point is defined widely enough to allow much 
bigger intervention into the flow of execution, then more is possible as well.

 [...] and why would you want to unify two separate concepts only to them 
 limit it to security that just doesn't make sense.

I don't limit them to security - the callbacks themselves are either for 
passive observation or, at most, for security accept/deny callbacks.

It's decided by the subsystem maintainers what kind of user-space control power 
(or observation power) they want to allow, not me.

I would just like to not stop the facility itself at the 'observe only' level, 
like you suggest.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-14 Thread Will Drewry
On Sat, May 14, 2011 at 2:30 AM, Ingo Molnar mi...@elte.hu wrote:

 * Eric Paris epa...@redhat.com wrote:

 [dropping microblaze and roland]

 lOn Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
  * James Morris jmor...@namei.org wrote:

  It is a simple and sensible security feature, agreed? It allows most code 
  to
  run well and link to countless libraries - but no access to other files is
  allowed.

 It's simple enough and sounds reasonable, but you can read all the discussion
 about AppArmour why many people don't really think it's the best. [...]

 I have to say most of the participants of the AppArmour flamefests were dead
 wrong, and it wasnt the AppArmour folks who were wrong ...

 The straight ASCII VFS namespace *makes sense*, and yes, the raw physical
 objects space that SELinux uses makes sense as well.

 And no, i do not subscribe to the dogma that it is not possible to secure the
 ASCII VFS namespace: it evidently is possible, if you know and handle the
 ambiguitites. It is also obviously true that the ASCII VFS namespaces we use
 every day are a *lot* more intuitive than the labeled physical objects space
 ...

 What all the security flamewars missed is the simple fact that being intuitive
 matters a *lot* not just to not annoy users, but also to broaden the effective
 number of security-conscious developers ...

  Unfortunately this audit callback cannot be used for my purposes, because
  the event is single-purpose for auditd and because it allows no feedback
  (no deny/accept discretion for the security policy).
 
  But if had this simple event there:
 
      err = event_vfs_getname(result);

 Wow it sounds so easy.  Now lets keep extending your train of thought
 until we can actually provide the security provided by SELinux.  What do
 we end up with?  We end up with an event hook right next to every LSM
 hook.  You know, the LSM hooks were placed where they are for a reason.
 Because those were the locations inside the kernel where you actually
 have information about the task doing an operation and the objects
 (files, sockets, directories, other tasks, etc) they are doing an
 operation on.

 Honestly all you are talking about it remaking the LSM with 2 sets of
 hooks instead if 1.  Why? [...]

 Not at all. I am taking about using *one* set of events, to keep the intrusion
 at the lowest possible level.

 LSM could make use of them as well.

 Obviously for pragmatic reasons that might not be feasible initially.

 [...]  It seems much easier that if you want the language of the filter
 engine you would just make a new LSM that uses the filter engine for it's
 policy language rather than the language created by SELinux or SMACK or name
 your LSM implementation.

 Correct, that is what i suggested.

 Note that performance is a primary concern, so if certain filters are very
 popular then in practice this would come with support for a couple of 'built
 in' (pre-optimized) filters that the kernel can accelerate directly, so that 
 we
 do not incure the cost of executing the filter preds for really common-sense
 security policies that almost everyone is using.

 I.e. in the end we'd *roughly* end up with the same performance and security 
 as
 we are today (i mean, SELinux and the other LSMs did a nice job of collecting
 the things that apps should be careful about), but the crutial difference isnt
 just the advantages i menioned, but the fact that the *development model* of
 security modules would be a *lot* more extensible.

 So security efforts could move to a whole different level: they could move 
 into
 key apps and they could integrate with the general mind-set of developers.

 At least Will as an application framework developer cares, so that hope is
 justified i think.

   - unprivileged:  application-definable, allowing the embedding of security
                    policy in *apps* as well, not just the system
 
   - flexible:      can be added/removed runtime unprivileged, and cheaply so
 
   - transparent:   does not impact executing code that meets the policy
 
   - nestable:      it is inherited by child tasks and is fundamentally 
  stackable,
                    multiple policies will have the combined effect and they
                    are transparent to each other. So if a child task within 
  a
                    sandbox adds *more* checks then those add to the already
                    existing set of checks. We only narrow permissions, never
                    extend them.
 
   - generic:       allowing observation and (safe) control of security 
  relevant
                    parameters not just at the system call boundary but at 
  other
                    relevant places of kernel execution as well: which
                    points/callbacks could also be used for other types of 
  event
                    extraction such as perf. It could even be shared with 
  audit ...

 I'm not arguing that any of these things are bad things.  What you describe
 is a 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-14 Thread Arnd Bergmann
On Thursday 12 May 2011, Will Drewry wrote:
 This change adds a new seccomp mode based on the work by
 a...@chromium.org in [1]. This new mode, filter mode, provides a hash
 table of seccomp_filter objects.  When in the new mode (2), all system
 calls are checked against the filters - first by system call number,
 then by a filter string.  If an entry exists for a given system call and
 all filter predicates evaluate to true, then the task may proceed.
 Otherwise, the task is killed (as per seccomp_mode == 1).

I've got a question about this: Do you expect the typical usage to disallow
ioctl()? Given that ioctl alone is responsible for a huge number of exploits
in various drivers, while certain ioctls are immensely useful (FIONREAD,
FIOASYNC, ...), do you expect to extend the mechanism to filter specific
ioctl commands in the future?

Arnd

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-14 Thread Will Drewry
On Fri, May 13, 2011 at 2:35 PM, Arnd Bergmann a...@arndb.de wrote:
 On Thursday 12 May 2011, Will Drewry wrote:
 This change adds a new seccomp mode based on the work by
 a...@chromium.org in [1]. This new mode, filter mode, provides a hash
 table of seccomp_filter objects.  When in the new mode (2), all system
 calls are checked against the filters - first by system call number,
 then by a filter string.  If an entry exists for a given system call and
 all filter predicates evaluate to true, then the task may proceed.
 Otherwise, the task is killed (as per seccomp_mode == 1).

 I've got a question about this: Do you expect the typical usage to disallow
 ioctl()? Given that ioctl alone is responsible for a huge number of exploits
 in various drivers, while certain ioctls are immensely useful (FIONREAD,
 FIOASYNC, ...), do you expect to extend the mechanism to filter specific
 ioctl commands in the future?

In many cases, I do expect ioctl's to be dropped, but it's totally up
to whoever is setting the filters.

As is, it can already help out: [even though an LSM, if available,
would be appropriate to define a fine-grained policy]

ioctl() is hooked by the ftrace syscalls infrastructure (via SYSCALL_DEFINE3):

  SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned
long, arg)

This means you can do:
  sprintf(filter, cmd == %u || cmd == %u, FIOASYNC, FIONREAD);
  prctl(PR_SET_SECCOMP_FILTER, __NR_ioctl, filter);
  ...
  prctl(PR_SET_SECCOMP, 2, 0);
and then you'll be able to call ioctl on any fd with any argument but
limited to only the FIOASYNC and FIONREAD commands.

Depending on integration, it could even be limited to ioctl commands
that are appropriate to a known fd if the fd is opened prior to
entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
with a filter of 1 then narrowed through a later addition of
something like (fd == %u  (cmd == %u || cmd == %u)) or something
along those lines.

Does that make sense?

In general, this interface won't need specific extensions for most
system call oriented filtering events.  ftrace events may be expanded
(to include more system calls), but that's behind the scenes.  Only
arguments subject to time-of-check-time-of-use attacks (data living in
userspace passed in by pointer) are not safe to use via this
interface.  In theory, that limitation could also be lifted in the
implementation without changing the ABI.

Thanks!
will
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
  err = event_vfs_getname(result);
 
 I really think we should not do this. Events like we have them should be 
 inactive, totally passive entities, only observe but not affect execution 
 (other than the bare minimal time delay introduced by observance).

Well, this patchset already demonstrates that we can use a single event 
callback for a rather useful purpose.

Either it makes sense to do, in which case we should share facilities as much 
as possible, or it makes no sense, in which case we should not merge it at all.

 If you want another entity that is more active, please invent a new name for 
 it and create a new subsystem for them, now you could have these active 
 entities also have an (automatic) passive event side, but that's some detail.

Why should we have two callbacks next to each other:

event_vfs_getname(result);
result = check_event_vfs_getname(result);

if one could do it all?

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Peter Zijlstra
On Fri, 2011-05-13 at 14:26 +0200, Ingo Molnar wrote:
 * Peter Zijlstra pet...@infradead.org wrote:
 
  On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
   err = event_vfs_getname(result);
  
  I really think we should not do this. Events like we have them should be 
  inactive, totally passive entities, only observe but not affect execution 
  (other than the bare minimal time delay introduced by observance).
 
 Well, this patchset already demonstrates that we can use a single event 
 callback for a rather useful purpose.

Can and should are two distinct things.

 Either it makes sense to do, in which case we should share facilities as much 
 as possible, or it makes no sense, in which case we should not merge it at 
 all.

And I'm arguing we should _not_. Observing is radically different from
Affecting, at the very least the two things should have different
permission schemes. We should not confuse these two matters.

  If you want another entity that is more active, please invent a new name 
  for 
  it and create a new subsystem for them, now you could have these active 
  entities also have an (automatic) passive event side, but that's some 
  detail.
 
 Why should we have two callbacks next to each other:
 
   event_vfs_getname(result);
   result = check_event_vfs_getname(result);
 
 if one could do it all?

Did you actually read the bit where I said that check_event_* (although
I still think that name sucks) could imply a matching event_*?
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Peter Zijlstra
On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote:
 
event_vfs_getname(result);
result = check_event_vfs_getname(result); 

Another fundamental difference is how to treat the callback chains for
these two.

Observers won't have a return value and are assumed to never fail,
therefore we can always call every entry on the callback list.

Active things otoh do have a return value, and thus we need to have
semantics that define what to do with that during callback iteration,
when to continue and when to break. Thus for active elements its
impossible to guarantee all entries will indeed be called.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

  Why should we have two callbacks next to each other:
  
  event_vfs_getname(result);
  result = check_event_vfs_getname(result);
  
  if one could do it all?
 
 Did you actually read the bit where I said that check_event_* (although
 I still think that name sucks) could imply a matching event_*?

No, did not notice that - and yes that solves this particular problem.

So given that by your own admission it makes sense to share the facilities at 
the low level, i also argue that it makes sense to share as high up as 
possible.

Are you perhaps arguing for a -observe flag that would make 100% sure that the 
default behavior for events is observe-only? That would make sense indeed.

Otherwise both cases really want to use all the same facilities for event 
discovery, setup, control and potential extraction of events.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote:
  
 event_vfs_getname(result);
 result = check_event_vfs_getname(result); 
 
 Another fundamental difference is how to treat the callback chains for
 these two.
 
 Observers won't have a return value and are assumed to never fail,
 therefore we can always call every entry on the callback list.
 
 Active things otoh do have a return value, and thus we need to have
 semantics that define what to do with that during callback iteration,
 when to continue and when to break. Thus for active elements its
 impossible to guarantee all entries will indeed be called.

I think the sanest semantics is to run all active callbacks as well.

For example if this is used for three stacked security policies - as if 3 LSM 
modules were stacked at once. We'd call all three, and we'd determine that at 
least one failed - and we'd return a failure.

Even if the first one failed already we'd still want to trigger *all* the 
failures, because security policies like to know when they have triggered a 
failure (regardless of other active policies) and want to see that failure 
event (if they are logging such events).

So to me this looks pretty similar to observer callbacks as well, it's the 
natural extension to an observer callback chain.

Observer callbacks are simply constant functions (to the caller), those which 
never return failure and which never modify any of the parameters.

It's as if you argued that there should be separate syscalls/facilities for 
handling readonly files versus handling read/write files.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Peter Zijlstra
On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
 I think the sanest semantics is to run all active callbacks as well.
 
 For example if this is used for three stacked security policies - as if 3 LSM 
 modules were stacked at once. We'd call all three, and we'd determine that at 
 least one failed - and we'd return a failure. 

But that only works for boolean functions where you can return the
multi-bit-or of the result. What if you need to return the specific
error code.

Also, there's bound to be other cases where people will want to employ
this, look at all the various notifier chain muck we've got, it already
deals with much of this -- simply because users need it.

Then there's the whole indirection argument, if you don't need
indirection, its often better to not use it, I myself much prefer code
to look like:

   foo1(bar);
   foo2(bar);
   foo3(bar);

Than:

   foo_notifier(bar);

Simply because its much clearer who all are involved without me having
to grep around to see who registers for foo_notifier and wth they do
with it. It also makes it much harder to sneak in another user, whereas
its nearly impossible to find new notifier users.

Its also much faster, no extra memory accesses, no indirect function
calls, no other muck.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
  I think the sanest semantics is to run all active callbacks as well.
  
  For example if this is used for three stacked security policies - as if 3 
  LSM 
  modules were stacked at once. We'd call all three, and we'd determine that 
  at 
  least one failed - and we'd return a failure. 
 
 But that only works for boolean functions where you can return the
 multi-bit-or of the result. What if you need to return the specific
 error code.

Do you mean that one filter returns -EINVAL while the other -EACCES?

Seems like a non-problem to me, we'd return the first nonzero value.

 Also, there's bound to be other cases where people will want to employ
 this, look at all the various notifier chain muck we've got, it already
 deals with much of this -- simply because users need it.

Do you mean it would be easy to abuse it? What kind of abuse are you most 
worried about?

 Then there's the whole indirection argument, if you don't need
 indirection, its often better to not use it, I myself much prefer code
 to look like:
 
foo1(bar);
foo2(bar);
foo3(bar);
 
 Than:
 
foo_notifier(bar);
 
 Simply because its much clearer who all are involved without me having
 to grep around to see who registers for foo_notifier and wth they do
 with it. It also makes it much harder to sneak in another user, whereas
 its nearly impossible to find new notifier users.
 
 Its also much faster, no extra memory accesses, no indirect function
 calls, no other muck.

But i suspect this question has been settled, given the fact that even pure 
observer events need and already process a chain of events? Am i missing 
something about your argument?

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Peter Zijlstra
On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote:
 
 So given that by your own admission it makes sense to share the facilities at 
 the low level, i also argue that it makes sense to share as high up as 
 possible. 

I'm not saying any such thing, I'm saying that it might make sense to
observe active objects and auto-create these observation points. That
doesn't make them similar or make them share anything.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Peter Zijlstra
Cut the microblaze list since its bouncy.

On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
 * Peter Zijlstra pet...@infradead.org wrote:
 
  On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
   I think the sanest semantics is to run all active callbacks as well.
   
   For example if this is used for three stacked security policies - as if 3 
   LSM 
   modules were stacked at once. We'd call all three, and we'd determine 
   that at 
   least one failed - and we'd return a failure. 
  
  But that only works for boolean functions where you can return the
  multi-bit-or of the result. What if you need to return the specific
  error code.
 
 Do you mean that one filter returns -EINVAL while the other -EACCES?
 
 Seems like a non-problem to me, we'd return the first nonzero value.

Assuming the first is -EINVAL, what then is the value in computing the
-EACCESS? Sounds like a massive waste of time to me.

  Also, there's bound to be other cases where people will want to employ
  this, look at all the various notifier chain muck we've got, it already
  deals with much of this -- simply because users need it.
 
 Do you mean it would be easy to abuse it? What kind of abuse are you most 
 worried about?

I'm not worried about abuse, I'm saying that going by the existing
notifier pattern always visiting all entries on the callback list is
undesired.

  Then there's the whole indirection argument, if you don't need
  indirection, its often better to not use it, I myself much prefer code
  to look like:
  
 foo1(bar);
 foo2(bar);
 foo3(bar);
  
  Than:
  
 foo_notifier(bar);
  
  Simply because its much clearer who all are involved without me having
  to grep around to see who registers for foo_notifier and wth they do
  with it. It also makes it much harder to sneak in another user, whereas
  its nearly impossible to find new notifier users.
  
  Its also much faster, no extra memory accesses, no indirect function
  calls, no other muck.
 
 But i suspect this question has been settled, given the fact that even pure 
 observer events need and already process a chain of events? Am i missing 
 something about your argument?

I'm saying that there's reasons to not use notifiers passive or active.

Mostly the whole notifier/indirection muck comes up once you want
modules to make use of the thing, because then you need dynamic
management of the callback list.

(Then again, I'm fairly glad we don't have explicit callbacks in
kernel/cpu.c for all the cpu-hotplug callbacks :-)

Anyway, I oppose for the existing events to gain an active role.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 Cut the microblaze list since its bouncy.
 
 On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
  * Peter Zijlstra pet...@infradead.org wrote:
  
   On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
I think the sanest semantics is to run all active callbacks as well.

For example if this is used for three stacked security policies - as if 
3 LSM 
modules were stacked at once. We'd call all three, and we'd determine 
that at 
least one failed - and we'd return a failure. 
   
   But that only works for boolean functions where you can return the
   multi-bit-or of the result. What if you need to return the specific
   error code.
  
  Do you mean that one filter returns -EINVAL while the other -EACCES?
  
  Seems like a non-problem to me, we'd return the first nonzero value.
 
 Assuming the first is -EINVAL, what then is the value in computing the
 -EACCESS? Sounds like a massive waste of time to me.

No, because the common case is no rejection - this is a security mechanism. So 
in the normal case we would execute all 3 anyway, just to determine that all 
return 0.

Are you really worried about the abnormal case of one of them returning an 
error and us calculating all 3 return values?

   Also, there's bound to be other cases where people will want to employ
   this, look at all the various notifier chain muck we've got, it already
   deals with much of this -- simply because users need it.
  
  Do you mean it would be easy to abuse it? What kind of abuse are you most 
  worried about?
 
 I'm not worried about abuse, I'm saying that going by the existing
 notifier pattern always visiting all entries on the callback list is
 undesired.

That is because many notifier chains are used in an 'event consuming' manner - 
they are responding to things like hardware events and are called in an 
interrupt-handler alike fashion most of the time.

   Then there's the whole indirection argument, if you don't need
   indirection, its often better to not use it, I myself much prefer code
   to look like:
   
  foo1(bar);
  foo2(bar);
  foo3(bar);
   
   Than:
   
  foo_notifier(bar);
   
   Simply because its much clearer who all are involved without me having
   to grep around to see who registers for foo_notifier and wth they do
   with it. It also makes it much harder to sneak in another user, whereas
   its nearly impossible to find new notifier users.
   
   Its also much faster, no extra memory accesses, no indirect function
   calls, no other muck.
  
  But i suspect this question has been settled, given the fact that even pure 
  observer events need and already process a chain of events? Am i missing 
  something about your argument?
 
 I'm saying that there's reasons to not use notifiers passive or active.
 
 Mostly the whole notifier/indirection muck comes up once you want
 modules to make use of the thing, because then you need dynamic
 management of the callback list.

But your argument assumes that we'd have a chain of functions to call, like 
regular notifiers.

While the natural model here would be to have a list of registered event 
structs for that point, with different filters but basically the same callback 
mechanism (a call into the filter engine in essence).

Also note that the common case would be no event registered - and we'd 
automatically optimize that case via the existing jump labels optimization.

 (Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c 
 for all the cpu-hotplug callbacks :-)
 
 Anyway, I oppose for the existing events to gain an active role.

Why if 'being active' is optional and useful?

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Ingo Molnar

* Peter Zijlstra pet...@infradead.org wrote:

 On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote:
  
  So given that by your own admission it makes sense to share the facilities 
  at 
  the low level, i also argue that it makes sense to share as high up as 
  possible. 
 
 I'm not saying any such thing, I'm saying that it might make sense to
 observe active objects and auto-create these observation points. That
 doesn't make them similar or make them share anything.

Well, they would share the lowest level call site:

result = check_event_vfs_getname(result);

You call it 'auto-generated call site', i call it a shared (single line) call 
site. The same thing as far as the lowest level goes.

Now (the way i understood it) you'd want to stop the sharing right after that. 
I argue that it should go all the way up.

Note: i fully agree that there should be events where filters can have no 
effect whatsoever. For example if this was written as:

check_event_vfs_getname(result);

Then it would have no effect. This is decided by the subsystem developers, 
obviously. So whether an event is 'active' or 'passive' can be enforced at the 
subsystem level as well.

As far as the event facilities go, 'no effect observation' is a special-case of 
'active observation' - just like read-only files are a special case of 
read-write files.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Eric Paris
[dropping microblaze and roland]

lOn Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
 * James Morris jmor...@namei.org wrote:

 It is a simple and sensible security feature, agreed? It allows most code to 
 run well and link to countless libraries - but no access to other files is 
 allowed.

It's simple enough and sounds reasonable, but you can read all the
discussion about AppArmour why many people don't really think it's the
best.  Still, I'll agree it's a lot better than nothing.

 But if i had a VFS event at the fs/namei.c::getname() level, i would have 
 access to a central point where the VFS string becomes stable to the kernel 
 and 
 can be checked (and denied if necessary).
 
 A sidenote, and not surprisingly, the audit subsystem already has an event 
 callback there:
 
 audit_getname(result);
 
 Unfortunately this audit callback cannot be used for my purposes, because the 
 event is single-purpose for auditd and because it allows no feedback (no 
 deny/accept discretion for the security policy).
 
 But if had this simple event there:
 
   err = event_vfs_getname(result);

Wow it sounds so easy.  Now lets keep extending your train of thought
until we can actually provide the security provided by SELinux.  What do
we end up with?  We end up with an event hook right next to every LSM
hook.  You know, the LSM hooks were placed where they are for a reason.
Because those were the locations inside the kernel where you actually
have information about the task doing an operation and the objects
(files, sockets, directories, other tasks, etc) they are doing an
operation on.

Honestly all you are talking about it remaking the LSM with 2 sets of
hooks instead if 1.  Why?  It seems much easier that if you want the
language of the filter engine you would just make a new LSM that uses
the filter engine for it's policy language rather than the language
created by SELinux or SMACK or name your LSM implementation.

  - unprivileged:  application-definable, allowing the embedding of security 
   policy in *apps* as well, not just the system
 
  - flexible:  can be added/removed runtime unprivileged, and cheaply so
 
  - transparent:   does not impact executing code that meets the policy
 
  - nestable:  it is inherited by child tasks and is fundamentally 
 stackable,
   multiple policies will have the combined effect and they
   are transparent to each other. So if a child task within a
   sandbox adds *more* checks then those add to the already
   existing set of checks. We only narrow permissions, never
   extend them.
 
  - generic:   allowing observation and (safe) control of security relevant
   parameters not just at the system call boundary but at other
   relevant places of kernel execution as well: which 
   points/callbacks could also be used for other types of 
 event 
   extraction such as perf. It could even be shared with audit 
 ...

I'm not arguing that any of these things are bad things.  What you
describe is a new LSM that uses a discretionary access control model but
with the granularity and flexibility that has traditionally only existed
in the mandatory access control security modules previously implemented
in the kernel.

I won't argue that's a bad idea, there's no reason in my mind that a
process shouldn't be allowed to control it's own access decisions in a
more flexible way than rwx bits.  Then again, I certainly don't see a
reason that this syscall hardening patch should be held up while a whole
new concept in computer security is contemplated...

-Eric

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Eric Paris
[dropping microblaze and roland]

On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
 * Peter Zijlstra pet...@infradead.org wrote:
 
  On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
   I think the sanest semantics is to run all active callbacks as well.
   
   For example if this is used for three stacked security policies - as if 3 
   LSM 
   modules were stacked at once. We'd call all three, and we'd determine 
   that at 
   least one failed - and we'd return a failure. 
  
  But that only works for boolean functions where you can return the
  multi-bit-or of the result. What if you need to return the specific
  error code.
 
 Do you mean that one filter returns -EINVAL while the other -EACCES?
 
 Seems like a non-problem to me, we'd return the first nonzero value.

Sounds so easy!  Why haven't LSMs stacked already?  Because what happens
if one of these hooks did something stateful?  Lets say on open, hook #1
returns EPERM.  hook #2 allocates memory.  The open is going to fail and
hooks #2 is never going to get the close() which should have freed the
allocation.  If you can be completely stateless its easier, but there's
a reason that stacking security modules is hard.  Serge has tried in the
past and both dhowells and casey schaufler are working on it right now.
Stacking is never as easy as it sounds   :)

-Eric

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Peter Zijlstra
On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
 Then again, I certainly don't see a
 reason that this syscall hardening patch should be held up while a whole
 new concept in computer security is contemplated... 

Which makes me wonder why this syscall hardening stuff is done outside
of LSM? Why isn't is part of the LSM so that say SELinux can have a
syscall bitmask per security context?

Making it part of the LSM also avoids having to add this prctl().


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Peter Zijlstra
On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
 this is a security mechanism

Who says? and why would you want to unify two separate concepts only to
them limit it to security that just doesn't make sense.

Either you provide a full on replacement for notifier chain like things
or you don't, only extending trace events in this fashion for security
is like way weird.

Plus see the arguments Eric made about stacking stuff, not only security
schemes will have those problems.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Ingo Molnar

* James Morris jmor...@namei.org wrote:

 On Thu, 12 May 2011, Ingo Molnar wrote:
  Funnily enough, back then you wrote this:
  
 I'm concerned that we're seeing yet another security scheme being 
  designed on 
  the fly, without a well-formed threat model, and without taking into 
  account 
  lessons learned from the seemingly endless parade of similar, failed 
  schemes. 
  
  so when and how did your opinion of this scheme turn from it being an 
  endless parade of failed schemes to it being a well-defined and readily 
  understandable feature? :-)
 
 When it was defined in a way which limited its purpose to reducing the attack 
 surface of the sycall interface.

Let me outline a simple example of a new filter expression based security 
feature that could be implemented outside the narrow system call boundary you 
find acceptable, and please tell what is bad about it.

Say i'm a user-space sandbox developer who wants to enforce that sandboxed code 
should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.

It is a simple and sensible security feature, agreed? It allows most code to 
run well and link to countless libraries - but no access to other files is 
allowed.

I would also like my sandbox app to be able to install this policy without 
having to be root. I do not want the sandbox app to have permission to create 
labels on /lib and /usr/lib and what not.

Firstly, using the filter code i deny the various link creation syscalls so 
that sandboxed code cannot escape for example by creating a symlink to outside 
the permitted VFS namespace. (Note: we opt-in to syscalls, that way new 
syscalls added by new kernels are denied by defalt. The current symlink 
creation syscalls are not opted in to.)

But the next step, actually checking filenames, poses a big hurdle: i cannot 
implement the filename checking at the sys_open() syscall level in a secure 
way: because the pathname is passed to sys_open() by pointer, and if i check it 
at the generic sys_open() syscall level, another thread in the sandbox might 
modify the underlying filename *after* i've checked it.

But if i had a VFS event at the fs/namei.c::getname() level, i would have 
access to a central point where the VFS string becomes stable to the kernel and 
can be checked (and denied if necessary).

A sidenote, and not surprisingly, the audit subsystem already has an event 
callback there:

audit_getname(result);

Unfortunately this audit callback cannot be used for my purposes, because the 
event is single-purpose for auditd and because it allows no feedback (no 
deny/accept discretion for the security policy).

But if had this simple event there:

err = event_vfs_getname(result);

I could implement this new filename based sandboxing policy, using a filter 
like this installed on the vfs::getname event and inherited by all sandboxed 
tasks (which cannot uninstall the filter, obviously):

  
if (strstr(name, ..))
return -EACCESS;

if (!strncmp(name, /home/sandbox/, 14) 
!strncmp(name, /lib/, 5) 
!strncmp(name, /usr/lib/, 9))
return -EACCESS;

  

  #
  # Note1: Obviously the filter engine would be extended to allow such simple 
string
  #match functions. )
  #
  # Note2: .. is disallowed so that sandboxed code cannot escape the 
restrictions
  # using /...
  #

This kind of flexible and dynamic sandboxing would allow a wide range of file 
ops within the sandbox, while still isolating it from files not included in the 
specified VFS namespace.

( Note that there are tons of other examples as well, for useful security 
features
  that are best done using events outside the syscall boundary. )

The security event filters code tied to seccomp and syscalls at the moment is 
useful, but limited in its future potential.

So i argue that it should go slightly further and should become:

 - unprivileged:  application-definable, allowing the embedding of security 
  policy in *apps* as well, not just the system

 - flexible:  can be added/removed runtime unprivileged, and cheaply so

 - transparent:   does not impact executing code that meets the policy

 - nestable:  it is inherited by child tasks and is fundamentally stackable,
  multiple policies will have the combined effect and they
  are transparent to each other. So if a child task within a
  sandbox adds *more* checks then those add to the already
  existing set of checks. We only narrow permissions, never
  extend them.

 - generic:   allowing observation and (safe) control of security relevant
  parameters not just at the system call boundary but at other
  relevant places of kernel execution as well: which 
  points/callbacks could also be used for other types of event 
  extraction such 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Eric Paris
On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote:
 On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
  Then again, I certainly don't see a
  reason that this syscall hardening patch should be held up while a whole
  new concept in computer security is contemplated... 
 
 Which makes me wonder why this syscall hardening stuff is done outside
 of LSM? Why isn't is part of the LSM so that say SELinux can have a
 syscall bitmask per security context?

I could do that, but I like Will's approach better.  From the PoV of
meeting security goals of information flow, data confidentiality,
integrity, least priv, etc limiting on the syscall boundary doesn't make
a lot of sense.  You just don't know enough there to enforce these
things.  These are the types of goals that SELinux and other LSMs have
previously tried to enforce.  From the PoV of making the kernel more
resistant to attacks and making a process more resistant to misbehavior
I think that the syscall boundary is appropriate.  Although I could do
it in SELinux it don't really want to do it there.

In case people are interested or confused let me give my definition of
two words I've used a bit in these conversations: discretionary and
mandatory.  Any time I talk about a 'discretionary' security decision it
is a security decisions that a process imposed upon itself.  Aka the
choice to use seccomp is discretionary.  The choice to mark our own file
u-wx is discretionary.  This isn't the best definition but it's one that
works well in this discussion.  Mandatory security is one enforce by a
global policy.  It's what selinux is all about.  SELinux doesn't give
hoot what a process wants to do, it enforces a global policy from the
top down.  You take over a process, well, too bad, you still have no
choice but to follow the mandatory policy.

The LSM does NOT enforce a mandatory access control model, it's just how
it's been used in the past.  Ingo appears to me (please correct me if
I'm wrong) to really be a fan of exposing the flexibility of the LSM to
a discretionary access control model.  That doesn't seem like a bad
idea.  And maybe using the filter engine to define the language to do
this isn't a bad idea either.  But I think that's a 'down the road'
project, not something to hold up a better seccomp.

 Making it part of the LSM also avoids having to add this prctl().

Well, it would mean exposing some new language construct to every LSM
(instead of a single prctl construct) and it would mean anyone wanting
to use the interface would have to rely on the LSM implementing those
hooks the way they need it.  Honestly chrome can already get all of the
benefits of this patch (given a perfectly coded kernel) and a whole lot
more using SELinux, but (surprise surprise) not everyone uses SELinux.
I think it's a good idea to expose a simple interface which will be
widely enough adopted that many userspace applications can rely on it
for hardening.

The existence of the LSM and the fact that there exists multiple
security modules that may or may not be enabled really leads application
developers to be unable to rely on LSM for security.  If linux had a
single security model which everyone could rely on we wouldn't really
have as big of an issue but that's not possible.  So I'm advocating for
this series which will provide a single useful change which applications
can rely upon across distros and platforms to enhance the properties and
abilities of the linux kernel.

-Eric

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-13 Thread Will Drewry
On Fri, May 13, 2011 at 10:55 AM, Eric Paris epa...@redhat.com wrote:
 On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote:
 On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
  Then again, I certainly don't see a
  reason that this syscall hardening patch should be held up while a whole
  new concept in computer security is contemplated...

 Which makes me wonder why this syscall hardening stuff is done outside
 of LSM? Why isn't is part of the LSM so that say SELinux can have a
 syscall bitmask per security context?

 I could do that, but I like Will's approach better.  From the PoV of
 meeting security goals of information flow, data confidentiality,
 integrity, least priv, etc limiting on the syscall boundary doesn't make
 a lot of sense.  You just don't know enough there to enforce these
 things.  These are the types of goals that SELinux and other LSMs have
 previously tried to enforce.  From the PoV of making the kernel more
 resistant to attacks and making a process more resistant to misbehavior
 I think that the syscall boundary is appropriate.  Although I could do
 it in SELinux it don't really want to do it there.

There's also the problem that there are no hooks per-system call for
LSMs, only logical hooks that sometimes mirror system call names and
are called after user data has been parsed.  If system call enter
hooks, like seccomp's, were added for LSMs, it would allow the lsm
bitmask approach, but it still wouldn't satisfy the issues you raise
below (and I wholeheartedly agree with).

 In case people are interested or confused let me give my definition of
 two words I've used a bit in these conversations: discretionary and
 mandatory.  Any time I talk about a 'discretionary' security decision it
 is a security decisions that a process imposed upon itself.  Aka the
 choice to use seccomp is discretionary.  The choice to mark our own file
 u-wx is discretionary.  This isn't the best definition but it's one that
 works well in this discussion.  Mandatory security is one enforce by a
 global policy.  It's what selinux is all about.  SELinux doesn't give
 hoot what a process wants to do, it enforces a global policy from the
 top down.  You take over a process, well, too bad, you still have no
 choice but to follow the mandatory policy.

 The LSM does NOT enforce a mandatory access control model, it's just how
 it's been used in the past.  Ingo appears to me (please correct me if
 I'm wrong) to really be a fan of exposing the flexibility of the LSM to
 a discretionary access control model.  That doesn't seem like a bad
 idea.  And maybe using the filter engine to define the language to do
 this isn't a bad idea either.  But I think that's a 'down the road'
 project, not something to hold up a better seccomp.

 Making it part of the LSM also avoids having to add this prctl().

 Well, it would mean exposing some new language construct to every LSM
 (instead of a single prctl construct) and it would mean anyone wanting
 to use the interface would have to rely on the LSM implementing those
 hooks the way they need it.  Honestly chrome can already get all of the
 benefits of this patch (given a perfectly coded kernel) and a whole lot
 more using SELinux, but (surprise surprise) not everyone uses SELinux.
 I think it's a good idea to expose a simple interface which will be
 widely enough adopted that many userspace applications can rely on it
 for hardening.

 The existence of the LSM and the fact that there exists multiple
 security modules that may or may not be enabled really leads application
 developers to be unable to rely on LSM for security.  If linux had a
 single security model which everyone could rely on we wouldn't really
 have as big of an issue but that's not possible.  So I'm advocating for
 this series which will provide a single useful change which applications
 can rely upon across distros and platforms to enhance the properties and
 abilities of the linux kernel.

 -Eric


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread Ingo Molnar

Ok, i like the direction here, but i think the ABI should be done differently.

In this patch the ftrace event filter mechanism is used:

* Will Drewry w...@chromium.org wrote:

 +static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
 +const char *filter_string)
 +{
 + int err = -ENOMEM;
 + struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
 + GFP_KERNEL);
 + if (!filter)
 + goto fail;
 +
 + INIT_HLIST_NODE(filter-node);
 + filter-syscall_nr = syscall_nr;
 + filter-data = syscall_nr_to_meta(syscall_nr);
 +
 + /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
 +  * using a predicate at all.
 +  */
 + if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
 + goto out;
 +
 + /* Argument-based filtering only works on ftrace-hooked syscalls. */
 + if (!filter-data) {
 + err = -ENOSYS;
 + goto fail;
 + }
 +
 +#ifdef CONFIG_FTRACE_SYSCALLS
 + err = ftrace_parse_filter(filter-event_filter,
 +   filter-data-enter_event-event.type,
 +   filter_string);
 + if (err)
 + goto fail;
 +#endif
 +
 +out:
 + return filter;
 +
 +fail:
 + kfree(filter);
 + return ERR_PTR(err);
 +}

Via a prctl() ABI:

 --- a/kernel/sys.c
 +++ b/kernel/sys.c
 @@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
 arg2, unsigned long, arg3,
   case PR_SET_ENDIAN:
   error = SET_ENDIAN(me, arg2);
   break;
 -
   case PR_GET_SECCOMP:
   error = prctl_get_seccomp();
   break;
   case PR_SET_SECCOMP:
 - error = prctl_set_seccomp(arg2);
 + error = prctl_set_seccomp(arg2, arg3);
 + break;
 + case PR_SET_SECCOMP_FILTER:
 + error = prctl_set_seccomp_filter(arg2,
 +  (char __user *) arg3);
 + break;
 + case PR_CLEAR_SECCOMP_FILTER:
 + error = prctl_clear_seccomp_filter(arg2);
 + break;
 + case PR_GET_SECCOMP_FILTER:
 + error = prctl_get_seccomp_filter(arg2,
 +  (char __user *) arg3,
 +  arg4);

To restrict execution to system calls.

Two observations:

1) We already have a specific ABI for this: you can set filters for events via 
   an event fd.

   Why not extend that mechanism instead and improve *both* your sandboxing
   bits and the events code? This new seccomp code has a lot more
   to do with trace event filters than the minimal old seccomp code ...

   kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
   interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
   mostly trivial code.

2) Why should this concept not be made available wider, to allow the 
   restriction of not just system calls but other security relevant components 
   of the kernel as well?

   This too, if you approach the problem via the events code, will be a natural 
   end result, while if you approach it from the seccomp prctl angle it will be
   a limited hack only.

Note, the end result will be the same - just using a different ABI.

So i really think the ABI itself should be closer related to the event code. 
What this seccomp code does is that it uses specific syscall events to 
restrict execution of certain event generating codepaths, such as system calls.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread Kees Cook
Hi,

On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
 1) We already have a specific ABI for this: you can set filters for events 
 via 
an event fd.
 
Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...

Would this require privileges to get the event fd to start with? If so,
I would prefer to avoid that, since using prctl() as shown in the patch
set won't require any privs.

-Kees

-- 
Kees Cook
Ubuntu Security Team
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread Ingo Molnar

* Kees Cook kees.c...@canonical.com wrote:

 Hi,
 
 On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
  1) We already have a specific ABI for this: you can set filters for events 
  via 
 an event fd.
  
 Why not extend that mechanism instead and improve *both* your sandboxing
 bits and the events code? This new seccomp code has a lot more
 to do with trace event filters than the minimal old seccomp code ...
 
 Would this require privileges to get the event fd to start with? [...]

No special privileges with the default perf_events_paranoid value.

 [...] If so, I would prefer to avoid that, since using prctl() as shown in 
 the patch set won't require any privs.

and we could also explicitly allow syscall events without any privileges, 
regardless of the setting of 'perf_events_paranoid' config value.

Obviously a sandboxing host process wants to run with as low privileges as it 
can.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread James Morris
On Wed, 11 May 2011, Will Drewry wrote:

 +void seccomp_filter_log_failure(int syscall)
 +{
 + printk(KERN_INFO
 + %s[%d]: system call %d (%s) blocked at ip:%lx\n,
 + current-comm, task_pid_nr(current), syscall,
 + syscall_nr_to_name(syscall), KSTK_EIP(current));
 +}

I think it'd be a good idea to utilize the audit facility here.


- James
-- 
James Morris
jmor...@namei.org
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread James Morris
On Thu, 12 May 2011, Ingo Molnar wrote:

 
 2) Why should this concept not be made available wider, to allow the 
restriction of not just system calls but other security relevant 
 components 
of the kernel as well?

Because the aim of this is to reduce the attack surface of the syscall 
interface.

LSM is the correct level of abstraction for general security mediation, 
because it allows you to take into account all relevant security 
information in a race-free context.


This too, if you approach the problem via the events code, will be a 
 natural 
end result, while if you approach it from the seccomp prctl angle it will 
 be
a limited hack only.

I'd say it's a well-defined and readily understandable feature.


- James
-- 
James Morris
jmor...@namei.org
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread Frederic Weisbecker
On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
 To restrict execution to system calls.
 
 Two observations:
 
 1) We already have a specific ABI for this: you can set filters for events 
 via 
an event fd.
 
Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...
 
kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
mostly trivial code.
 
 2) Why should this concept not be made available wider, to allow the 
restriction of not just system calls but other security relevant 
 components 
of the kernel as well?
 
This too, if you approach the problem via the events code, will be a 
 natural 
end result, while if you approach it from the seccomp prctl angle it will 
 be
a limited hack only.
 
 Note, the end result will be the same - just using a different ABI.
 
 So i really think the ABI itself should be closer related to the event code. 
 What this seccomp code does is that it uses specific syscall events to 
 restrict execution of certain event generating codepaths, such as system 
 calls.
 
 Thanks,
 
   Ingo

What's positive with that approach is that the code is all there already.
Create a perf event for a given trace event, attach a filter to it.

What needs to be added is an override of the effect of the filter. By default
it's dropping the event, but there may be different flavours, including sending
a signal. All in one, extending the current code to allow that looks trivial.

The negative points are that

* trace events are supposed to stay passive and not act on the system, except
doing some endpoint things like writing to a buffer. We can't call do_exit()
from a tracepoint for example, preemption is disabled there.

* Also, is it actually relevant to extend that seccomp filtering to other events
than syscalls? Exposing kernel events to filtering sounds actually to me 
bringing
a new potential security issue. But with fine restrictions this can probably
be dealt with. Especially if by default only syscalls can be filtered

* I think Peter did not want to give such active role to perf in the system.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread Ingo Molnar

* James Morris jmor...@namei.org wrote:

 On Thu, 12 May 2011, Ingo Molnar wrote:
 
  2) Why should this concept not be made available wider, to allow the 
 restriction of not just system calls but other security relevant 
  components 
 of the kernel as well?
 
 Because the aim of this is to reduce the attack surface of the syscall 
 interface.

What i suggest achieves the same, my argument is that we could aim it to be 
even more flexible and even more useful.

 LSM is the correct level of abstraction for general security mediation, 
 because it allows you to take into account all relevant security information 
 in a race-free context.

I don't care about LSM though, i find it poorly designed.

The approach implemented here, the ability for *unprivileged code* to define 
(the seeds of ...) flexible security policies, in a proper Linuxish way, which 
is inherited along the task parent/child hieararchy and which allows nesting 
etc. is a *lot* more flexible.

What Will implemented here is pretty huge in my opinion: it turns security from 
a root-only kind of weird hack into an essential component of its APIs, 
available to *any* app not just the select security policy/mechanism chosen by 
the distributor ...

If implemented properly this could replace LSM in the long run.

As a prctl() hack bound to seccomp (which, by all means, is a natural extension 
to the current seccomp ABI, so perfectly fine if we only want that scope), that 
is much less likely to happen.

And if we merge the seccomp interface prematurely then interest towards a more 
flexible approach will disappear, so either we do it properly now or it will 
take some time for someone to come around and do it ...

Also note that i do not consider the perf events ABI itself cast into stone - 
and we could very well add a new system call for this, independent of perf 
events. I just think that the seccomp scope itself is exciting but looks 
limited to what the real potential of this could be.

 This too, if you approach the problem via the events code, will be a 
  natural 
 end result, while if you approach it from the seccomp prctl angle it 
  will be
 a limited hack only.
 
 I'd say it's a well-defined and readily understandable feature.

Note, it was me who suggested this very event-filter-engine design a year ago, 
when the first submission still used a crude bitmap of allowed seccomp 
syscalls:

  http://lwn.net/Articles/332974/

Funnily enough, back then you wrote this:

   I'm concerned that we're seeing yet another security scheme being designed 
on 
the fly, without a well-formed threat model, and without taking into 
account 
lessons learned from the seemingly endless parade of similar, failed 
schemes. 

so when and how did your opinion of this scheme turn from it being an endless 
parade of failed schemes to it being a well-defined and readily 
understandable feature? :-)

The idea itself has not changed since last year, what happened is that the 
filter engine got a couple of new features and Will has separated it out and 
has implemented a working prototype for sandboxing.

What i do here is to suggest *further* steps down the same road, now that we 
see that this scheme can indeed be used to implement sandboxing ... I think 
it's a valid line of inquiry.

Thanks,

Ingo
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread Will Drewry
[Thanks to everyone for the continued feedback and insights - I appreciate it!]

On Thu, May 12, 2011 at 8:01 AM, Ingo Molnar mi...@elte.hu wrote:

 * James Morris jmor...@namei.org wrote:

 On Thu, 12 May 2011, Ingo Molnar wrote:

  2) Why should this concept not be made available wider, to allow the
     restriction of not just system calls but other security relevant 
  components
     of the kernel as well?

 Because the aim of this is to reduce the attack surface of the syscall
 interface.

 What i suggest achieves the same, my argument is that we could aim it to be
 even more flexible and even more useful.

 LSM is the correct level of abstraction for general security mediation,
 because it allows you to take into account all relevant security information
 in a race-free context.

 I don't care about LSM though, i find it poorly designed.

 The approach implemented here, the ability for *unprivileged code* to define
 (the seeds of ...) flexible security policies, in a proper Linuxish way, which
 is inherited along the task parent/child hieararchy and which allows nesting
 etc. is a *lot* more flexible.

 What Will implemented here is pretty huge in my opinion: it turns security 
 from
 a root-only kind of weird hack into an essential component of its APIs,
 available to *any* app not just the select security policy/mechanism chosen by
 the distributor ...

 If implemented properly this could replace LSM in the long run.

 As a prctl() hack bound to seccomp (which, by all means, is a natural 
 extension
 to the current seccomp ABI, so perfectly fine if we only want that scope), 
 that
 is much less likely to happen.

 And if we merge the seccomp interface prematurely then interest towards a more
 flexible approach will disappear, so either we do it properly now or it will
 take some time for someone to come around and do it ...

 Also note that i do not consider the perf events ABI itself cast into stone -
 and we could very well add a new system call for this, independent of perf
 events. I just think that the seccomp scope itself is exciting but looks
 limited to what the real potential of this could be.

I agree with you on many of these points!  However, I don't think that
the views around LSMs, perf/ftrace infrastructure, or the current
seccomp filtering implementation are necessarily in conflict.  Here is
my understanding of how the different worlds fit together and where I
see this patchset living, along with where I could see future work
going.  Perhaps I'm being a trifle naive, but here goes anyway:

1. LSMs provide a global mechanism for hooking security relevant
events at a point where all the incoming user-sourced data has been
preprocessed and moved into userspace.  The hooks are called every
time one of those boundaries are crossed.
2. Perf and the ftrace infrastructure provide global function tracing
and system call hooks with direct access to the caller's registers
(and memory).
3. seccomp (as it exists today) provides a global system call entry
hook point with a binary per-process decision about whether to provide
secure computing behavior.

When I boil that down to abstractions, I see:
A. Globally scoped: LSMs, ftrace/perf
B. Locally/process scoped: seccomp

The result of that logical equivalence is that I see room for:
I. A per-process, locally scoped security event hooking interface (the
proposed changes in this patchset)
II. A globally scoped security event hooking interface _prior_ to
argument processing
III. A globally scoped security event hooking interface _post_
argument processing

II and III could be reduced further if I assume that ftrace/perf
provides (II) and a simple intermediary layer (hook entry/exit)
provides the argument processing steps that then call out a global
security policy system.

The driving motivation for this patchset is kernel attack surface
reduction, but that need arises because we lack a process-scoped
mechanism for making security decisions -- everything is global:
creds/DAC, containers, LSM, etc.   Adding ftrace filtering to agl's
original bitmask-seccomp proposal opens up the process-local security
world.  At present, it can limit the attack surface with simple binary
filters or apply limited security policy through the use of filter
strings.

Based on your mails, I see two main deficiencies in my proposed patchset:
a. Deep argument analysis: Any arguments that live in user memory
needs to be copied into the kernel, then checked, and substituted for
the actual system call, then have the original pointers restored (when
applicable) on system call exit.  There is a large overhead here and
the LSM hooks provide much of this support on a global level.
b. Lack of support for non-system call events.

For (a), if the long term view of ftrace/perf  LSMs is that LSM-like
functionality will live on top of the ftrace/perf infrastructure, then
adding support for the intermediary layer to analyze arguments will
come with time.  It's also likely that for 

Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

2011-05-12 Thread James Morris
On Thu, 12 May 2011, Ingo Molnar wrote:
 Funnily enough, back then you wrote this:
 
I'm concerned that we're seeing yet another security scheme being 
 designed on 
 the fly, without a well-formed threat model, and without taking into 
 account 
 lessons learned from the seemingly endless parade of similar, failed 
 schemes. 
 
 so when and how did your opinion of this scheme turn from it being an 
 endless 
 parade of failed schemes to it being a well-defined and readily 
 understandable feature? :-)

When it was defined in a way which limited its purpose to reducing the 
attack surface of the sycall interface.


- James
-- 
James Morris
jmor...@namei.org
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev