Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
On 05/12/2015 06:23 PM, Andy Lutomirski wrote:
> On May 13, 2015 6:06 AM, "Chris Metcalf" wrote:
>> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf wrote:
>>>> In this case, killing the task is appropriate, since that's exactly
>>>> the semantics that have been asked for - it's like on architectures
>>>> that don't natively support unaligned accesses, but fake it relatively
>>>> slowly in the kernel, and in development you just say "give me a
>>>> SIGBUS when that happens" and in production you might say
>>>> "fix it up and let's try to keep going".
>>>
>>> I think more control is needed. I also think that, if we go this
>>> route, we should distinguish syscalls, synchronous non-syscall
>>> entries, and asynchronous non-syscall entries. They're quite
>>> different.
>>
>> I don't think it's necessary to distinguish the types. As long as we
>> have a PC pointing to the instruction that triggered the problem,
>> we can see if it's a system call instruction, a memory write that
>> caused a page fault, a trap instruction, etc.
>
> Not true. PC right after a syscall insn could be any type of kernel
> entry, and you can't even reliably tell whether the syscall insn was
> executed or, on x86, whether it was a syscall at all. (x86 insns
> can't be reliably decoded backwards.)
>
> PC pointing at a load could be a page fault or an IPI.

All that we are trying to do with this API, though, is distinguish
synchronous faults. So IPIs, etc., should not be happening
(they would be bugs), and hopefully we are mostly just
distinguishing different types of synchronous program entries.
That said, I did a si_info flag to differentiate syscalls from other
synchronous entries, and I'm open to looking at more such if
it seems useful.

> Again, though, I think we really do need to distinguish at least MCE
> and NMI (on x86) from the others.

Yes, those are both interesting cases, and I'm not entirely
sure what the right way to handle them is - for example,
likely disable STRICT if you are running with perf enabled.

I look forward to hearing more when you're back next week!

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
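The split argued over above (syscalls, synchronous non-syscall entries, asynchronous non-syscall entries, with MCE and NMI kept apart) could be modelled roughly as follows. This is only a sketch for discussion: every enum name and the classify_entry() helper are hypothetical and appear in none of the posted patches, and how MCE should really be classed is left open in the thread.

```c
/* Hypothetical reasons for entering the kernel; illustration only. */
enum entry_reason {
	ENTRY_SYSCALL,
	ENTRY_PAGE_FAULT,
	ENTRY_TRAP,
	ENTRY_IRQ,
	ENTRY_IPI,
	ENTRY_NMI,
	ENTRY_MCE,
};

/* Hypothetical classes, following the split suggested in the thread. */
enum entry_class {
	ENTRY_CLASS_SYSCALL,	/* application asked for kernel service */
	ENTRY_CLASS_SYNC,	/* caused by the instruction being executed */
	ENTRY_CLASS_ASYNC,	/* unrelated to the current instruction */
	ENTRY_CLASS_SPECIAL,	/* MCE and NMI, kept apart from the rest */
};

/* Map an entry reason to its class. */
static enum entry_class classify_entry(enum entry_reason reason)
{
	switch (reason) {
	case ENTRY_SYSCALL:
		return ENTRY_CLASS_SYSCALL;
	case ENTRY_PAGE_FAULT:
	case ENTRY_TRAP:
		return ENTRY_CLASS_SYNC;
	case ENTRY_IRQ:
	case ENTRY_IPI:
		return ENTRY_CLASS_ASYNC;
	case ENTRY_NMI:
	case ENTRY_MCE:
		return ENTRY_CLASS_SPECIAL;
	}
	/* Unknown reasons are treated as asynchronous (an assumption). */
	return ENTRY_CLASS_ASYNC;
}
```

With something like this, a STRICT check would presumably signal the task only for the SYSCALL and SYNC classes and leave ASYNC entries to kernel-side diagnostics, which is the distinction Chris draws between application bugs and kernel bugs.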
Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
> [add peterz due to perf stuff]
>
> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf wrote:
>> Patch 6/6 proposes a mechanism to track down times when the
>> kernel screws up and delivers an IRQ to a userspace-only task.
>> Here, we're just trying to identify the times when an application
>> screws itself up out of cluelessness, and provide a mechanism
>> that allows the developer to easily figure out why and fix it.
>>
>> In particular, /proc/interrupts won't show syscalls or page faults,
>> which are two easy ways applications can screw themselves
>> when they think they're in userspace-only mode. Also, they don't
>> provide sufficient precision to make it clear what part of the
>> application caused the undesired kernel entry.
>
> Perf does, though, complete with context.

The perf_event suggestions are interesting, but I think it's plausible
for this to be an alternate way to debug the issues that STRICT
addresses.

>> In this case, killing the task is appropriate, since that's exactly
>> the semantics that have been asked for - it's like on architectures
>> that don't natively support unaligned accesses, but fake it relatively
>> slowly in the kernel, and in development you just say "give me a
>> SIGBUS when that happens" and in production you might say
>> "fix it up and let's try to keep going".
>
> I think more control is needed. I also think that, if we go this
> route, we should distinguish syscalls, synchronous non-syscall
> entries, and asynchronous non-syscall entries. They're quite
> different.

I don't think it's necessary to distinguish the types. As long as we
have a PC pointing to the instruction that triggered the problem,
we can see if it's a system call instruction, a memory write that
caused a page fault, a trap instruction, etc. We certainly could
add infrastructure to capture syscall numbers, fault/signal numbers,
etc etc, but I think it's overkill if it adds kernel overhead on
entry/exit.

>> A better implementation, I think, is to put the tests for "you
>> screwed up and synchronously entered the kernel" in
>> the syscall_trace_enter() code, which TIF_NOHZ already
>> gets us into;
>
> No, not unless you're planning on using that to distinguish syscalls
> from other stuff *and* people think that's justified.

So, the question is how we separate synchronous entries
from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated
tasks), and synchronous entries are application bugs. We'd
like to deliver a signal for the latter, and do some kind of
kernel diagnostics for the former. So we can't just add the
test in the context tracking code, which doesn't actually know
why we're entering or exiting.

That's why I was thinking that the syscall_trace_entry and
exception_enter paths were the best choices. I'm fairly sure
that exception_enter is only done for synchronous traps,
page faults, etc.

Certainly on the tile architecture we include the trap number
in the pt_regs, so it's possible to just examine the pt_regs and
know why you entered or are exiting the kernel, but I don't
think we can rely on that for all architectures.

> It's far too easy to just make a tiny change to the entry code. Add a
> tiny trivial change here, a few lines of asm (that's you, audit!)
> there, some weird written-in-asm scheduling code over here, and you
> end up with the truly awful mess that we currently have.
>
> If it really makes sense for this stuff to go with context tracking,
> then fine, but we should *fix* the context tracking first rather than
> kludging around it. I already have a prototype patch for the relevant
> part of that.
>
>> there, we can test if the dataplane "strict" bit is
>> set and the syscall is not prctl(), then we generate the error.
>> (We'd exclude exit and exit_group here too, since we don't
>> need to shoot down a task that's just trying to kill itself.)
>> This needs a bit of platform-specific code for each platform,
>> but that doesn't seem like too big a problem.
>
> I'd rather avoid that, too. This feature isn't really arch-specific,
> so let's avoid the arch stuff if at all possible.

I'll put out a v2 of my patch that does both the things you
advise against :-) just so we can have a strawman to think
about how to do it better - unless you have a suggestion
offhand as to how we can better differentiate sync and async
entries into the kernel in a platform-independent way.

I could imagine modifying user_exit() and exception_enter()
to pass an identifier into the context system saying why they
were changing contexts, so we could have syscalls, trap
numbers, fault numbers, etc., and some way to query as
to whether they were synchronous or asynchronous, and
build this scheme on top of that, but I'm not sure the extra
infrastructure is worthwhile.

>> Likewise we can test in exception_enter() since that's only called
>> for all the synchronous user entries like page faults.
>
> Let's try to generalize a bit. There's also irq_entry and ist_enter,
> and some of the exception_enter cases are for synchronous entries
> while (IIRC -- could be
Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
On Tue, May 12, 2015 at 11:38:58AM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> > +++ b/kernel/time/tick-sched.c
> > @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
> >  			(jiffies - start));
> >  		dump_stack();
> >  	}
> > +
> > +	/*
> > +	 * Kill the process if it violates STRICT mode. Note that this
> > +	 * code also results in killing the task if a kernel bug causes an
> > +	 * irq to be delivered to this core.
> > +	 */
> > +	if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> > +	    == PR_DATAPLANE_STRICT) {
> > +		pr_warn("Dataplane STRICT mode violated; process killed.\n");
> > +		dump_stack();
> > +		task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> > +		local_irq_enable();
> > +		do_group_exit(SIGKILL);
> > +	}
> >  }
>
> So while I'm all for hard fails like this, can we not provide a wee bit
> more information in the siginfo? And maybe use a slightly less fatal
> signal, such that userspace can actually catch it and dump state in
> debug modes?

Agreed, a bit more debug state would be helpful.

Thanx, Paul
Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
On Fri, May 08, 2015 at 01:58:46PM -0400, Chris Metcalf wrote:
> +++ b/kernel/time/tick-sched.c
> @@ -436,6 +436,20 @@ static void dataplane_quiesce(void)
>  			(jiffies - start));
>  		dump_stack();
>  	}
> +
> +	/*
> +	 * Kill the process if it violates STRICT mode. Note that this
> +	 * code also results in killing the task if a kernel bug causes an
> +	 * irq to be delivered to this core.
> +	 */
> +	if ((task->dataplane_flags & (PR_DATAPLANE_STRICT|PR_DATAPLANE_PRCTL))
> +	    == PR_DATAPLANE_STRICT) {
> +		pr_warn("Dataplane STRICT mode violated; process killed.\n");
> +		dump_stack();
> +		task->dataplane_flags &= ~PR_DATAPLANE_QUIESCE;
> +		local_irq_enable();
> +		do_group_exit(SIGKILL);
> +	}
>  }

So while I'm all for hard fails like this, can we not provide a wee bit
more information in the siginfo? And maybe use a slightly less fatal
signal, such that userspace can actually catch it and dump state in
debug modes?
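For reference, the userspace side of what Peter asks for (a catchable signal whose siginfo can be inspected) might look roughly like the sketch below. It uses only the standard sigaction() interface. Which signal the kernel would send and what si_code it would carry are not settled in this thread, so the handler and helper names here are hypothetical and the caller picks the signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Last signal seen by the handler; sig_atomic_t keeps the stores safe. */
static volatile sig_atomic_t last_signo;
static volatile sig_atomic_t last_code;

/*
 * Hypothetical handler: record what was delivered so that a debug
 * build can dump state later, outside of signal context.
 */
static void strict_handler(int signo, siginfo_t *info, void *ucontext)
{
	(void)ucontext;
	last_signo = signo;
	last_code = info->si_code;
}

/* Install the handler for the given signal; returns 0 on success. */
static int install_strict_handler(int signo)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = strict_handler;
	sa.sa_flags = SA_SIGINFO;	/* ask for the siginfo_t argument */
	sigemptyset(&sa.sa_mask);
	return sigaction(signo, &sa, NULL);
}
```

None of this is possible with SIGKILL, which cannot be caught; that is the practical reason for preferring a less fatal signal while debugging.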
Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
On May 13, 2015 6:06 AM, Chris Metcalf <cmetc...@ezchip.com> wrote:
> On 05/11/2015 06:28 PM, Andy Lutomirski wrote:
>> [add peterz due to perf stuff]
>>
>> On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf <cmetc...@ezchip.com> wrote:
>>> Patch 6/6 proposes a mechanism to track down times when the
>>> kernel screws up and delivers an IRQ to a userspace-only task.
>>> Here, we're just trying to identify the times when an application
>>> screws itself up out of cluelessness, and provide a mechanism
>>> that allows the developer to easily figure out why and fix it.
>>>
>>> In particular, /proc/interrupts won't show syscalls or page faults,
>>> which are two easy ways applications can screw themselves
>>> when they think they're in userspace-only mode. Also, they don't
>>> provide sufficient precision to make it clear what part of the
>>> application caused the undesired kernel entry.
>>
>> Perf does, though, complete with context.
>
> The perf_event suggestions are interesting, but I think it's plausible
> for this to be an alternate way to debug the issues that STRICT
> addresses.
>
>>> In this case, killing the task is appropriate, since that's exactly
>>> the semantics that have been asked for - it's like on architectures
>>> that don't natively support unaligned accesses, but fake it relatively
>>> slowly in the kernel, and in development you just say "give me a
>>> SIGBUS when that happens" and in production you might say
>>> "fix it up and let's try to keep going".
>>
>> I think more control is needed. I also think that, if we go this
>> route, we should distinguish syscalls, synchronous non-syscall
>> entries, and asynchronous non-syscall entries. They're quite
>> different.
>
> I don't think it's necessary to distinguish the types. As long as we
> have a PC pointing to the instruction that triggered the problem,
> we can see if it's a system call instruction, a memory write that
> caused a page fault, a trap instruction, etc.

Not true. PC right after a syscall insn could be any type of kernel
entry, and you can't even reliably tell whether the syscall insn was
executed or, on x86, whether it was a syscall at all. (x86 insns
can't be reliably decoded backwards.)

PC pointing at a load could be a page fault or an IPI.

> We certainly could
> add infrastructure to capture syscall numbers, fault/signal numbers,
> etc etc, but I think it's overkill if it adds kernel overhead on
> entry/exit.

None of these should add overhead.

>>> A better implementation, I think, is to put the tests for "you
>>> screwed up and synchronously entered the kernel" in
>>> the syscall_trace_enter() code, which TIF_NOHZ already
>>> gets us into;
>>
>> No, not unless you're planning on using that to distinguish syscalls
>> from other stuff *and* people think that's justified.
>
> So, the question is how we separate synchronous entries
> from IRQs? At a high level, IRQs are kernel bugs (for cpu-isolated
> tasks), and synchronous entries are application bugs. We'd
> like to deliver a signal for the latter, and do some kind of
> kernel diagnostics for the former. So we can't just add the
> test in the context tracking code, which doesn't actually know
> why we're entering or exiting.

Synchronous entries could be VM bugs, too.

> That's why I was thinking that the syscall_trace_entry and
> exception_enter paths were the best choices. I'm fairly sure
> that exception_enter is only done for synchronous traps,
> page faults, etc.

Maybe. Doing it through the actual entry/exit slow paths would be
overhead-free, although I'm not sure that IRQs have real slow paths
for entry.

> Certainly on the tile architecture we include the trap number
> in the pt_regs, so it's possible to just examine the pt_regs and
> know why you entered or are exiting the kernel, but I don't
> think we can rely on that for all architectures.

x86 can't do this.

> I'll put out a v2 of my patch that does both the things you
> advise against :-) just so we can have a strawman to think
> about how to do it better - unless you have a suggestion
> offhand as to how we can better differentiate sync and async
> entries into the kernel in a platform-independent way.
>
> I could imagine modifying user_exit() and exception_enter()
> to pass an identifier into the context system saying why they
> were changing contexts, so we could have syscalls, trap
> numbers, fault numbers, etc., and some way to query as
> to whether they were synchronous or asynchronous, and
> build this scheme on top of that, but I'm not sure the extra
> infrastructure is worthwhile.

I'll take a look.

Again, though, I think we really do need to distinguish at least MCE
and NMI (on x86) from the others.

>> What if we added a mode to perf where delivery of a sample
>> synchronously (or semi-synchronously by catching it on the next exit
>> to userspace) freezes the delivering task? It would be like debugger
>> support via perf.
>>
>> peterz, do you think this would be a sensible thing to add to perf?
>> It would only make sense for some types of events (tracepoints and
>> hw_breakpoints mostly, I think). I suspect it's reasonable
Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
[add peterz due to perf stuff]

On Mon, May 11, 2015 at 12:13 PM, Chris Metcalf wrote:
> On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
>>
>> On May 8, 2015 11:44 PM, "Chris Metcalf" wrote:
>>>
>>> With QUIESCE mode, the task is in principle guaranteed not to be
>>> interrupted by the kernel, but only if it behaves. In particular,
>>> if it enters the kernel via system call, page fault, or any of
>>> a number of other synchronous traps, it may be unexpectedly
>>> exposed to long latencies. Add a simple flag that puts the process
>>> into a state where any such kernel entry is fatal.
>>>
>>> To allow the state to be entered and exited, we add an internal
>>> bit to current->dataplane_flags that is set when prctl() sets the
>>> flags. That way, when we are exiting the kernel after calling
>>> prctl() to forbid future kernel exits, we don't get immediately
>>> killed.
>>
>> Is there any reason this can't already be addressed in userspace using
>> /proc/interrupts or perf_events? ISTM the real goal here is to detect
>> when we screw up and fail to avoid an interrupt, and killing the task
>> seems like overkill to me.
>
>
> Patch 6/6 proposes a mechanism to track down times when the
> kernel screws up and delivers an IRQ to a userspace-only task.
> Here, we're just trying to identify the times when an application
> screws itself up out of cluelessness, and provide a mechanism
> that allows the developer to easily figure out why and fix it.
>
> In particular, /proc/interrupts won't show syscalls or page faults,
> which are two easy ways applications can screw themselves
> when they think they're in userspace-only mode. Also, they don't
> provide sufficient precision to make it clear what part of the
> application caused the undesired kernel entry.

Perf does, though, complete with context.

>
> In this case, killing the task is appropriate, since that's exactly
> the semantics that have been asked for - it's like on architectures
> that don't natively support unaligned accesses, but fake it relatively
> slowly in the kernel, and in development you just say "give me a
> SIGBUS when that happens" and in production you might say
> "fix it up and let's try to keep going".

I think more control is needed. I also think that, if we go this
route, we should distinguish syscalls, synchronous non-syscall
entries, and asynchronous non-syscall entries. They're quite
different.

>
> You can argue that this is something that can be done by ftrace,
> but certainly you'd want to have a way to programmatically
> turn on ftrace at the moment when you're entering userspace-only
> mode, so we'd want some API around that anyway. And honestly,
> it's so easy to test a task state bit in a couple of places and
> generate the failure on the spot, vs. the relative complexity
> of setting up and understanding ftrace, that I think it merits
> inclusion on that basis alone.

perf_event, not ftrace.

>
>> Also, can we please stop further torturing the exit paths? We have a
>> disaster of assembly code that calls into syscall_trace_leave and
>> do_notify_resume. Those functions, in turn, *both* call user_enter
>> (WTF?), and on very brief inspection user_enter makes it into the nohz
>> code through multiple levels of indirection, which, with these
>> patches, has yet another conditionally enabled helper, which does this
>> new stuff. It's getting to be impossible to tell what happens when we
>> exit to user space any more.
>>
>> Also, I think your code is buggy. There's no particular guarantee
>> that user_enter is only called once between sys_prctl and the final
>> exit to user mode (see the above WTF), so you might spuriously kill
>> the process.
>
>
> This is a good point; I also find the x86 kernel entry and exit
> paths confusing, although I've reviewed them a bunch of times.
> The tile architecture paths are a little easier to understand.
>
> That said, I think the answer here is avoid non-idempotent
> actions in the dataplane code, such as clearing a syscall bit.
>
> A better implementation, I think, is to put the tests for "you
> screwed up and synchronously entered the kernel" in
> the syscall_trace_enter() code, which TIF_NOHZ already
> gets us into;

No, not unless you're planning on using that to distinguish syscalls
from other stuff *and* people think that's justified.

It's far too easy to just make a tiny change to the entry code. Add a
tiny trivial change here, a few lines of asm (that's you, audit!)
there, some weird written-in-asm scheduling code over here, and you
end up with the truly awful mess that we currently have.

If it really makes sense for this stuff to go with context tracking,
then fine, but we should *fix* the context tracking first rather than
kludging around it. I already have a prototype patch for the relevant
part of that.

> there, we can test if the dataplane "strict" bit is
> set and the syscall is not prctl(), then we generate the error.
> (We'd exclude exit and exit_group here too,
Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
On 05/09/2015 03:28 AM, Andy Lutomirski wrote:
> On May 8, 2015 11:44 PM, "Chris Metcalf" wrote:
>>
>> With QUIESCE mode, the task is in principle guaranteed not to be
>> interrupted by the kernel, but only if it behaves. In particular,
>> if it enters the kernel via system call, page fault, or any of
>> a number of other synchronous traps, it may be unexpectedly
>> exposed to long latencies. Add a simple flag that puts the process
>> into a state where any such kernel entry is fatal.
>>
>> To allow the state to be entered and exited, we add an internal
>> bit to current->dataplane_flags that is set when prctl() sets the
>> flags. That way, when we are exiting the kernel after calling
>> prctl() to forbid future kernel exits, we don't get immediately
>> killed.
>
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events? ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.

Patch 6/6 proposes a mechanism to track down times when the
kernel screws up and delivers an IRQ to a userspace-only task.
Here, we're just trying to identify the times when an application
screws itself up out of cluelessness, and provide a mechanism
that allows the developer to easily figure out why and fix it.

In particular, /proc/interrupts won't show syscalls or page faults,
which are two easy ways applications can screw themselves
when they think they're in userspace-only mode. Also, they don't
provide sufficient precision to make it clear what part of the
application caused the undesired kernel entry.

In this case, killing the task is appropriate, since that's exactly
the semantics that have been asked for - it's like on architectures
that don't natively support unaligned accesses, but fake it relatively
slowly in the kernel, and in development you just say "give me a
SIGBUS when that happens" and in production you might say
"fix it up and let's try to keep going".

You can argue that this is something that can be done by ftrace,
but certainly you'd want to have a way to programmatically
turn on ftrace at the moment when you're entering userspace-only
mode, so we'd want some API around that anyway. And honestly,
it's so easy to test a task state bit in a couple of places and
generate the failure on the spot, vs. the relative complexity
of setting up and understanding ftrace, that I think it merits
inclusion on that basis alone.

> Also, can we please stop further torturing the exit paths? We have a
> disaster of assembly code that calls into syscall_trace_leave and
> do_notify_resume. Those functions, in turn, *both* call user_enter
> (WTF?), and on very brief inspection user_enter makes it into the nohz
> code through multiple levels of indirection, which, with these
> patches, has yet another conditionally enabled helper, which does this
> new stuff. It's getting to be impossible to tell what happens when we
> exit to user space any more.
>
> Also, I think your code is buggy. There's no particular guarantee
> that user_enter is only called once between sys_prctl and the final
> exit to user mode (see the above WTF), so you might spuriously kill
> the process.

This is a good point; I also find the x86 kernel entry and exit
paths confusing, although I've reviewed them a bunch of times.
The tile architecture paths are a little easier to understand.

That said, I think the answer here is avoid non-idempotent
actions in the dataplane code, such as clearing a syscall bit.

A better implementation, I think, is to put the tests for "you
screwed up and synchronously entered the kernel" in
the syscall_trace_enter() code, which TIF_NOHZ already
gets us into; there, we can test if the dataplane "strict" bit is
set and the syscall is not prctl(), then we generate the error.
(We'd exclude exit and exit_group here too, since we don't
need to shoot down a task that's just trying to kill itself.)
This needs a bit of platform-specific code for each platform,
but that doesn't seem like too big a problem.

Likewise we can test in exception_enter() since that's only
called for all the synchronous user entries like page faults.

> Also, I think that most users will be quite surprised if "strict
> dataplane" code causes any machine check on the system to kill your
> dataplane task.

Fair point, and avoided by testing as described above instead.
(Though presumably in development it's not such a big deal,
and as I said you'd likely turn it off in production.)

> Similarly, a user accidentally running perf record -a
> probably should have some reasonable semantics.

Yes, also avoided by doing this as above, though I'd argue we could
also just say that running perf disables this mode. But it's not as
clean as the above suggestion.

On 05/09/2015 06:37 AM, Gilad Ben Yossef wrote:
> So, I don't know if it is a practical suggestion or not, but would it
> be better/easier to mark a pending signal on kernel entry for this case?
> The upside I see is that the user gets her notification
RE: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
> From: Andy Lutomirski [mailto:l...@amacapital.net]
> Sent: Saturday, May 09, 2015 10:29 AM
> To: Chris Metcalf
> Cc: Srivatsa S. Bhat; Paul E. McKenney; Frederic Weisbecker; Ingo Molnar;
> Rik van Riel; linux-...@vger.kernel.org; Andrew Morton; linux-
> ker...@vger.kernel.org; Thomas Gleixner; Tejun Heo; Peter Zijlstra; Steven
> Rostedt; Christoph Lameter; Gilad Ben Yossef; Linux API
> Subject: Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
>
> On May 8, 2015 11:44 PM, "Chris Metcalf" wrote:
> >
> > With QUIESCE mode, the task is in principle guaranteed not to be
> > interrupted by the kernel, but only if it behaves. In particular,
> > if it enters the kernel via system call, page fault, or any of
> > a number of other synchronous traps, it may be unexpectedly
> > exposed to long latencies. Add a simple flag that puts the process
> > into a state where any such kernel entry is fatal.
> >
> > To allow the state to be entered and exited, we add an internal
> > bit to current->dataplane_flags that is set when prctl() sets the
> > flags. That way, when we are exiting the kernel after calling
> > prctl() to forbid future kernel exits, we don't get immediately
> > killed.
>
> Is there any reason this can't already be addressed in userspace using
> /proc/interrupts or perf_events? ISTM the real goal here is to detect
> when we screw up and fail to avoid an interrupt, and killing the task
> seems like overkill to me.
>
> Also, can we please stop further torturing the exit paths?

So, I don't know if it is a practical suggestion or not, but would it
be better/easier to mark a pending signal on kernel entry for this case?

The upside I see is that the user gets her notification (killing the
task or just logging the event in a signal handler) and hopefully since
return to userspace with a pending signal is already handled we don't
need new code in the exit path?

Gilad
Re: [PATCH 5/6] nohz: support PR_DATAPLANE_STRICT mode
On May 8, 2015 11:44 PM, "Chris Metcalf" wrote:
>
> With QUIESCE mode, the task is in principle guaranteed not to be
> interrupted by the kernel, but only if it behaves. In particular,
> if it enters the kernel via system call, page fault, or any of
> a number of other synchronous traps, it may be unexpectedly
> exposed to long latencies. Add a simple flag that puts the process
> into a state where any such kernel entry is fatal.
>
> To allow the state to be entered and exited, we add an internal
> bit to current->dataplane_flags that is set when prctl() sets the
> flags. That way, when we are exiting the kernel after calling
> prctl() to forbid future kernel exits, we don't get immediately
> killed.

Is there any reason this can't already be addressed in userspace using
/proc/interrupts or perf_events? ISTM the real goal here is to detect
when we screw up and fail to avoid an interrupt, and killing the task
seems like overkill to me.

Also, can we please stop further torturing the exit paths? We have a
disaster of assembly code that calls into syscall_trace_leave and
do_notify_resume. Those functions, in turn, *both* call user_enter
(WTF?), and on very brief inspection user_enter makes it into the nohz
code through multiple levels of indirection, which, with these
patches, has yet another conditionally enabled helper, which does this
new stuff. It's getting to be impossible to tell what happens when we
exit to user space any more.

Also, I think your code is buggy. There's no particular guarantee
that user_enter is only called once between sys_prctl and the final
exit to user mode (see the above WTF), so you might spuriously kill
the process.

Also, I think that most users will be quite surprised if "strict
dataplane" code causes any machine check on the system to kill your
dataplane task. Similarly, a user accidentally running perf record -a
probably should have some reasonable semantics. /proc/interrupts gets
that right as is. Sure, MCEs will hurt your RT performance, but Intel
screwed up the way that MCEs work, so we should make do.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
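As a rough illustration of the /proc/interrupts approach mentioned above, a task could snapshot the file before and after a run and compare the totals for its own CPU column. The helper below is a sketch only: sum_irqs_for_cpu() is a made-up name, it takes the file contents as a string, and it assumes the usual layout (a header line of "CPU<n>" titles, then "label: count count ... description" rows).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-row counts for one CPU column in text laid out like
 * /proc/interrupts. Returns -1 if `cpu` is not a column in the header.
 * Rows with fewer counts than `cpu + 1` (e.g. "ERR:") are skipped.
 */
static long long sum_irqs_for_cpu(const char *text, int cpu)
{
    const char *line = text;
    const char *eol = strchr(line, '\n');
    const char *h = strstr(line, "CPU");
    unsigned long long total = 0;
    int ncpus = 0;

    /* Header line: count the "CPU<n>" column titles. */
    while (h && (!eol || h < eol)) {
        ncpus++;
        h = strstr(h + 3, "CPU");
    }
    if (cpu < 0 || cpu >= ncpus)
        return -1;

    while (eol) {
        const char *p;
        unsigned long long count = 0;
        int col;

        line = eol + 1;
        eol = strchr(line, '\n');
        p = strchr(line, ':');          /* end of the row label */
        if (!p || (eol && p > eol))
            continue;                   /* no label on this line */
        p++;

        /* Walk the count columns up to and including `cpu`. */
        for (col = 0; col <= cpu; col++) {
            char *end;

            while (*p == ' ' || *p == '\t')
                p++;
            if (!isdigit((unsigned char)*p))
                break;                  /* row has fewer columns */
            count = strtoull(p, &end, 10);
            p = end;
        }
        if (col > cpu)
            total += count;
    }
    return (long long)total;
}
```

A difference between two snapshots for the dataplane CPU would indicate that some interrupt was delivered there. It says nothing about syscalls or page faults, which is the limitation raised elsewhere in this thread.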