On Tue, 9 Jan 2018 14:53:56 -0800
Tejun Heo <t...@kernel.org> wrote:

> Hello, Steven.
> 
> On Tue, Jan 09, 2018 at 05:47:50PM -0500, Steven Rostedt wrote:
> > > Maybe it can break out eventually but that can take a really long
> > > time.  It's OOM.  Most of userland is waiting for reclaim.  There
> > > isn't all that much going on outside that and there can only be one
> > > CPU which is OOMing.  The kernel isn't gonna be all that chatty.  
> > 
> > Are you saying that the OOM is stuck printing over and over on a single
> > CPU. Perhaps we should fix THAT.  
> 
> I'm not sure what you meant but OOM code isn't doing anything bad

My point is that your test is only hammering a single CPU. You say
this is the scenario you see, which means that the OOM code is
printing more than it should: once it has printed the report for a
process, it should not print it again for the same process, let alone
loop printing it over and over on a single CPU. That would be a bug
in the implementation.
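
To illustrate the kind of guard I mean (just a sketch, not the actual
mm/oom_kill.c code; oom_report_victim() and last_victim are made up
for illustration), something as simple as remembering the last victim
would stop the repeat dumps:

	#include <linux/sched.h>
	#include <linux/printk.h>

	static struct task_struct *last_victim;

	/*
	 * Sketch only: the OOM path is already serialized, so a
	 * plain static is enough to remember what we last dumped.
	 */
	static void oom_report_victim(struct task_struct *p)
	{
		if (p == last_victim)
			return;		/* already reported this victim */
		last_victim = p;

		pr_err("Out of memory: killing process %d (%s)\n",
		       task_pid_nr(p), p->comm);
		/* the big one-time memory-state dump would go here */
	}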

> other than excluding others from doing OOM kills simultaneously, which
> is what we want, and printing a lot of messages and then gets caught
> up in a positive feedback loop.
> 
> To me, the whole point of this effort is preventing printk messages
> from causing significant or critical disruptions to overall system
> operation.

I agree, and my patch helps with this tremendously, provided we are
not doing something stupid like calling printk thousands of times in
an interrupt handler, over and over on a single CPU.
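
For the interrupt-handler case, the sane pattern is to count in the
handler and print from process context; a minimal sketch (all names
here are made up) looks like:

	#include <linux/interrupt.h>
	#include <linux/workqueue.h>
	#include <linux/atomic.h>
	#include <linux/printk.h>

	static atomic_long_t event_count = ATOMIC_LONG_INIT(0);

	static void report_work_fn(struct work_struct *work)
	{
		/* process context: one summary line, not thousands */
		pr_info("saw %ld events since last report\n",
			atomic_long_xchg(&event_count, 0));
	}
	static DECLARE_WORK(report_work, report_work_fn);

	static irqreturn_t my_irq_handler(int irq, void *dev)
	{
		atomic_long_inc(&event_count);	/* no printk here */
		schedule_work(&report_work);
		return IRQ_HANDLED;
	}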

>  IOW, it's rather dumb if the machine goes down because
> somebody printk'd wrong or just failed to foresee the combinations of
> events which could lead to such conditions.

I'd still like to see a trace of a real situation.

> 
> It's not like we don't know how to fix this either.

But we don't want the fix to introduce regressions, and offloading
printk does. Heck, the current fixes to printk have caused issues for
me in my own debugging: we can no longer do large dumps of printk from
NMI context, which I used to do when detecting a lockup and then doing
a task list dump of all tasks, or even a ftrace_dump_on_oops.

http://lkml.kernel.org/r/20180109162019.gl3...@hirez.programming.kicks-ass.net
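
For reference, the kind of dump I mean is along these lines (a sketch
of a lockup-detector debug path, not actual watchdog code); every
sched_show_task() call emits a pile of printk lines, which is exactly
what no longer fits through the small per-CPU NMI-safe buffers:

	#include <linux/sched/signal.h>
	#include <linux/sched/debug.h>
	#include <linux/rcupdate.h>

	static void dump_all_tasks(void)
	{
		struct task_struct *g, *p;

		rcu_read_lock();
		for_each_process_thread(g, p)
			sched_show_task(p);	/* state + stack per task */
		rcu_read_unlock();
	}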


-- Steve
