On Tue, 19 Dec 2017 09:52:48 +0900 Sergey Senozhatsky <sergey.senozhatsky.w...@gmail.com> wrote:
> > The case here, you are talking about a CPU doing console_lock() from a > > non printk() case. Which is what I was asking about how often this > > happens. > > I'd say often enough. but the point I was trying to make is that we can > have non-atomic CPUs which can do the print out, instead of "sharing the > load" between atomic CPUs. We don't even know if sharing between "atomic" and "non-atomic" is an issue. Anything that does a printk() in an atomic location, is going to have latency to begin with. > > > As for why there's no handoff. Does the non printk() > > console_lock/unlock ever happen from a critical location? I don't think > > it does (but I haven't checked). Then it is the perfect candidate to do > > all the printing. > > that's right. that is the point I was trying making. we can have better > candidates to do all the printing. Sure, but we don't even know if we have to. A problem scenario hasn't come up that wasn't due to the current implementation (which my patch changes). > I did tests yesterday, traces are available. I can't conclude that > the patch fixes the unfairness of printk(). It doesn't fix the "unfairness" it fixes the unboundedness of printk. That is what has been triggering all the issues from before. > consider the following case > > we have console_lock() from non-atomic context. console_sem owner is > getting preempted, under console_sem. which is totally possible and > happens a lot. in the mean time we have OOM, which can print a lot of > info. by the time console_sem returns back to TASK_RUNNING logbuf > contains some number of pending messages [lets say 10 seconds worth > of printing]. console owner goes to console_unlock(). accidentally > we have printk from IRQ on CPUz. console_owner hands over printing > duty to CPUz. so now we have to print 10 seconds worth of OOM messages > from irq. Yes that can happen. But printk's from irq context is not nice to have either, and should only happen when things are going wrong to begin with. > > > > CPU0 CPU1 ~ CPUx CPUz > > console_lock > > << preempted >> > > > OOM OOM printouts, lots > of OOM traces, etc. > > OOM end [progress done]. > > << back to RUNNING >> > > console_unlock() > > for (;;) > sets console_owner > call_console_drivers() IRQ > printk > sees console_owner > sets console_waiter > > clears console_owner > sees console_waier > handoff > for (;;) { > > call_console_drivers() > ??? lockup > } > up() > > this is how we down() from non-atomic and up() from atomic [if we make > it to up(). we might end up in NMI panic]. this scenario is totally possible, The printk buffer needs to be very big, and bad things have to happen first. This is a theoretical scenario, and I'd like to see it happen in the real world before we try to fix it. My patch should make printk behave *MUCH BETTER* than it currently does. If you are worried about NMI panics, then we could add a touch nmi within the printk loop. > isn't it? the optimistic expectation here is that some other printk() from > non-atomic CPU will jump in and take over printing from atomic CPUz. but I > don't see why we are counting on it. I don't see why we even care. Placing a printk in an atomic context is a problem to begin with, and should only happen if there's issues in the system. -- Steve