On Monday 24 March 2008 07:18:51 pm Jason Wessel wrote:
> Amit S. Kale wrote:
> > On Monday 17 March 2008 11:55:28 pm Konstantin Baydarov wrote:
> >> Problem:
> >> Sometimes (after remote gdb has connected) an x86 SMP kernel (with KGDB
> >> and NMI watchdog enabled) hangs when kernel modules are automatically
> >> loaded.
> >
> > Konstantin,
> >
> > The description below doesn't mention how module loading comes into
> > picture.
>
> I too have observed this problem as well as hangs in the stress test
> where you ask each cpu to execute the same system call over and over
> (via a user space program) and you set a kernel breakpoint there.
>
> Specifically the problem Konstantin is referring to is when you attach a
> debugger, continue and then a number of kernel module loads are executed
> as a part of the whole user space startup or initrd startup.  Then a
> kernel-module-aware debugger will stop, load symbols and automatically
> continue on each kernel module load event.

I was wondering how module load is different from a regular 
breakpoint-singlestep.

>
> >> Root Cause:
> >>   Slave CPU hangs in kgdb_wait() when master CPU leaves KGDB, causing
> >> the whole system to hang.
> >>   If a watchdog NMI occurs when the Slave CPU has already exited
> >> kgdb_wait() and the Master CPU hasn't yet unset debugger_active,
> >
> > A watchdog NMI can't occur until the kgdb_wait function returns control
> > to kgdb_nmihook, which returns to kgdb_notify, which in turn returns
> > through the notifier chain call to do_nmi and then to entry.S, where
> > an iret is executed. (NMIs are disabled until the iret is executed.)
>
> The issue here is that there is a window where the slavecpu is unlocked
> with kgdb_spin_unlock(&slavecpulocks[i]).  After that there is a window
> where the slave cpu will spin up again and start taking NMI events based
> on how often the APIC timer is set to fire.  Even if you remove the
> msleep() it doesn't remove the window entirely and you can still have a
> processor re-enter the kgdb_wait() before debugger_active is zeroed out.

Agreed. There is definitely a window where this can happen. However, given
the removal of msleep(), I have doubts about how it would be hit on x86.
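
To make the window concrete, here is a minimal user-space sketch in C.
It is a simplified model and not the kgdb source: struct kgdb_model,
enum master_step, master_advance() and slave_nmi_reenters_wait() are
hypothetical names made up for illustration. Only debugger_active,
slavecpulocks and kgdb_wait() come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kgdb state discussed in this thread. */
struct kgdb_model {
    int  debugger_active;  /* nonzero while the master CPU is inside KGDB */
    bool slave_unlocked;   /* models kgdb_spin_unlock(&slavecpulocks[i]) done */
};

/* Steps of the master's exit path, in the order described above. */
enum master_step {
    MASTER_IN_KGDB,          /* slaves still held, debugger_active set */
    MASTER_UNLOCKED_SLAVES,  /* slaves released, debugger_active still set */
    MASTER_CLEARED_ACTIVE    /* debugger_active zeroed, exit complete */
};

/* Put the model into the state the master leaves behind at a given step. */
static void master_advance(struct kgdb_model *m, enum master_step to)
{
    assert(m != 0);
    m->debugger_active = (to != MASTER_CLEARED_ACTIVE);
    m->slave_unlocked  = (to != MASTER_IN_KGDB);
}

/*
 * Would an NMI arriving on a slave CPU right now send it back into
 * kgdb_wait()?  Only a slave that has already been released can take a
 * fresh NMI, and it re-enters while debugger_active is still set.
 */
static bool slave_nmi_reenters_wait(const struct kgdb_model *m)
{
    return m->slave_unlocked && m->debugger_active != 0;
}
```

Stepping the model through MASTER_IN_KGDB, MASTER_UNLOCKED_SLAVES and
MASTER_CLEARED_ACTIVE shows that only the middle state lets a slave NMI
re-enter the wait. The model says nothing about how long that state
lasts, which is the part the msleep() removal changes.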

>
> > Compare this to what the master CPU does: the master CPU just has to
> > unlock all slave locks and then immediately set debugger_active to 0.
> > (The only exception to this is when debugger_step is set. More about
> > this below).
> >
> > The latter can be executed much quicker than the former, and while in
> > theory the former can execute before the latter, it can't happen in a
> > real-life situation.
> >
> > There is a delay of mdelay(2) when debugger_step is set and the master
> > debugger lets other CPUs run (kgdb_contthread == 0), which could
> > potentially trigger this race. Could you please confirm whether either
> > of the following two solves your problem?
> >
> > 1. Comment out these lines from kgdb_handle_exception
> > +   if (debugger_step)
> > +           mdelay(2);
> >
> >
> > 2. Execute this command from gdb as soon as it is started "set
> > scheduler-locking on"
> >
> >> How Solved:
> >>   A new atomic variable, debugger_exiting, was added. It is set when
> >> the Master CPU starts waiting for the Slave CPUs, and is reset after
> >> debugger_active is set to zero. debugger_exiting is checked in
> >> kgdb_notify(), and kgdb_nmihook won't be called until debugger_exiting
> >> equals zero. So debugger_exiting guarantees that a Slave CPU won't
> >> reenter kgdb_wait() until the Master CPU has completely left KGDB.
> >> Patch against kernel 2.6.24.3.
> >
> > I would strongly recommend adding any more locking variables. As it is
> > we've sufficient difficulty analyzing races :-).
> > -Amit
>
> I assume you meant to recommend against adding any more locking variables.

Yes. I am generally against adding new locking variables since we already
have enough of them. We haven't defined a good hierarchy for them
(resulting in false alarms from spinlock lock detection).

>
> Something that serves the same purpose as this particular variable is in
> fact needed.  I created a patch to fix the same problem ~6 months ago
> (the new variable was called kgdb_resuming in my case), but the patch is
> even uglier in that I also added controls to change the behavior of the
> single stepping so as to allow another processor to hit a breakpoint
> while single stepping a different processor.
>
> In the last 1.5 months the kgdb core was significantly changed, and a
> kgdb test suite was added to test for some of these architecture
> specific issues.  It appears that the test case cannot be hit very often
> because one of the commits removed the msleep(), which definitely
> reduces the window of opportunity.
>
> In short, this is definitely a real problem and with the msleep() the
> window is large enough that it gets hit reasonably easily.  I plan to
> split the 2007 single_step / kgdb resuming patch to cover just the
> resuming case and I will test it on the new kgdb core.

It appears to solve the problem described above.
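
For reference, the gating that debugger_exiting (or a flag like
kgdb_resuming) is meant to provide can be sketched in user space with
C11 atomics. This is an illustration under stated assumptions, not the
patch itself: master_begin_exit(), master_finish_exit() and
nmi_should_enter_kgdb() are hypothetical names, and whether the real
notifier skips the hook or spins until the flag clears is not shown in
this thread. The sketch simply declines to enter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int debugger_active;   /* nonzero while the master is in KGDB */
static atomic_int debugger_exiting;  /* nonzero while the master is leaving */

/* Master side: raise the flag before it starts waiting for the slaves. */
static void master_begin_exit(void)
{
    atomic_store(&debugger_exiting, 1);
}

/* Master side: clear debugger_active first, then reopen the NMI path. */
static void master_finish_exit(void)
{
    assert(atomic_load(&debugger_exiting) != 0); /* begin_exit must come first */
    atomic_store(&debugger_active, 0);
    atomic_store(&debugger_exiting, 0);
}

/*
 * NMI notifier side (models the check described for kgdb_notify()):
 * decide whether the modelled kgdb_nmihook would be called.
 */
static bool nmi_should_enter_kgdb(void)
{
    if (atomic_load(&debugger_exiting) != 0)
        return false;                 /* master is on its way out: stay out */
    return atomic_load(&debugger_active) != 0;
}
```

With debugger_active set, nmi_should_enter_kgdb() returns true until
master_begin_exit() runs, and false from then on, including after
master_finish_exit(). The cost is one more shared variable whose
ordering against the existing ones has to be reasoned about.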

Does it make sure that we don't miss any NMI watchdog events on the master
processor and cause them to be routed to the default panic handler? [I
haven't thought this through, as I don't have current kgdb source quickly
at hand.]

-Amit

>
> As a side point, I backported the new kgdb core to 2.6.24, which I can
> make available in cvs/git.  I am wondering if Sergei and/or Konstantin
> would be willing to aid in getting it to work with some of these
> oddball kgdb-specific RS232 drivers?  It is not possible for me to truly
> test them because I don't have any of the boards.
>
> Jason.



_______________________________________________
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport