On Tue, Apr 01, 2003 at 06:19:16PM +1000, Keith Owens wrote:
> On Tue, 1 Apr 2003 13:09:15 +0530, 
> "S Vamsikrishna" <[EMAIL PROTECTED]> wrote:
> >As far as I can tell, in kdb v4.0, you call kdb_save_running() /
> >kdb_unsave_running() around kdb_main_loop() in kdba_main_loop(). We will
> >execute kdba_main_loop() on all processors only if the KDB IPI (NMI class)
> >to stop other processors works.
> >
> >Current lkcd does something similar: it sends an NMI class IPI to capture
> >register state of other processors. I agree that timing out the IPI is a
> >bug (I thought I removed it; maybe I never submitted the patch for
> >inclusion), but this method will capture the state of all CPUs as long as
> >the IPI goes to all CPUs.
> >
> >So, how is the push model any better, when the push itself relies on the
> >IPI to be delivered and handled on all cpus? Did I miss something? Can you
> >please explain?
> 
> What you see in kdb v4.0 is only part of the change, I am working on
> the next bit for kdb v4.1.  IPI will not be the only mechanism for
> getting the attention of the other cpus, there are also flag tests in
> common code paths to detect that a debug event has occurred, even if
> the IPI does not get through.  kdb is not relying on a single method
> for flagging a debug event, it uses a graduated set of attempts to get
> data from the other cpus.  This is an additional method of getting
> other cpus into kdb, it does not replace the use of interrupts.  The
> kdb logic will be :-
> 
> if (safe_to_send_interrupts) {
>   send_kdb_ipi(normal_interrupt);
>   if (!all_cpus_in_kdb && arch_supports_nmi)
>     send_kdb_ipi(nmi);
>   if (!all_cpus_in_kdb && arch_supports_pmi)
>     send_kdb_ipi(pmi);
> }
> if (!all_cpus_in_kdb)
>   kdb_enter_debugger = 1;       // flag to trip spin_locks and scheduler into kdb

This makes a lot of sense to me. If you test for this flag and save state in
some critical routines (say, in a spinlock's spin loop), we can be sure to get
the state of all CPUs. Good.

> BTW, timing out the lkcd IPI introduces its own race condition.  I have
> seen this error several times on IA64 :-
> 
> * The lkcd IPI did not get through because interrupts were masked on a
>   target cpu (remember that IA64 masks the NMI).  The IPI was pending
>   but could not be delivered to the OS because the cpu was in a
>   disabled spinloop on a lock that lkcd had obtained.
> * The IPI timed out and was "cancelled".
> * lkcd ended its processing, released the lock.
> * The cpu that was spinning got its lock, re-enabled interrupts, the
>   pending IPI was now delivered, but the IPI data pointer was NULL
>   ("cancelled").
> * Result is an oops during lkcd shutdown.
> 
> 
Of course. I have seen this too. In the absence of any mechanism to
cancel all pending IPIs, timing out and returning from smp_send_ipi
is wrong and should never have been done. We have had patches to
remove the timeout logic altogether: send the IPI to capture
state and, if it fails on some CPUs, just accept the fact that we don't
have the state of all CPUs and move on.

Capturing the state in spin loops when a flag (say, kdb_enter_debugger)
is set seems like a good idea. There doesn't seem to be much point
in doing it elsewhere: if a CPU is not in a tight spin loop,
it would have got the (NMI class) IPI. But adding a flag test
and a call to save register state within the spinlock code may
be considered 'bloat' and unacceptable by many people :-(.

Thanks for clarifying this,
Vamsi.

-- 
Vamsi Krishna S.
Linux Technology Center,
IBM Software Lab, Bangalore.
Ph: +91 80 5044959
Internet: [EMAIL PROTECTED]
