On Tue, 1 Apr 2003 13:09:15 +0530,
"S Vamsikrishna" <[EMAIL PROTECTED]> wrote:
>As far as I can tell, in kdb v4.0, you call kdb_save_running() /
>kdb_unsave_running() around kdb_main_loop() in kdba_main_loop(). We will
>execute kdba_main_loop() on all processors only if the KDB IPI (NMI class)
>to stop other processors works.
>
>Current lkcd does something similiar: it sends an NMI class IPI to capture
>register state of other processors. I agree that timing out the IPI is a
>bug (I thought I removed it, may be never submitted the patch for
>inclusion), but this method will capture the state of all CPUs as long as
>the IPI goes to all CPUs.
>
>So, how is the push model any better, when the push itself relies on the
>IPI to be delivered and handled on all cpus? Did I miss something? Can you
>please explain?
What you see in kdb v4.0 is only part of the change, I am working on
the next bit for kdb v4.1. IPI will not be the only mechanism for
getting the attention of the other cpus, there are also flag tests in
common code paths to detect that a debug event has occurred, even if
the IPI does not get through. kdb is not relying on a single method
for flagging a debug event, it uses a graduated set of attempts to get
data from the other cpus. This is an additional method of getting
other cpus into kdb, it does not replace the use of interrupts. The
kdb logic will be :-
if (safe_to_send_interrupts) {
send_kdb_ipi(normal_interrupt);
if (!all_cpus_in_kdb && arch_supports_nmi)
send_kdb_ipi(nmi);
if (!all_cpus_in_kdb && arch_supports_pmi)
send_kdb_ipi(pmi);
}
if (!all_cpus_in_kdb)
kdb_enter_debugger = 1; // flag to trip spin_locks and scheduler into kdb
if (!all_cpus_in_kdb && arch_supports_init &&
(user_requested_destructive_ipi || destructive_kernel_error_detected))
send_kdb_ipi(init);
if (!all_cpus_in_kdb)
kdb_printf("not all cpus entered the debugger");
On IA64 there are classes of interrupt even more important than NMI,
those interrupts are triggered by the PROM code. But the PROM requires
that the cpus save their state and reenter PROM, which makes it
impossible for the controlling cpu to ask the other cpus to save their
state later. IOW, relying on the controlling cpu getting a response
from all cpus (pull model) does not work. kdb v4.0 saves state on the
failing cpus as they go through the interrupt handlers, the state is
then available for the controlling cpu to read at its leisure (push
model).
BTW, timing out the lkcd IPI introduces its own race condition. I have
seen this error several times on IA64 :-
* The lkcd IPI did not get through because interrupts were masked on a
target cpu (remember that IA64 masks the NMI). The IPI was pending
but could not be delivered to the OS because the cpu was in a
disabled spinloop on a lock that lkcd had obtained.
* The IPI timed out and was "cancelled".
* lkcd ended its processing, released the lock.
* The cpu that was spinning got its lock, re-enabled interrupts, the
pending IPI was now delivered, but the IPI data pointer was NULL
("cancelled").
* Result is an oops during lkcd shutdown.