Ethan Solomita wrote:
>
> Jim Houston wrote:
> >
> > The real problem is the non-maskable part of the non-
> > maskable IPI.  When there are multiple processors hitting
> > breakpoints at the same time, you never know how much of
> > the initial entry code the slave processor got to execute
> > before it was hit by the NMI.
>
> This is what I explicitly considered fixed by these changes.  In kdb(),
> line 1342 is the beginning of the code where kdb_initial_cpu is grabbed.
> After this block, you either are the kdb_initial_cpu, or you entered kdb
> because of the IPI.  So the future slave processor could not have gotten
> past this if () clause before it was hit by the NMI.
>
> Looking back before this, there are very few lines of code that examine
> global state, and none that modify global state.  The few references to
> KDB_STATE before line 1342 can, I believe, all be justified.  Either the
> code knows that it is kdb_initial_cpu, or it is DOING_SS, in which case
> we cannot have received an IPI from KDB, or it is HOLD_CPU.  HOLD_CPU is
> used to generate "reentry", and I'm not sure why, but it seems harmless.
>
> Can you suggest a code path through kdb() which could lead to harm for
> a CPU which hits a breakpoint, fails to win the race for
> kdb_initial_cpu, and gets an IPI?
>
> > I have a couple of ideas in the works.  First, I wonder about
> > having the kdb_ipi() check if it has interrupted a
> > breakpoint entry.  If it has, it could just set a flag and
> > return.  I might do this with a stack traceback or by
> > setting a flag early in the breakpoint handling (e.g. entry.S).
>
> I don't see how this helps -- whoever won the race for kdb_initial_cpu
> is expecting all the CPUs to gather up and enter kdb.  I would expect
> that everyone who hits a breakpoint should enter kdb.
>
> > Ethan, I'm curious if you're using an NMI on the Sparc.
>
> Sparc doesn't have an NMI, but the interrupt I use (an IPI) is rarely
> blocked in the kernel.  Certainly not blocked by local_irq_save() and
> family.
> -- Ethan
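As a minimal sketch of the entry race Ethan describes above (not the
real kdb() code around line 1342): each CPU that takes a breakpoint
checks kdb_initial_cpu under a spinlock; exactly one claims it, and
every other CPU -- whether it lost the race or was brought in by the
winner's NMI IPI -- takes the slave path.  The lock and the helper
names here are assumptions for illustration only.

	#include <linux/spinlock.h>
	#include <linux/smp.h>

	static spinlock_t kdb_lock = SPIN_LOCK_UNLOCKED;
	static int kdb_initial_cpu = -1;	/* -1: kdb not active */

	extern void kdb_ipi_other_cpus(void);	/* hypothetical helpers */
	extern void kdb_main_loop(void);
	extern void kdb_slave_loop(void);

	void kdb_bp_entry(void)
	{
		int cpu = smp_processor_id();
		int won = 0;

		/* Only one CPU can claim kdb_initial_cpu. */
		spin_lock(&kdb_lock);
		if (kdb_initial_cpu == -1) {
			kdb_initial_cpu = cpu;
			won = 1;
		}
		spin_unlock(&kdb_lock);

		if (won) {
			kdb_ipi_other_cpus();	/* gather the rest */
			kdb_main_loop();	/* debugger proper */
			kdb_initial_cpu = -1;	/* release on exit */
		} else {
			/*
			 * Lost the race, or got here via the IPI:
			 * spin as a slave until the initial CPU
			 * releases us.
			 */
			kdb_slave_loop();
		}
	}

Jim's point is that the NMI can land anywhere before the claim above,
so a slave may have executed an unknown prefix of this path; Ethan's
point is that nothing before the claim modifies global state, so being
re-entered via the IPI is harmless.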
Hi Ethan,

I have been in hack mode, and I probably have some self-inflicted
problems.  Your analysis seems correct, but I still had problems with
the combination of your patch plus the version of kdba_bp.c that I sent
out on Friday.  I did not mean to impugn your patch and apologize if I
have.

The initial enthusiasm wore off once I started putting breakpoints at
places like do_schedule or sys_open.  More often than not, it hung.  I
also ran into the panic when processing breakpoints that have been
removed.  They are described in the comment before kdba_db_trap().  It
still hung even when doing bd instead of bc.

I went on to experiment with splitting kdb_state into separate
variables for per-cpu-private state vs. inter-cpu synchronization.  I
was hoping that I could simplify the problem by eliminating the
interactions between most of the flags.  I was worried about
interactions between processors leaving kdb and new arrivals.

Regarding NMI racing with normal breakpoints - I want to solve a larger
problem.  If I can avoid the extra layer of nesting, I will solve the
deleted-breakpoint problem.  It seems ugly to switch to the other cpu,
do a stack trace, and see part of kdb rather than what that cpu was
doing.  I would also like to switch to the other cpu and then single
step.  I also worry about what happens if the NMI interrupts the
spinlock which protects kdb_initial_cpu.

I have some changes maybe 50% done.  I'm using a flag set in entry.S to
detect that the NMI has interrupted the breakpoint entry.  Hopefully I
will have something useful in another day or so.

Jim Houston - Concurrent Computer Corp.
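A rough sketch (not Jim's actual patch) of the flag scheme he
describes: the breakpoint path sets a per-cpu flag before touching any
shared kdb state -- in the real patch this is done in entry.S; it is
shown in C here -- and kdb_ipi() checks it.  If the NMI IPI landed
inside a breakpoint entry, the handler only records that fact and
returns, avoiding the extra layer of nesting.  The variable and helper
names are assumptions for illustration only.

	#include <linux/threads.h>
	#include <linux/smp.h>

	static volatile int kdb_in_bp_entry[NR_CPUS];
	static volatile int kdb_ipi_pending[NR_CPUS];

	extern void kdb_slave_loop(void);	/* hypothetical slave entry */

	/* Called as early as possible in the breakpoint entry path
	 * (entry.S in the real patch). */
	void kdb_mark_bp_entry(void)
	{
		kdb_in_bp_entry[smp_processor_id()] = 1;
	}

	/* The NMI IPI handler. */
	void kdb_ipi(void)
	{
		int cpu = smp_processor_id();

		if (kdb_in_bp_entry[cpu]) {
			/*
			 * We interrupted a breakpoint entry on this
			 * cpu: defer instead of nesting a second kdb
			 * entry on top of the first.
			 */
			kdb_ipi_pending[cpu] = 1;
			return;
		}
		kdb_slave_loop();
	}

Once the interrupted breakpoint path reaches a safe point, it would
check kdb_ipi_pending[] itself and enter the slave loop with a clean
stack, so a stack trace from another cpu shows what that cpu was doing
rather than a partial kdb entry.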
