On Thu, Nov 09, 2006 at 03:39:17PM -0500, Ben Romer wrote:
> For the last several weeks I've been trying to get to the root of a
> kexec clock problem we've been seeing on our ES7000/ONE systems. I've
> been down various routes, verifying that the system is going into
> virtual wire mode correctly and trying to turn the PM clock back on
> before rebooting, but I have been unable to get a kexec dump without
> passing in a loops-per-jiffy value on the kexec command line. The clock
> seems to re-activate once the APIC is initialized, but before that I get
> nothing at all. Without the lpj= value, the kexec'd kernel hangs while
> trying to compute lpj.
> 

So timer interrupts are not arriving in the second kernel, hence jiffies
does not get updated and you hang in calibrate_delay()?
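
To make the suspected hang concrete, here is a rough model of the
"wait for the start of a clock tick" step that lpj calibration depends
on. It is only a sketch: FakeClock, wait_for_tick and max_spins are
hypothetical names, and the real kernel loop has no spin limit, which is
why it never returns when no timer interrupt advances jiffies.

```python
class FakeClock:
    """Hypothetical stand-in for the kernel's jiffies counter."""

    def __init__(self, reads_per_tick):
        # reads_per_tick == 0 models a timer interrupt that never fires.
        self.reads_per_tick = reads_per_tick
        self.reads = 0

    def jiffies(self):
        self.reads += 1
        if self.reads_per_tick == 0:
            return 0  # jiffies is never updated
        return self.reads // self.reads_per_tick


def wait_for_tick(clock, max_spins=100_000):
    """Spin until jiffies changes.

    Returns True if a tick was seen, False if max_spins was exhausted.
    The spin limit is an assumption added so the sketch terminates.
    """
    start = clock.jiffies()
    for _ in range(max_spins):
        if clock.jiffies() != start:
            return True
    return False
```

With FakeClock(10) the wait returns after a few reads; with FakeClock(0)
it only gives up because of the artificial limit.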

Is it in any way related to the boot CPU? If you boot your first kernel
with only one processor, do you still see the issue? Does kexec work on
this machine? If kexec works then it is probably more of a software
setup issue.

Does it work if the second kernel is passed the command line option
"nolapic"?

Sorry, I am not very well versed in timers, hence a stupid question.
What is a PM timer? Can it drive the timer interrupt the way an
8253/8254 chip or an HPET can? If yes, how is the interrupt routed to
the CPU on your machine?

Any idea how the BIOS initially sets up the LAPIC/IOAPIC on your system
to deliver the timer interrupt to the CPU? Is it done directly through
the LAPIC or routed through the IOAPIC?

Have you been able to verify whether, after kdump, the timer interrupts
are not being generated at all, or whether they are generated but, due
to some routing issue, not delivered to the CPU? I think booting with
apic=verbose and using print_local_APIC() to print the state of the
local APIC and IOAPIC might be of some help.

> So, I have two questions I'd like to throw out here for discussion:
> 
> First, what do you think of adding a command-line parameter to the kexec
> program, that would grab an lpj value from the currently running kernel
> and append it to the kexec kernel's command line? As it stands now,
> customers who want to run SLES10 with kexec-dumps on our systems have to
> manually find the lpj value using dmesg, and customize their command
> line to fit, which could be a problem if they decide to upgrade CPUs and
> forget to update their kexec command line. I'm sure there are other
> platforms that might benefit from automating this, and I'm willing to
> write and submit the patch myself.
> 


- Most likely it is a software issue somewhere in the setup of the
  LAPIC/IOAPIC/timer chip etc. In that case, IMHO, we should fix the
  issue instead of adding a workaround. If it boils down to some
  hardware limitation then of course we don't have a way out.

- How would you find the lpj value in kexec? Dig it out of dmesg?
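
For the sake of discussion, digging it out of dmesg could look roughly
like the sketch below. It assumes the calibration message carries the
value as "(lpj=N)", which is how the kernel prints it next to the
BogoMIPS figure. find_lpj and append_lpj are hypothetical helper names,
not anything in kexec-tools, and the sample line in the usage note is
illustrative only.

```python
import re

# The calibration message ends with "... BogoMIPS (lpj=N)".
LPJ_RE = re.compile(r"\(lpj=(\d+)\)")


def find_lpj(dmesg_text):
    """Return the last lpj value found in dmesg text, or None."""
    matches = LPJ_RE.findall(dmesg_text)
    return int(matches[-1]) if matches else None


def append_lpj(cmdline, lpj):
    """Append lpj=N unless none was found or the user already set one."""
    if lpj is None or re.search(r"(^|\s)lpj=", cmdline):
        return cmdline
    return f"{cmdline} lpj={lpj}".strip()
```

Usage would be along the lines of
append_lpj("root=/dev/sda1", find_lpj(text)), where text is the output
of dmesg. One weakness: if the kernel log buffer has wrapped, the
calibration line is gone and find_lpj() returns None, so this cannot be
relied on for a long-running system.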

> Secondly, speaking in general - if the system clock is actually broken,
> using kexec for dumps won't ever work, will it? When the crash kernel
> tries to boot, having no clock will break the scheduler, even with an
> lpj value set. While a dead system clock may not be a very likely
> situation, it would seem that diskdump and/or LKCD had better chances of
> being able to take the dump under those conditions. 
> 

- Can diskdump or LKCD capture the dump if the system clock is not
  working?
- Capturing the kernel core dump in the event of a hardware failure
  might not always be possible.

Thanks
Vivek
