On 09/02/2010 04:04 PM, Michael Neuling wrote: > In message <1283400367.2356.69.ca...@gandalf.stny.rr.com> you wrote: >> On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote: >> >>> We need to call smp_startup_cpu on boot when we the cpus are still in >>> FW. smp_startup_cpu does this for us on boot. >>> >>> I'm wondering if we just need to move the test down a bit to make sure >>> the preempt_count is set. I've not been following this thread, but >>> maybe this might work? >> >> Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren >> even said that adding the exit prevented the bug (although now he's >> hitting a hard lockup someplace else). The original code he was using >> did not have the condition to return for kexec. It was just a >> coincidence that this code helped in bringing a CPU back online. >> >>> >>> Untested patch below... >>> >>> Mikey >>> >>> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/ > pseries/smp.c >>> index 0317cce..3afaba4 100644 >>> --- a/arch/powerpc/platforms/pseries/smp.c >>> +++ b/arch/powerpc/platforms/pseries/smp.c >>> @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned > int lcpu) >>> >>> pcpu = get_hard_smp_processor_id(lcpu); >>> >>> - /* Check to see if the CPU out of FW already for kexec */ >>> - if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){ >>> - cpumask_set_cpu(lcpu, of_spin_mask); >>> - return 1; >>> - } >>> - >>> /* Fixup atomic count: it exited inside IRQ handler. */ >>> task_thread_info(paca[lcpu].__current)->preempt_count = 0; >> >> We DON'T want to do the above. It's nasty! This is one CPU's task >> touching an intimate part of another CPU's task. It's equivalent of me >> putting my hand down you wife's blouse. It's offensive, and rude. >> >> OK, if the CPU was never online, then you can do what you want. But what >> we see is that this fails on CPU hotplug. You stop a CPU, and it goes >> into this cede_processor() call. When you wake it up, suddenly the task >> on that woken CPU has its preempt count fscked up. This was really >> really hard to debug. We thought it was stack corruption or something. >> But it ended up being that this code has one CPU touching the breasts of >> another CPU. This code is a pervert! >> >> What the trace clearly showed, was that we take down a CPU, and in doing >> so, the code on that CPU set the preempt count to 1, and it expected to >> have it as 1 when it returned. But the code that kicked started this CPU >> back to life (bring the CPU back online), set the preempt count on the >> task of that CPU to 0, and screwed everything up. > > /me goes to checks where this came from... > > It's been in the kernel since hotplug CPU support was added to ppc64 > back in 2004, so I guess we are all at fault for letting this pervert > get away with this stuff for so long in plain sight. :-) > > So I guess we should remove this but we need to audit all the different > paths that go through here to make sure they are OK with preempt. > Normal boot, kexec boot, hotplug with FW stop and hotplug with > extended_cede all hit this. > > Mikey
CC'ing my alter ego. -- Darren Hart IBM Linux Technology Center Real-Time Linux Team _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev