I previously reported that I'd got a SPARCserver 1000E running SMP 8x CPUs by a hack upon which David Miller heaped an unjustified amount of scorn and opprobrium.
Well, I think I've tracked down the underlying reason and thought it worth writing up in case it's relevant to any other architectures. So far I'm working with 2.2.20. I expect the fix is OK for 2.2.26 (last version of 2.2) and I've made some progress with 2.3.

To set the scene, here's my original hack:

> Looking at what other people had reported about this problem and using the
> LEDs and PROM debugging messages I eventually determined that the DAE can be
> avoided by making a one-line kernel change, at which point the SS1000E runs
> 8x CPU SMP reliably. Specifically, I have had more than one machine with up
> to 8x SuperSparc-50s running with firmware 2.23, however I've had problems
> with another firmware version where it gave a watchdog timeout /before/ the
> "Booting Linux" message: I think this is probably a different issue.
>
> In arch/sparc/kernel/sun4d_smp.c there is a call to calibrate_delay(): this
> should be commented out. As far as I can tell (and I stress that I am neither
> a Sun guru nor a kernel hacker) it is only used for the secondary CPUs which
> default to the same speed as the primary one- and who in their right mind
> would try to run dissimilar CPUs SMP?
>
> Furthermore, looking at the calibrate_delay() code I suspect that the way
> that the global loops_per_jiffy variable is being used as a scratchpad is
> unsafe. Specifically, if on a particular SMP architecture (here sun4d)
> interrupts are not fully disabled while calibrate_delay() is running then
> anything which inadvertently uses the value of loops_per_jiffy could get
> into trouble.

I still fire that machine up now and again; for various jobs it's useful having that many CPUs even if they're slow.
A few days ago I focussed on the fact that it was assigning all IRQs to CPU 0 irrespective of which board the interrupting device was on. This turned out to be because sun4d_distribute_irqs() in arch/sparc/kernel/sun4d_irq.c assumes that SBus_chain is initialised, but this isn't done until sbus_init() is called somewhat later.

Making sure that sbus_init() is called before SMP is set up not only fixes the IRQ distribution problem but also eliminates the requirement for my earlier hack, presumably because the kernel now knows how to protect the non-reentrant part of calibrate_delay().

Unfortunately I'm not able to test this on either of the larger sun4d systems: the Sun SPARCcenter or Cray CS6400. If anybody in the UK's got one of the latter looking for a good home I'd like to know about it :-)

Incidentally, I came across some interesting material a couple of days ago that indicates that the sun4d architecture actually originated at Xerox PARC in the late 80s:

http://www.fing.edu.uy/inco/grupos/cecal/hpc/proyectos/amstp/refs/Compcon-SC2000.pdf.gz
http://www.perfdynamics.com/Bio/njg.html

Does anybody have any further information on this?

-- 
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

