The overall question: how do we balance the desire for optimal settings
with the need to still boot when we get the settings wrong, so that we
can fix things over the network?

The way you get to "best" settings now is to get into CMOS via the BIOS
"hit <DEL>" approach and interactively set things. Even now, in CMOS
settings, you can set things in such a way that the machine won't boot.
Your only option is to physically open the machine up and reset CMOS.
BIOS writers assume you've got a keyboard, monitor, and screwdriver. This
will not work in the cluster world.

Also, we have found that the BIOS does a lousy job of configuring things.
We want to do better than BIOS, and we want to be able to tune the
hardware to run as well as it can. At the same time, we recently found a
case where changing one setting (CPU pipeline) on the 630 caused no
problems on some Celeron machines, locked up PIII machines, and caused
other Celerons to reboot spontaneously. Ouch.

So, we need a way to make sure the machine will ALWAYS come up ("SAFE"
BIOS setting); we need a way to tune the settings for best performance; we
need a way to make all this happen while never having to pry the machine
open because we guessed wrong. My hope was that we could get Linux up, and
then apply the tuning parameters. There are a surprisingly large number of
bits that you can change even after the OS is up. However, I understand
that some things cannot be changed more than once, e.g. SDRAM.

I want to be able to set new settings by running some tool in Linux.
Rebooting to test the new settings is OK. But it is very important that I
have a way to recover from wrong settings by repairing things over the
network. One way to do this is to have Linux decide whether to apply new
settings or not. LinuxBIOS can do this too, but if LinuxBIOS applies
settings, where do we store them all? CMOS is too small! We can put them
in a special place in DoC, but not all systems have DoC.

One other possibility, which Eric Biederman mentioned: we can have an array
of "safe" settings in flash, and then arrays of "aggressive" settings. In
CMOS, we store a number which indicates whether to use aggressive
settings, and which setting to use; and another number (set by Linux) to
indicate that the system came up successfully to the OS. LinuxBIOS checks
the "came up ok" variable. If it is set to 1, then we can use
aggressive; if it is set to 0, then we use safe. LinuxBIOS always clears
the "came up ok" variable once it checks it. Linux sets this variable once
it is completely up (in fact you could allow it to be set only from a
remote cluster control node; then you know you got on the net).
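
To make the scheme above concrete, here is a rough sketch in C. It is
only an illustration, not LinuxBIOS code: the CMOS offsets, the struct
fields, the table values, and every function name are made up, and CMOS
is faked with a plain array so the logic can be tried on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CMOS layout; these offsets are assumptions. */
enum {
    CMOS_SIZE       = 128,
    CMOS_PROFILE    = 0x40, /* 0 = safe, N > 0 = aggressive table N-1 */
    CMOS_CAME_UP_OK = 0x41  /* set to 1 by Linux once it is fully up */
};

/* Stand-in for real CMOS so the sketch runs anywhere. */
static uint8_t cmos[CMOS_SIZE];

uint8_t cmos_read(size_t off)
{
    assert(off < CMOS_SIZE);
    return cmos[off];
}

void cmos_write(size_t off, uint8_t val)
{
    assert(off < CMOS_SIZE);
    cmos[off] = val;
}

/* Placeholder fields; a real table would hold chipset register values. */
struct chipset_settings {
    const char *name;
    uint8_t cpu_pipeline;
    uint8_t sdram_cas_latency;
};

/* These tables would live in flash. */
static const struct chipset_settings safe_settings = { "safe", 0, 3 };

static const struct chipset_settings aggressive_settings[] = {
    { "aggressive-1", 1, 3 },
    { "aggressive-2", 1, 2 },
};

#define NUM_AGGRESSIVE \
    (sizeof aggressive_settings / sizeof aggressive_settings[0])

/* Firmware side: pick the table to apply on this boot. */
const struct chipset_settings *select_settings(void)
{
    uint8_t ok = cmos_read(CMOS_CAME_UP_OK);
    uint8_t profile = cmos_read(CMOS_PROFILE);

    /* Always clear the flag once checked, so a boot that never reaches
     * the OS falls back to the safe table on the next reset. */
    cmos_write(CMOS_CAME_UP_OK, 0);

    if (ok != 1 || profile == 0 || profile > NUM_AGGRESSIVE)
        return &safe_settings;
    return &aggressive_settings[profile - 1];
}

/* OS side: called once Linux is completely up (or on request from a
 * remote control node). */
void mark_boot_ok(void)
{
    cmos_write(CMOS_CAME_UP_OK, 1);
}

/* OS side: the tuning tool asks for a profile on the next boot. */
void request_profile(uint8_t profile)
{
    cmos_write(CMOS_PROFILE, profile);
}
```

The intended sequence would be: Linux calls request_profile() and
mark_boot_ok(), the machine reboots, and select_settings() returns the
requested aggressive table. If that boot never gets as far as
mark_boot_ok(), the next reset finds the flag still cleared and gets the
safe table. An out-of-range profile number is also treated as safe.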

This still won't help if the aggressive settings make us hang. We need a
network power-off or network reset packet, which requires a wake-on-LAN
interface.

I'm open to ideas. This is not a simple problem.

ron

