On Fri, 19 Jan 2001, Eric Seppanen wrote:

> Can you provide a realistic case as to why an entire cluster would have
> been loaded with a linuxbios with settings that fail to boot the machine?

Yes. I see this happen from time to time. There is a whole class of
sysadmin errors that comes down to one typo. I've seen people remove
/lib/libc.so, or remove passwd, or do other terrible things. I've seen
people load kernels that won't boot, then reboot the machines, and then we
all say "oh shucks" or some similar expression.  I've also seen this happen
on clusters.  I've seen it happen to a building full of desktop machines,
leaving manual intervention, 100 times over, as the only way out. If
something can go wrong, it will go wrong, and it pays to prepare for it. It
consistently amazes me, the things that happen ...

Plus there is the "seems to work here" problem: you test it, it seems
fine, you apply it, and five days later there is this one application that
tickles a problem you could not anticipate. This happens too.

We have to have a safe way to load aggressive settings, test them, and
recover if there is a problem.

> Another idea: linuxbios, as it starts, stores a magic value somewhere
> (say, in CMOS ram) that basically says "I'm booting with aggressive
> settings".  Then, when linux hits runlevel 3, you have a userspace app go
> and erase that magic value.

We seem to be in agreement ... I think we're out of email sync :=) See my
other note on CMOS.
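
For the userland side, something like the sketch below is what I have in
mind: a tiny program, run from an init script at runlevel 3, that clears
the "aggressive boot in progress" byte through /dev/nvram. The flag offset
and the choice of /dev/nvram are just assumptions for illustration, not an
agreed-on layout.

  /* clear_bootflag.c -- hypothetical sketch: clear the "booting with
   * aggressive settings" flag in CMOS via the Linux /dev/nvram driver.
   * FLAG_OFFSET is an offset within /dev/nvram (which starts at CMOS
   * byte 14); the value chosen here is made up. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  #define FLAG_OFFSET 10   /* assumed spare byte, not a real layout */

  int main(void)
  {
      unsigned char zero = 0;
      int fd = open("/dev/nvram", O_WRONLY);

      if (fd < 0) {
          perror("open /dev/nvram");
          return 1;
      }
      if (lseek(fd, FLAG_OFFSET, SEEK_SET) < 0 ||
          write(fd, &zero, 1) != 1) {
          perror("clear boot flag");
          close(fd);
          return 1;
      }
      close(fd);
      return 0;
  }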

> Then, if a system ever fails to boot with aggressive settings, you could
> simply power-cycle (or reset the box in any way) and when linuxbios boots
> it can see that the magic value is already present, and knows the previous
> boot must have failed... therefore it uses the safe settings.

I like this.
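
At the BIOS end the check would be roughly the following (again only a
sketch: the CMOS byte index, the magic value, and the settings helpers are
invented names, and inb/outb are whatever port I/O helpers linuxbios
already has):

  /* Hypothetical sketch of the boot-time decision.  If the flag is
   * still set from the previous boot, userland never got far enough
   * to clear it, so fall back to safe settings. */
  #define CMOS_INDEX_PORT  0x70
  #define CMOS_DATA_PORT   0x71
  #define FLAG_BYTE        0x38   /* assumed spare CMOS byte */
  #define FLAG_MAGIC       0xA5   /* "aggressive boot in progress" */

  /* Assumed to be provided elsewhere in the BIOS. */
  extern unsigned char inb(unsigned short port);
  extern void outb(unsigned char value, unsigned short port);
  extern void use_safe_settings(void);
  extern void use_aggressive_settings(void);

  static unsigned char cmos_read(unsigned char index)
  {
      outb(index, CMOS_INDEX_PORT);
      return inb(CMOS_DATA_PORT);
  }

  static void cmos_write(unsigned char index, unsigned char value)
  {
      outb(index, CMOS_INDEX_PORT);
      outb(value, CMOS_DATA_PORT);
  }

  void choose_memory_settings(void)
  {
      if (cmos_read(FLAG_BYTE) == FLAG_MAGIC) {
          /* Previous aggressive boot never cleared the flag. */
          use_safe_settings();
      } else {
          /* Mark this attempt; userland clears it at runlevel 3. */
          cmos_write(FLAG_BYTE, FLAG_MAGIC);
          use_aggressive_settings();
      }
  }

One question the sketch leaves open: after a fallback boot, if userland
clears the flag as usual, the next boot will try the aggressive settings
again, so we'd want to decide whether the safe path should leave the flag
alone or record the failure somewhere more permanent.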

ron
