2006/525: New Boot Sparc

James Carlson Thu, 02 Aug 2007 09:39:30 -0400

Jan Setje-Eilers writes:
>  You seem to be arguing that we can come up with old data without 
> rolling/committing logs. If that's acceptable, then we can come up 
> with the old archive as well and can just eliminate the check.


I don't think so.

The effects of the boot archive are quite surprising to administrators
because changes take much longer to reach it.  If I edit /etc/system
(or, worse, perform some action that has the side-effect of editing
that file), then my expectation is that unless the system goes down
unexpectedly *right away*, those bits will be on the disk "soon."

I have an expectation that after a few seconds (and perhaps a few
mumbled superstitious 'sync' invocations), everything that I've
changed administratively is stable again.  User data in flight may be
discarded if the system crashes while it's in flight, but the OS
itself is stable.  That's true on the current SPARC systems, but not
true once we have a boot archive containing volatile files.

In that case, I've got an extra step to perform: regenerating the boot
archive after doing some edits.  On x86, because I can't easily know
when these sorts of changes happen, I've taken to uttering "bootadm
update-archive -v" every now and then, just on spec.  Every once in a
while, it catches something surprising.  (Particularly so in the first
reboot after an upgrade -- something about the boot process almost
always tweaks /etc/system or some famous file, meaning that right
after upgrade, my system is just _always_ in an unstable state.)

I suspect that some customers are using cron jobs for similar effect.

It's like the bad old days with "sync ; sync ; sync" ... wait for it
... "reboot."  It causes users to distrust the system.

>  I get the impression that you're placing some value on how old the
> data is. However in the case of a non-interfaced binary kernel
> component that really doesn't matter. It only matters if it's 
> compatible with the rest of the bits or not. So either way we'd need 
> the check.

Yes, part of it is a concern over how old the data in the archive are.
The other part is the effect of failure: when this happens, the
machine is stuck in boot.  Unless you've got access to the console,
and realize what's happened, the machine is just a warm brick.

Moving the volatile files out of the archive limits the scope of the
problem.  At that point, it's _only_ intentional packaging changes
that could affect the consistency of the boot archive, and we could
devise some simple way to make sure that those changes get committed
to the archive.  With volatile files in the archive, it's much more
wide-open, and more exotic (and I think unlikely) schemes such as FEM
would be needed.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
