>(I was directed here from ZFS, because the problem was identified
>to be based not on ZFS but on the boot archive:
>http://www.opensolaris.org/jive/thread.jspa?threadID=98092&tstart=0)
>
>We have around 1 outage per week, on average, and the
>machine(s) don't boot up as one might expect.
>Just today: reboot, and rebooting in circles; with no chance on my side
>to see the 30-40 lines of hex-stuff before the boot process recycles.
>That's already bad
You have a lot of power failures, but the system should still be able to
boot. The boot archive is always an issue because, in my book, it needs to
be updated too often. On systems that have been running for a long time
in particular, a power failure typically requires the boot archive to be
rebuilt. The only systems that come up cleanly after a power failure are
those that have been modified not to check the boot archive:
svccfg -s boot-archive setprop start/exec = :true
svccfg -s svc:/system/boot-archive refresh
But your particular case is different because your system doesn't even
load the archive and panics.
>So, let's try failsafe (all on nv_110). No better:
>"Configuring /dev
>relocation error: R_AMD64_PC32: file /kernel/dev/amd64/zfs: symbol
>down_object_opo_relocate failed [not fully correctly noted on my side]
This points to a damaged binary.
>zfs error doing relocations
>Searching for installed OS instances ...
>/sbin/install-recovery[7]: 72 segmentation Fault
>no installed OS instance found.
>Starting shell."
>init 6 brought back the failsafe, and there a boot archive was noted as
>damaged, and could be repaired, and the machine restarted after another
>init 6.
So the second boot allowed you to boot the failsafe archive properly;
this is weird, and it would also point to a possible hardware issue.
>At earlier boot failures after a power outage, the behaviour was
>different, but the boot archive was recognized as inconsistent a handful
>of times. This bugs me. Otherwise, the machines run through without
>trouble, and with ZFS, the chances for a damaged boot archive should be
>zero. Here it approaches a two-digit percentage.
That's actually not true: the boot archive becomes inconsistent when, e.g.,
summer time or winter time begins, or when a USB device (or any other
removable device) is connected or removed.
That's why I typically use the commands listed above: *I* prefer the
system to boot even when the boot archive is out of date.
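If you skip the check like that, it is probably still worth bringing the
archive up to date by hand once the system is running. A minimal sketch,
assuming the standard bootadm and svcprop tools are on the PATH; the
function name refresh_boot_archive is only made up for this example:

```shell
#!/bin/sh
# Sketch: inspect the boot-archive start method and rebuild the archive
# on a running system. refresh_boot_archive is a hypothetical helper name.
refresh_boot_archive() {
    # Bail out early on machines that do not have bootadm at all.
    if ! command -v bootadm >/dev/null 2>&1; then
        echo "bootadm not found; is this a Solaris/OpenSolaris system?" >&2
        return 1
    fi
    # Show what the service runs at start (":true" if the check is disabled).
    svcprop -p start/exec svc:/system/boot-archive
    # Rebuild the boot archive if it is out of date.
    bootadm update-archive
}
```

Run it as root, e.g. "refresh_boot_archive", after any change that is
likely to have touched files in the archive.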
>It was pointed out to me, that the problem was a corruption of the
>boot archive by a third party driver.
Which particular driver would this be?
>My questions/suggestions are:
>
>Ought boot archive not be an independent process, that creates a
>proper backup in case of any modification, from any stupid handling?
>Should a recycling reboot not be noted, if just by a flag (in case we
>have r/w of a drive), including a redirection of the messages into a
>file?
>Should we not keep track of a proper roll-back point to offer to boot
>to in case of failing/recycling boots? Maybe something like 'last
>successful boot'?
With ZFS this should become possible; this looks like an interesting project
in itself. At the end of a successful boot, you would clone the root and
record it as the "last-successful-boot".
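A rough sketch of what that could look like, nothing more: the dataset
name rpool/ROOT/opensolaris and the helper plan_last_good are assumptions
for illustration (check your own layout with "zfs list"). The script only
prints the zfs commands, so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: print the zfs commands that would record the current root
# dataset as "last-successful-boot". Nothing is executed here.
# plan_last_good and the default dataset name are hypothetical.
plan_last_good() {
    root_ds=$1                  # e.g. rpool/ROOT/opensolaris
    tag=last-successful-boot
    parent=${root_ds%/*}        # e.g. rpool/ROOT
    # Drop the previous marker snapshot and the clone that depends on it.
    echo "zfs destroy -R $root_ds@$tag"
    # Snapshot the root that just booted, then clone that snapshot.
    echo "zfs snapshot $root_ds@$tag"
    echo "zfs clone $root_ds@$tag $parent/$tag"
}

plan_last_good "${ROOT_DS:-rpool/ROOT/opensolaris}"
```

On the very first run the destroy line has nothing to remove and can be
skipped. Making the clone selectable at boot (a GRUB entry, or whatever
the boot environment tooling provides) is not covered here.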
Casper