I am sponsoring this project as a fast-track on behalf of Jan
Setje-Eilers and John Johnson. The case timer is set to 8/7/2007.
The project desires patch/update binding.
(For those who have already seen this, this re-introduction is
to transition the case from closed to open.)
thx,
-jg
#ident "@(#)design 1.2 07/07/31 SMI"
New Solaris SPARC Boot Architecture
Summary
The Solaris SPARC bootstrap process is being redesigned, both to increase the
commonality with Solaris x86, and to enable ITUs (install time updates) on
SPARC. Secondary goals are to combine inetboot and wanboot into a single
network boot architecture, and to provide a simplified architecture for disk
filesystems other than UFS (e.g., ZFS).
The Solaris x86 newboot project has already been delivered to Nevada and S10u1.
That project was described by PSARC 2004/454. This project is a follow-on to
that project.
This design specification only covers the architecture for Solaris SPARC. For
details of the Solaris x86 design, see PSARC 2004/454.
1. Introduction
---------------
The Solaris boot process was designed in the early 1990s for desktops which were
tiny by today's standards. On the one hand, the design was constrained by a
relatively small amount of system memory. On the other hand, the design took
advantage of the presence of Open Boot Prom (OBP) on all Sun platforms. The
resulting implementation involves a complex sequence of control handoff between
kernel and OBP to load a minimum amount of text and data from the root device
into memory.
2. Motivation
-------------
A change in the Solaris boot architecture is motivated by the following
problems:
2.1 Supporting new hardware
Solaris SPARC does not support ITUs, so a platform that requires any Solaris
changes requires a full Solaris update release. This requirement causes pain
for both the Solaris and SPARC HW groups, as the two must fight over Update
schedules for almost every HW platform.
2.2 Commonality with Solaris x86
The newboot project on Solaris x86 has made both the administrative and boot
processes visibly different between Solaris SPARC and Solaris x86. This is in
part due to the addition of a boot archive to Solaris x86. It is in Sun's
interest that the look and feel of Solaris on different platforms be as common
as feasible.
2.3 Ease of booting different FS types
The current boot architecture requires multiple filesystem readers, making adding
new ones a difficult process. The goal of this project is to only require
one phase to read the root FS before the kernel mounts it. This will enable
new filesystem types like ZFS to be more easily supported as root filesystems.
2.5 Common network boot process
The Solaris SPARC network boot process is very different between booting over a
LAN versus booting over a WAN. The WAN process boots via a miniroot that looks
much like how Solaris x86 installs over a network, but there is no commonality
in how the miniroots are created or administered. This project will unify
SPARC network booting around a single architecture.
3. Proposed Architecture
------------------------
3.1 Boot Phase Independence
The primary design goal is to make the phases of the boot process independent
of each other. This allows the addition of new features (e.g., new
filesystem types) without requiring changes to multiple parts of the boot chain.
This project envisions 4 phases of booting on SPARC machines:
OBP
The OBP phase of boot is unchanged. In fact, it's a requirement that SPARC
newboot not require new OBP functionality. OBP will continue to load and
execute a booter from a disk or network device.
booter
The booter phase is responsible for reading in the boot archive and executing
it. This is the only phase of the boot process that requires knowledge of the
root filesystem format.
ramdisk
The ramdisk is a boot archive containing either kernel modules or an install
miniroot. This boot archive is the same boot archive as is used on Solaris x86.
Its FS format is private to itself; i.e., neither the booter nor the kernel
needs to know whether the archive is HSFS or UFS (or ZFS for that matter). The
ramdisk will extract the kernel image from the boot archive and execute it.
In order to minimize the size of the ramdisk, in particular the
install miniroot, which must reside in memory, the contents of the
miniroot will be compressed. This compression is on a per file level
and is implemented within the filesystem. To create compressed files, a
userland utility compresses the file in place; the file is then marked as
compressed in the filesystem metadata via the _FIO_COMPRESSED (private)
ioctl.
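The compress-then-mark flow can be pictured with the minimal sketch below. It
is an illustration only: zlib stands in for whatever compressor is actually
used, the FileRecord type and both function names are hypothetical, and the
"compressed" field merely models the metadata flag that the _FIO_COMPRESSED
ioctl sets.

```python
import zlib
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical stand-in for a file plus its filesystem metadata."""
    data: bytes
    compressed: bool = False  # models the flag set via _FIO_COMPRESSED


def compress_in_place(f: FileRecord) -> None:
    """Compress the file contents in place, then mark the file as compressed."""
    if f.compressed:
        return  # already compressed; do not compress twice
    f.data = zlib.compress(f.data)
    f.compressed = True


def read_file(f: FileRecord) -> bytes:
    """Return the logical contents, inflating transparently when marked."""
    return zlib.decompress(f.data) if f.compressed else f.data
```

Readers call read_file() without caring whether a given file was compressed,
which is the point of implementing the compression within the filesystem.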
kernel
The final stage is the kernel. The kernel extracts the rest of the primary
modules from the boot archive, initializes itself, mounts the real root file
system, and throws away the boot archive. This process is also the same as on
Solaris x86.
3.2 Removal of bootops (and ufsboot)
The bootops vector on SPARC originally existed to support platforms with either
OBP or SUNMON FW. The second stage booter (e.g., ufsboot) presented a common
set of operations to the kernel so the kernel didn't need to know either what
prom version was running or what filesystem type root was. The last Solaris
release to support a SUNMON platform was Solaris 2.4, and the last one to
support pre-IEEE 1275 OBP was Solaris 7, so it's time for the bootops to go
gently into that good night.
3.2.3 unix and krtld combined
The Xen project (PSARC 2006/260) has combined the unix and krtld modules in
order to enable booting from either BIOS or the Xen hypervisor. Since combining
these two makes the unix ELF header far simpler to parse and load, this project
has adapted this change for SPARC.
3.2.4 boot properties
There are three properties retrieved via the current bootops that neither OBP
nor the kernel has any knowledge of. These are:
    "fstype"            name of root filesystem type (e.g., "ufs")
    "impl-arch-name"    platform name (aka `uname -i`)
    "whoami"            file system name of booted kernel
A new node (/packages/boot-properties) will be added to the OBP device tree by
the booter which contains these properties.
In addition to the above compatibility properties, several new properties are
needed:
    "bootarchive"        boot archive path (as opposed to "bootpath", the root
                         file system path)
    "elfheader-address"  address of kernel ELF header, used by krtld in lieu of
                         the bootaux vector passed up via the second-stage booter
    "elfheader-length"   length of kernel ELF header
    "archive-fstype"     fstype of boot archive
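As a quick reference, the property set can be modelled as a plain dictionary
and checked for completeness. This is only a sketch: the helper name is
hypothetical, and a Python dict is merely a stand-in for the
/packages/boot-properties node.

```python
# Property names taken from the lists above.
COMPAT_PROPERTIES = ("fstype", "impl-arch-name", "whoami")
NEW_PROPERTIES = ("bootarchive", "elfheader-address",
                  "elfheader-length", "archive-fstype")


def missing_boot_properties(node: dict) -> list:
    """Return the names of expected properties that the node does not carry."""
    expected = COMPAT_PROPERTIES + NEW_PROPERTIES
    return [name for name in expected if name not in node]
```

After the booter phase alone, such a check would still report the properties
that the ramdisk phase is responsible for creating.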
3.3 Differences from Solaris x86
The primary difference from PSARC 2004/454 is that there is no dependence on
grub. This decision was made for both practical and functional reasons. The
practical reason is that grub 0.95 - which was used by Solaris x86 - is not
available on SPARC. The functional reason is that most of the reasons grub was
used for Solaris x86 (e.g., eliminate real-mode, eliminate the boot shell, 3rd
party device support) are not applicable to Solaris SPARC. Making grub work in
the network boot case on SPARC is not a trivial exercise, as grub does not
currently support NFS or HTTP, and adding wanboot's non-exportable cryptography
code is problematic with respect to the GPL.
The boot menu provided by grub would be an interesting feature to provide Solaris
SPARC users. If grub2 were to become available on SPARC in a stable supportable
release it could be incorporated in booter phase above. This should not require
much change to existing code since the booter retains its ability to load
alternate secondary booters such as cprboot.
4. Interfaces
-------------
4.1 Interfaces Exported
-----------------------------------------------------------------------
Name                            Level        Comments
-----------------------------------------------------------------------
Boot files                      Evolving
  platform/<platform>/boot_archive           default name of boot archive
                                             (initially may be sun4u/sun4v only)
Boot args
  kernel args                   Evolving     see kernel(1M)
  -F <alternate file>           Evolving     alternate booter or boot archive
usr/sbin/bootadm(1M)            Stable
usr/sbin/root_archive(1M)       Stable
boot/solaris/bin
  create_ramdisk                Proj. Priv.
Install-time Update                          different user prompts
boot/solaris
  filelist.ramdisk              Proj. Priv.  boot archive content
  filestat.ramdisk              Proj. Priv.  boot archive file status
kernel/fs/dcfs                  Proj. Priv.  compression file system
sbin/fiocompress                Proj. Priv.  file compression utility
usr/sbin/fiocompress            Proj. Priv.  link to sbin/fiocompress
usr/include/sys/fs/decomp.h     Proj. Priv.  dcfs header file
_FIO_COMPRESSED                 Proj. Priv.  file compression ioctl
usr/platform/sun4[uv]/lib/fs
  hsfs/bootblk                  Proj. Priv.  filesystem readers
  zfs/bootblk                   Proj. Priv.  filesystem readers
platform/sun4[uv]/ufsboot       Proj. Priv.  removed
kernel/misc/sparcv9/krtld       Proj. Priv.  merged into unix
-----------------------------------------------------------------------
4.2 Interfaces Reimplemented
-----------------------------------------------------------------------
Name                            Level        Comments
-----------------------------------------------------------------------
reboot(1M)                      Stable?      update boot archive as needed
halt(1M)
poweroff(1M)
shutdown(1M)
init(1M)
pkgadd(1M)
patchadd(1M)
add_install_client              Evolving     modified server setup
smdiskless(1M)                  Evolving     setup /tftpboot area
Install                                      consume bootadm and installgrub
Upgrade
Live Upgrade
Flash Install
Net Install
Jumpstart
Release engineering tools
-------------------------
modified miniroot construction
5. User Experience
------------------
5.1 System startup
The system startup changes are not visible to the user.
5.2 Installation and Upgrade
The ITU menu, which currently exists only on x86, may be exposed on SPARC as
well.
Customers who rely on undocumented non-interfaced implementation details
of add_install_client may need to amend their procedures.
If we deliver the wanboot component, wanboot customers will see a
simplification of the deployment procedures.
5.3 Internal tools
The bfu scripts will be updated to permit developers to transition from old boot
to new boot. For a system initially installed with old boot, it will be possible
to bfu back and forth across the boundary. For a system initially installed with
new boot, bfu back to old boot is not supported. Booting a glommed kernel is
supported, with the same restrictions regarding the compatibility of userland
and kernel.
If the root file system goes into a non-bootable state, a user may boot
the failsafe archive to perform manual recovery operations. The failsafe
archive contains files normally present in a CD or netinstall miniroot.
If the boot archive goes into a non-bootable state, a user may bypass the boot
archive and directly boot the kernel with the -F <kernel FS path> option.
5.4 Coexistence with other OSes
The only other OS of note for SPARC is Linux, and they currently have no grub
plans. SPARC / Linux boots from a loader called SILO, which already has a
grub-like menu facility.
In general, OBP is considered well suited to loading multiple OSes or versions
of OSes from various devices on the system, so no additional work beyond what
is provided by OBP/the system is desirable.
6. Technical Details
--------------------
6.1 OBP phase
This project does not change the OBP phase of SPARC boot; this section is
included for reference only.
When a user types "boot" on an OBP-based system, the device selected - either
from the command line or via the "boot-device" nvram variable - has its "open"
and "load" methods called. The program loaded by this process is then executed,
and the boot process enters the booter phase.
6.1.1 Disk
For disk devices, the FW driver usually uses the OBP label package's "load"
method, which parses the VTOC label at the beginning of the disk to locate the
specified partition, then reads sectors 1-15 of that partition into memory. This
area is commonly called the boot block and usually contains a filesystem reader.
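The extent read from the partition can be worked out as below. This is a
minimal sketch that assumes 512-byte sectors (the sector size is not stated
above); the constant and function names are hypothetical.

```python
SECTOR_SIZE = 512        # assumption: classic 512-byte disk sectors
FIRST_BOOT_SECTOR = 1    # sectors 1-15 of the partition, per the text
LAST_BOOT_SECTOR = 15


def boot_block_extent(partition_start_sector: int,
                      sector_size: int = SECTOR_SIZE) -> tuple:
    """Return (byte offset on disk, byte length) of the boot block area."""
    offset = (partition_start_sector + FIRST_BOOT_SECTOR) * sector_size
    length = (LAST_BOOT_SECTOR - FIRST_BOOT_SECTOR + 1) * sector_size
    return offset, length


def read_boot_block(disk_image: bytes, partition_start_sector: int) -> bytes:
    """Slice the boot block area out of an in-memory disk image."""
    offset, length = boot_block_extent(partition_start_sector)
    return disk_image[offset:offset + length]
```

Under that assumption the boot block area is 15 sectors, or 7680 bytes, which
is the space a filesystem reader has to fit into.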
6.1.2 Net
For network devices, the process is slightly different between booting over a
LAN versus booting over a WAN. In both cases, however, the prom will download
a booter from a boot or install server.
6.1.2.1 LAN boot
When booting over a LAN, the FW uses either RARP and BOOTP or DHCP to discover
its boot or install server. It then uses TFTP to download the booter (inetboot
in this case).
6.1.2.2 WAN boot
When booting over a WAN, the FW uses either DHCP or nvram properties to discover
its install server, and the router and proxies needed to connect to it. It then
uses HTTP to download the booter, and may optionally check the booter's signature
with a predefined private key. For more details, see PSARC 2001/009.
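As an illustration only, a keyed check of a downloaded booter could look like
the sketch below. It assumes an HMAC-SHA1 digest computed over the whole booter
image with a key provisioned out of band; the actual algorithm and key handling
are defined by PSARC 2001/009 and are not restated here. Both function names
are hypothetical.

```python
import hashlib
import hmac


def booter_digest(key: bytes, booter_image: bytes) -> bytes:
    """Compute a keyed digest over the booter image (assumed HMAC-SHA1)."""
    return hmac.new(key, booter_image, hashlib.sha1).digest()


def verify_booter(key: bytes, booter_image: bytes, expected: bytes) -> bool:
    """Compare digests in constant time; refuse to run the booter on mismatch."""
    return hmac.compare_digest(booter_digest(key, booter_image), expected)
```

The constant-time comparison is a general precaution for digest checks, not
something this design document calls for.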
6.2 Booter phase
This phase is derived from the SPARC wanboot ramdisk process (see PSARC
2001/009), and is responsible for reading the boot archive from the root file
system (or install server in wanboot's case) into a ramdisk device. It does
this by:
1) opening the boot-device (which it found as the "bootpath" property in the OBP
"/chosen" node)
2) using its file system specific reader to read the boot archive (by default,
/platform/`uname -m`/boot_archive)
3) creating a ramdisk device in "/ramdisk"
4) creating "bootarchive" and "fstype" properties in "/packages/boot-properties"
5) booting the archive (a ramdisk is just another type of disk, so executing the
boot block area serves this purpose)
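The five steps above can be sketched against a simulated device tree. This is
a rough model rather than the booter's implementation: the dictionary standing
in for the OBP device tree, the read_file callback, and the function name are
all hypothetical, and the value stored in "bootarchive" is assumed to be the
archive's path.

```python
from typing import Callable, Dict

DeviceTree = Dict[str, Dict[str, object]]  # node path -> properties


def booter_phase(tree: DeviceTree,
                 read_file: Callable[[str, str], bytes],
                 machine: str,
                 root_fstype: str) -> bytes:
    """Walk through steps 1-5 of the booter phase on a fake device tree."""
    # 1) Find the boot device via the "bootpath" property of /chosen.
    bootpath = str(tree["/chosen"]["bootpath"])
    # 2) Read the boot archive with the filesystem-specific reader.
    archive_path = "/platform/%s/boot_archive" % machine
    image = read_file(bootpath, archive_path)
    # 3) Create the ramdisk device node holding the archive image.
    tree["/ramdisk"] = {"image": image}
    # 4) Publish the properties that later phases rely on.
    props = tree.setdefault("/packages/boot-properties", {})
    props["bootarchive"] = archive_path
    props["fstype"] = root_fstype
    # 5) The real booter would now execute the ramdisk's boot block;
    #    this sketch simply returns the loaded image.
    return image
```

Only step 2 depends on the root filesystem format, which is why the reader is
passed in as a callback here.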
6.3 Ramdisk phase
The ramdisk is self-describing in the same sense any disk image is by virtue of
having a filesystem reader in its boot block. This reader is over 90% the same
as the disk boot block for a given filesystem so the same program is re-used.
Its job is to load and execute the kernel from the archive by:
1) opening the boot archive (the "bootarchive" property from the previous phase)
2) using its file system specific reader to read the kernel (by default,
/platform/`uname -i`/kernel/unix)
3) creating "impl-arch-name", "whoami" and "elfheader" properties
4) executing the kernel
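A matching sketch of the ramdisk phase follows. It is again illustrative only:
the archive is modelled as a dictionary of path to file contents, the function
and parameter names are hypothetical, and the ELF header address and length
are taken as inputs because this document does not say how they are derived.

```python
ELF_MAGIC = b"\x7fELF"  # first four bytes of any ELF file


def ramdisk_phase(props: dict, archive: dict, platform: str,
                  elf_addr: int, elf_len: int) -> bytes:
    """Walk through steps 1-4 of the ramdisk phase on a fake archive."""
    # 1) The archive is the one named by "bootarchive" in the booter phase.
    if "bootarchive" not in props:
        raise KeyError("bootarchive property missing")
    # 2) Read the kernel from its default location in the archive.
    kernel_path = "/platform/%s/kernel/unix" % platform
    kernel = archive[kernel_path]
    if kernel[:4] != ELF_MAGIC:
        raise ValueError("%s is not an ELF image" % kernel_path)
    # 3) Publish the properties the kernel (krtld) will look up.
    props["impl-arch-name"] = platform
    props["whoami"] = kernel_path
    props["elfheader-address"] = elf_addr
    props["elfheader-length"] = elf_len
    # 4) The real reader would now jump to the kernel entry point;
    #    this sketch returns the image instead.
    return kernel
```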
6.4 Kernel phase
When krtld gains control, it mounts the boot archive and loads additional kernel
modules from the boot archive via the ramdisk. Subsequent kernel initialization
procedures remain the same until after the kernel mounts the root file system.
At that point, the kernel throws away the boot archive and reclaims the memory
it occupies. Note that in the install case, the ramdisk actually contains the
root file system, and is not thrown away. The kernel ramdisk driver simply
takes over control of the ramdisk image.
6.5 Chained booters
When booting from a disk, the booter will support chained booters both for
cprboot (see PSARC 1992/201) and for situations where the file system reader
cannot fit in the boot block. In a future project, grub2 can use this facility
to add a graphical user menu to the booter phase. This facility will not be
available when booting from a network, since the chained booter usually uses the
same virtual address space as the original booter.
6.6 Boot archive management
There are two kinds of boot archive: failsafe and normal. A failsafe archive is
self-sufficient and bootable by itself. It is created at install time and
requires no maintenance. A normal archive shadows a root filesystem, so it
contains all kernel modules, driver.conf files, and a few configuration files in
/etc which are read by the kernel before root is mounted. Once the root
filesystem is mounted, the kernel discards the boot archive from memory and
file I/O will be performed against the root device.
By default, the normal archive contains the following files and directories:
etc/system
etc/name_to_major
etc/driver_aliases
etc/name_to_sysnum
etc/dacf.conf
etc/driver_classes
etc/path_to_inst
etc/devices/devid_cache
etc/devices/mdi_scsi_vhci_cache
etc/devices/mdi_ib_cache
etc/cluster/nodeid
etc/zfs/zpool.cache
kernel
platform
The contents under the platform directory will be segregated into those needed
for a sun4u boot archive and those needed for a sun4v boot archive. Further
per-platform differentiated boot archives may be considered if that helps
us gain faster booting via faster archive load, trading off a more complicated
archive construction process.
If any file in this list (or under the directories listed) is updated, the boot
archive must be rebuilt prior to the next reboot for the modification to take
effect. The package and patch tools are updated to update the boot archive
whenever needed. In addition, the boot archive is updated as necessary on an
orderly system shutdown to catch files modified manually.
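One simple way to decide whether the archive needs rebuilding is to compare
modification times, as sketched below. This is an assumption made for
illustration; bootadm(1M) may use a different mechanism (for example the
filestat.ramdisk file listed in section 4.1), and the function name is
hypothetical.

```python
import os


def archive_is_stale(root: str, archive: str, filelist: list) -> bool:
    """Return True if any listed file (or file under a listed directory)
    was modified more recently than the boot archive."""
    archive_mtime = os.path.getmtime(archive)
    for entry in filelist:
        path = os.path.join(root, entry)
        if not os.path.exists(path):
            continue  # optional files may be absent on a given system
        if os.path.isdir(path):
            for dirpath, _dirs, files in os.walk(path):
                for name in files:
                    candidate = os.path.join(dirpath, name)
                    if os.path.getmtime(candidate) > archive_mtime:
                        return True
        elif os.path.getmtime(path) > archive_mtime:
            return True
    return False
```

A shutdown hook could call this with the default file list above and rebuild
the archive only when it returns True.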
The boot archive could be out of sync with the root filesystem if the system
panics in the middle of an update, before the archive update is completed. We
check for such conditions on every boot before the root filesystem is mounted
writeable. If an inconsistency is detected, the system will stop in single-user
mode, similar to current behavior when fsck fails on the root filesystem. The
recommended recovery method is to boot the failsafe archive and recreate the
boot_archive. An expert user may decide to continue booting if the out-of-sync
files are not critical.
As noted above, if the boot archive cannot be fixed, the kernel can be booted
directly, bypassing the archive, via the -F <kernel> argument. This option will
not work when booting via HTTP, since there is no underlying FS to read from.
It will also be noticeably slower than booting via the archive, since the relatively
simple booters do not implement a buffer cache.
The bootadm(1M) command will handle the details of archive update and
verification.
6.7 Install and upgrade
Normal install and upgrade is achieved by booting the miniroot from either
CDROM/DVD or from the network. In both cases, the root filesystem of the
miniroot is the ramdisk. This allows the Solaris boot CD to be ejected without
rebooting the system. The boot archive contains the entire miniroot.
The construction of the install CD is modified to use an hsfs boot block. The
miniroot is packed into a single file in ufs format, to be loaded as the ramdisk
image.
The setup of the net boot server is also modified. The boot server will serve a
boot strap as well as the ramdisk image which is downloaded and then booted
from.
The netinstall image will be packed using root_archive(1M).
The process for installing the OS to disk remains the same except that the boot
blocks are different and a boot archive must be constructed prior to booting the
install target disk. The boot archives are created using bootadm(1M). The rest
of the code changes come with packages and patches, and no special treatment is
required.
6.8 Diskless clients
Diskless boot is similar to booting the miniroot for net install
except that the root filesystem is on NFS instead of on UFS.
6.9 Install-time Update (ITU)
New-boot (phase 1: x86) reduced the x86 ITU mechanism described in PSARC
1997/059 to simply adding Solaris binaries (drivers, kernel modules, commands,
libraries, symlinks and the like) to the running miniroot and then the target
install environment. It also extended the supported media that an ITU could
be supplied on to include CD/DVD, memory sticks and similar devices.
This mechanism will now be made available on SPARC as well to allow platform
support to be delivered out of band of a regularly scheduled release.
If a core kernel component needs to be updated, say unix or something else that
needs to be loaded before an ITU can be added, then a pre-patched miniroot image
needs to be made available (50Mb download) along with the patch.
References
----------
1. Shudong Zhou     PSARC 2004/454  Solaris Boot Architecture
2. Carl Smith       PSARC 2001/009  WAN-boot
3. Clark Dong       PSARC 1992/201  Checkpoint Resume (Reanimator)
4. Allan McKillop   PSARC 2006/260  Solaris on Xen