Folks -
I'm preparing to submit the attached PSARC case to provide better
support for device removal and insertion within ZFS. Since this is a
rather complex issue, with a fair share of corner cases, I thought I'd
send the proposal out to the ZFS community at large for further comment
before submitting it.
The prototype is functional except for the offline device insertion and
hot spares functionality. I hope to have this integrated within the
next month, along with the next phase of FMA integration. Please
respond with any comments, concerns, or suggestions.
Thanks,
- Eric
--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
1. INTRODUCTION
Currently, ZFS supports what is affectionately known as poor man's
hotplug. If a device is removed from the system, then it is assumed
that upon I/O failure, an attempt to reopen the same device will fail.
This will trigger an FMA fault, substituting a hot spare if available.
This is undesirable for two reasons:
- There is no distinction between device removal and arbitrary failure.
If a device is removed from the system, it should be treated as a
deliberate action different from normal failure.
- There is no support for automatic response to device insertion. For a
server configured with a ZFS pool, the administrator should be able to
walk up, remove any drive (preferably a faulted one), insert a new
drive, and not have to issue any ZFS commands to reconfigure the pool.
This is particularly true for the appliance space, where hardware
reconfiguration should just work.
This case enhances ZFS to respond to device removal and provides a
mechanism to automatically deal with device insertion. While the
framework is generic, the primary target is devices supported by
the SATA framework. The only device-specific portion of this proposal
concerns determining if a device is in the same physical location as a
previously known device, which involves correlating a transport's enumeration
of the device with the device's physical location within the chassis.
2. DEVICE REMOVAL
There are two types of device removal within Solaris. Coordinated
device removal involves stopping all consumers of the device, using the
appropriate cfgadm(1M) command (PSARC 1996/285), and then physically
removing the device. Uncoordinated removal (also known as surprise
removal) is when a device is physically removed while still in active
use by the system. The latter is increasingly common as more I/O protocols
support hotplug and higher level software (ZFS) becomes more capable.
There are several ways to detect device removal within Solaris. Fibre
channel drivers generate the NDI events FCAL_INSERT_EVENT and
FCAL_REMOVE_EVENT. USB and 1394 drivers generate the NDI events
DDI_DEVI_INSERT_EVENT and DDI_DEVI_REMOVE_EVENT. In addition to these
event channels, there is also the DKIOCSTATE ioctl() which returns (on
capable drivers) DKIO_DEV_GONE if the device has been removed.
Of these, the ioctl() is the most widely supported, and is the mechanism
used as part of this case. Since this is an implementation detail of
the current architecture, it does not preclude using alternate
mechanisms in the future. When an I/O to a disk fails, ZFS will query
the media state using the DKIOCSTATE ioctl(). If the device is in any
state other than DKIO_INSERTED, ZFS will transition the device to a new
REMOVED state. No FMA fault will be triggered, and a hot spare will be
substituted if one is available. Note that DKIO_DEV_GONE can be
returned for a variety of reasons (pulling cables, external chassis
being powered off, etc.). In the absence of additional FMA information,
it is assumed that this is an intentional administrative action.
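The decision described above can be sketched roughly as follows. This is
only an illustration, not the prototype's code: the enum names are local
stand-ins for the DKIO_* values in <sys/dkio.h>, classify_io_failure() and
the outcome names are hypothetical, and the handling of a failed query is
an assumption that the proposal does not spell out.

```c
/*
 * Stand-ins for the DKIO_* media states declared in <sys/dkio.h> on
 * Solaris. The names and numeric values are local to this sketch.
 */
enum media_state {
	MEDIA_NONE,
	MEDIA_EJECTED,
	MEDIA_INSERTED,
	MEDIA_DEV_GONE
};

/* Hypothetical outcomes; only REMOVED is named in the proposal. */
enum vdev_outcome {
	OUTCOME_NORMAL_FAILURE,	/* continue with ordinary error handling */
	OUTCOME_REMOVED		/* move the device to the REMOVED state */
};

/*
 * Decide what to do after an I/O to a disk has failed.
 *
 * ioctl_rc is the return value of the media state query (0 on success)
 * and state is the value it reported. A real caller would obtain both
 * from something like ioctl(fd, DKIOCSTATE, &state).
 */
enum vdev_outcome
classify_io_failure(int ioctl_rc, enum media_state state)
{
	/*
	 * Assumption: if the driver cannot answer the query, nothing is
	 * known about the media, so fall back to normal failure handling.
	 */
	if (ioctl_rc != 0)
		return (OUTCOME_NORMAL_FAILURE);

	/* Per the text: any state other than "inserted" means removal. */
	if (state != MEDIA_INSERTED)
		return (OUTCOME_REMOVED);

	return (OUTCOME_NORMAL_FAILURE);
}
```

With this shape, a successful query reporting either the "gone" or the
"ejected" state yields the REMOVED outcome, while a device that still
reports "inserted" after an I/O error continues down the ordinary failure
path.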
As part of this work, lofiadm(1M) will be expanded to include a new
force (-f) flag when removing devices. Combined with the upcoming lofi
devfs events (PSARC 2006/709), this will provide a much simpler testing
framework without the need for physical hardware interaction. When this
flag is used, the underlying file will be closed, any further I/O or
attempts to open the device will fail, and DKIOCSTATE will return
DKIO_DEV_GONE. This flag will remain private for testing only, and will
not be documented.
An example of this in action:
# lofiadm -a /disk/a
/dev/lofi/1
# lofiadm -a /disk/b
/dev/lofi/2
# lofiadm -a /disk/c
/dev/lofi/3
# zpool create -f test mirror /dev/lofi/1 /dev/lofi/2 spare /dev/lofi/3
# while :; do touch /test/foo; sync; sleep 1; done
[1] 100662
# zpool status
pool: test
state: ONLINE
scrub: none requested
config:
        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            /dev/lofi/1  ONLINE       0     0     0
            /dev/lofi/2  ONLINE       0     0     0
        spares
          /dev/lofi/3    AVAIL
errors: No known data errors
# lofiadm -d /disk/a -f
# zpool status
pool: test
state: DEGRADED