I am sponsoring the following fasttrack for myself, requesting micro/patch
binding and a timeout of 2/13/2008.
-Chris
Template Version: @(#)sac_nextcase 1.64 07/13/07 SMI
This information is Copyright 2008 Sun Microsystems
1. Introduction
1.1. Project/Component Working Name:
Multiplexed I/O Enhancements to Support FMA
1.2. Name of Document Author/Supplier:
Author: Chris Horne
1.3 Date of This Document:
01 February, 2008
4. Technical Description
4.1 Problem
This fasttrack covers new Multiplexed I/O [1] (MPXIO) related
interfaces to support storage FMA efforts. FMA needs the
following:
A) When a command associated with a scsi_pkt(9S) completes, it
must contain information about what physical hardware path
was involved in processing the request.
B) If an ereport is generated, it must contain a dev scheme
FMRI representation of the physical hardware path that
persists across reboot, and is independent of mpxio
enable/disable.
C) If an ereport class is associated with a storage device
acting as an error detector, the ereport must contain
path-independent device identity information (devid). For
such device-as-detector errors, a diagnostic engine
(eversholt) is expected to map the ereport to the fmd(1M)
libtopo storage topology [11] using the devid.
D) If an ereport class is associated with a transport failure,
fma code must be able to model the configuration and have
APIs available to further explore the fault boundaries
related to specific paths to storage.
Addressing these problems involves changes to the following
areas of the ON code base: libdevinfo(3LIB), scsi_pkt(9S),
uscsi(7I), mdi(9F), scsi_vhci(7D), fmd(1M),
ddi_fm_ereport_post(9F), and fmdump(1M).
4.2 Proposal
The topology of a given storage configuration is the same
independent of whether mpxio is enabled or disabled.
Supporting a device under mpxio affects the interfaces used to
discover, report, and explore the topology, but the fundamental
topology itself does not change. With this in mind, it is
important to be able to 'name' things independent of mpxio.
For paths, this can be done by recognizing that an mdi pathinfo
node, and its libdevinfo path node counterpart, can be
represented by the same path string that would be used if the
device was enumerated using a devinfo node.
This proposal defines the string representation of a path to a
pathinfo node, introduces the concept of "path_instance"
associated with these paths, and provides new APIs to expose
and utilize these two concepts.
The proposal also addresses some inconsistencies in the
libdevinfo(3LIB) path node interfaces, and seeks to promote the
'path' node interfaces defined by [1] and 'devlink' libdevinfo
interfaces defined by [6,13,14] from consolidation private to
evolving.
The proposal also covers a minor enhancement to the eversholt
language and run-time environment as well as a new output
filter option for fmdump(1M).
4.2.1 Path String and Path Instance
Each devinfo node in the device tree can be uniquely
represented by a path_string, with a separate
"/node_na...@unit_addr]" path_string component for each level
in the tree. For each devinfo node that succeeds attach(9E),
the I/O framework persists a unique "instance to path_string"
mapping (/etc/path_to_inst file, includes driver name too).
A pathinfo node is a peer of a devinfo node, but currently
lacks the formal definition and supporting interfaces to make
this obvious. Just like a devinfo node, a pathinfo node has a
"node_name" (inherited from the mpxio 'client' node) and a
"unit_address" (_bus_addr). Given this, just like a devinfo
node, a pathinfo node can have a path_string representation and
a unique path_instance associated with the path_string.
For a given device, the path_string representation of a
pathinfo node is identical in value to the devinfo path_string
had the device *not* been enumerated under MPXIO.
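As an illustration of the naming rule above, the following is a
minimal sketch (not code from this case) that assembles a
path_string from one "node_name" and "unit_address" pair per level
of the tree. The struct path_component type and the
build_path_string() helper are hypothetical names used only for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one level of the device tree. */
struct path_component {
	const char *node_name;	/* e.g. "ssd" */
	const char *unit_addr;	/* e.g. "w50020f230000826a,4" */
};

/*
 * Build "/name@addr/name@addr..." into buf, one component per level.
 * Returns the length of the string, or -1 if buf is too small.
 */
int
build_path_string(const struct path_component *comp, size_t ncomp,
    char *buf, size_t buflen)
{
	size_t used = 0;

	assert(comp != NULL && buf != NULL && buflen > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < ncomp; i++) {
		int n = snprintf(buf + used, buflen - used, "/%s@%s",
		    comp[i].node_name, comp[i].unit_addr);

		/* snprintf reports the untruncated length it wanted. */
		if (n < 0 || (size_t)n >= buflen - used)
			return (-1);
		used += (size_t)n;
	}
	return ((int)used);
}
```

Because the same components are used whether the final level is a
devinfo node or a pathinfo node, the resulting string does not
depend on whether the device is enumerated under MPXIO.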
The path_instance mechanism proposed will persist the
"path_instance to path_string" mapping across the
destruction/recreation of a pathinfo node (and across
detach(9E)/re-attach(9E) of any parent) - keeping the
path_instance value valid beyond any locking scope. However,
unlike a devinfo node instance, a path_instance does not
persist across reboot. The path_instance of a path node is
available in a di_init(3DEVINFO) snapshot, and can be used to
direct device access down a specific path using the proposed
uscsi(7I) extensions.
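The persistence property described above follows from keying the
mapping by path_string rather than by node. A minimal user-level
sketch, assuming a fixed-size in-memory table (the case does not
describe the kernel data structure; struct path_instance_table and
both helpers are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	MAX_PATHS	64
#define	MAX_PATH_LEN	256

/* Hypothetical table; slot i holds the string for instance i + 1. */
struct path_instance_table {
	char paths[MAX_PATHS][MAX_PATH_LEN];
	int count;
};

/*
 * Return the path_instance for path_string, assigning the next free
 * value the first time a string is seen. Returns 0, which is never a
 * valid path_instance in this sketch, if the table is full or the
 * string is too long.
 */
int
path_instance_get(struct path_instance_table *t, const char *path_string)
{
	assert(t != NULL && path_string != NULL);
	for (int i = 0; i < t->count; i++) {
		if (strcmp(t->paths[i], path_string) == 0)
			return (i + 1);
	}
	if (t->count == MAX_PATHS || strlen(path_string) >= MAX_PATH_LEN)
		return (0);
	strcpy(t->paths[t->count], path_string);
	return (++t->count);
}

/* Reverse lookup: the path_string for a path_instance, or NULL. */
const char *
path_instance_to_string(const struct path_instance_table *t, int instance)
{
	assert(t != NULL);
	if (instance < 1 || instance > t->count)
		return (NULL);
	return (t->paths[instance - 1]);
}
```

A node that is destroyed and later recreated with the same
path_string gets the same path_instance back, because entries are
never removed. The table lives only in memory, which matches the
statement that a path_instance does not persist across reboot.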
The fact that the devinfo/pathinfo path-string representation
remains the same independent of MPXIO enable/disable, and that
it provides a physical hardware orientation, makes it an ideal
enhancement to the current dev scheme FMRI representation used
in ereports.
NOTE: While a pathinfo node is similar to a devinfo node
relative to path-string representation, there are also many
distinguishing features such as: a driver does not bind to a
pathinfo node, and a pathinfo node does not have minor_nodes
associated with it.
NOTE: Past issues around different unit-address representation
for devinfo.vs.pathinfo have been resolved (4953227,6274205).
NOTE: If needed, the path_instance mechanism could be extended
in the future to apply to devinfo nodes too - providing a way
to get the ddi_devpathname() of a device independent of the
state of the devinfo node. One motivation for doing this would
be if the 1K of stack space currently used by
ddi_fm_ereport_post(9F) in the nosleep code path ever poses
a stack-overflow problem. A devinfo path_instance would remove
the need for that 1K (MAXPATHLEN) stack space. In addition, the
path_instance could be extended to keep both device path and
device identity - so that different devices seen at the same
path end up with a different device_instance. This would be
useful for device paths that don't include identity information
in the unit-address of the final path component. A change in
the path_instance of a node might indicate the need to fence
the access until the identity change is properly coordinated.
4.2.2 Problems with scsi_pkt(9S)
The scsi_pkt(9S) structure is the primary mechanism used for
communication between a SCSA target driver and a SCSA HBA
driver. This proposal extends the scsi_pkt(9S) structure to
include path_instance.
From the target driver perspective: at command issue time a
non-zero pkt_path_instance requests a command to be sent down a
specific path (zero indicates vHCI selects a path), and at
pkt_comp() time the pkt_path_instance communicates which path
was in fact selected. This means that the pkt_path_instance at
pkt_comp() time can be used to generate the FMRI of the actual
hardware path used. At command issue time the pkt_path_instance
field allows the target driver to select a specific path: one
of its own choosing, or the specific path indicated by a user
level application using the uscsi(7I) extensions proposed.
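The in/out use of pkt_path_instance described above can be
modelled as follows. This is a sketch only: struct mock_pkt,
struct mock_path, and mock_vhci_start() are hypothetical stand-ins
rather than SCSA interfaces, and the policy choices (first online
path wins, a request for an unusable path fails) are assumptions
that the case does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of scsi_pkt(9S). */
struct mock_pkt {
	int pkt_path_instance;	/* in: requested path, 0 = any; out: used */
};

/* Hypothetical view of one path as seen by the vHCI. */
struct mock_path {
	int path_instance;
	int online;
};

/*
 * Select a path for pkt and record the choice in pkt_path_instance so
 * that the completion side can see which path carried the command.
 * Returns 0 on success, -1 if no suitable path exists.
 */
int
mock_vhci_start(struct mock_pkt *pkt, const struct mock_path *paths,
    size_t npaths)
{
	assert(pkt != NULL && paths != NULL);
	for (size_t i = 0; i < npaths; i++) {
		if (!paths[i].online)
			continue;
		/* Zero means "vHCI selects"; non-zero must match. */
		if (pkt->pkt_path_instance == 0 ||
		    pkt->pkt_path_instance == paths[i].path_instance) {
			pkt->pkt_path_instance = paths[i].path_instance;
			return (0);
		}
	}
	return (-1);
}
```

At completion time the value left in pkt_path_instance is the one a
target driver would translate back to a path_string when building
the FMRI for an ereport.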
The DDI does not allow a driver to allocate its own
scsi_pkt(9S), and a driver should not have *any* compiled-in
dependencies on "sizeof (struct scsi_pkt)": a driver that
violates these rules limits SCSA's ability to evolve. The
scsi_pkt allocation rules have been in place for many (>10)
years. Unfortunately, a significant number of drivers are still
broken - making the scsi_pkt structure difficult to extend.
As part of this work SCSA will be enhanced to detect HBA
drivers with scsi_pkt(9S) allocation violations - printing a
message like
WARNING: mpt: violates DDI scsi_pkt(9S) allocation rules
for each driver found in violation (once per boot).
CRs were filed against some broken HBA drivers a long time ago
(based on code inspection). Many of these CRs were just closed
without ever addressing the problem (and some of the drivers
may be EOLed at this point).
http://monaco.sfbay.sun.com/detail.jsf?cr=5039931,5039932,5039934,5039935,5039936,5039937,5039938,5039941,5039942
For initial nevada putback, the message above will be displayed
(once per driver) for debug kernels. If we are unable to get
enough drivers fixed, we may need to disable the messages
completely.
The best way of fixing scsi_pkt allocation violations is to
change an HBA driver to use the tran_setup_pkt(9E) interfaces
defined by [3,4]. If this proves difficult, we may need to
implement a scsi_pkt_size(9F) peer of the buf(9S) biosize(9F)
interface.
To implement this, a new scsi_pkt_allocated_correctly()
interface is provided. While HBA drivers are being fixed,
access to pkt_path_instance must be conditioned by calls to
scsi_pkt_allocated_correctly(). For maximum flexibility, HBA
drivers that enumerate under scsi_vhci (fcp, iscsi, mpt, ibsrp)
should be fixed first.
4.2.3 Uscsi enhancements allow path selection by path_instance
This proposal enhances uscsi(7I) to support path_instance based
path steering via a new uscsi_path_instance "struct uscsi_cmd"
field and a new USCSI_PATH_INSTANCE uscsi_flags bit. To
preserve the last remaining uscsi_cmd field (uscsi_reserved_5)
for future expansion, the input-only uscsi_path_instance field
overlays the current output-only uscsi_resid field. The common
scsi_uscsi_alloc_and_copyin() interface (6451061) is enhanced
to only allow USCSI_PATH_INSTANCE operation to the
scsi_vhci(7D) HBA, and a new scsi_uscsi_initpkt() is defined to
isolate scsi_pkt_allocated_correctly() use. This is a compatible
enhancement to uscsi(7I): the uscsi_path_instance field is only
considered valid if the new USCSI_PATH_INSTANCE bit is set.
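The compatibility argument above depends on the flag bit gating
interpretation of the overlaid field. A minimal sketch, assuming a
cut-down command structure: struct mock_uscsi_cmd, the union
layout, and the MOCK_USCSI_PATH_INSTANCE value are all hypothetical
(the case states only that the two fields share storage, not how
the overlay is declared or which bit is used).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value; the case does not give the real bit. */
#define	MOCK_USCSI_PATH_INSTANCE	0x1

/* Hypothetical, cut-down stand-in for "struct uscsi_cmd". */
struct mock_uscsi_cmd {
	int uscsi_flags;
	union {
		size_t resid;		/* output only: residual count */
		int path_instance;	/* input only: requested path */
	} u;
};

/*
 * Return the path_instance requested by the caller, or 0 (no path
 * preference) when the flag says the overlaid field is not valid.
 */
int
mock_uscsi_requested_path(const struct mock_uscsi_cmd *cmd)
{
	assert(cmd != NULL);
	if ((cmd->uscsi_flags & MOCK_USCSI_PATH_INSTANCE) == 0)
		return (0);
	return (cmd->u.path_instance);
}
```

An existing application that never sets the new flag is unaffected
even if it leaves stale data in the residual field, since the field
is ignored on input unless the flag is present.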
The MPAPI implementation of path steering provided by [2] makes
sense for MPAPI applications, but its existence does not
preclude the implementation of other path steering mechanisms.
The path steering mechanism provided by this case will be
easier to use for di_init(3DEVINFO) snapshot consumers.
4.2.4 ddi_fm_ereport_post/fm_dev_ereport_postv() interface enhancements
The device tree is not represented by a single data structure
with embedded type information - the way vnodes are. Instead,
the device tree is composed of a number of different data
structures linked together: devinfo nodes (directories VDIR),
minor nodes (files VBLK/VCHR), and pathinfo nodes (a bit like a
hardlink VLNK).
The current ddi_fm_ereport_post(9F) interface is limited to
just the concept of a devinfo node in conjunction with an
fma_capable driver bound to the node. The proposal extends the
set of dev scheme *_fm_ereport_post interfaces to cover the
other types of nodes in the device tree and to better support
nexus-child relationships.
Nexus driver interfaces (ndi_*) are all private. When exposure
is necessary, nexus concepts are abstracted via DDI
interfaces. For example the scsi_device(9S) data structure is
an abstraction of a leaf target-driver devinfo child below a
SCSA HBA nexus driver.
For fma, we need basic building-block ddi_fm_*() interfaces
that can be leveraged when adding fma support to abstractions.
This proposal delivers a private fm_dev_ereport_postv()
'va_list' building-block interface, and converts
ddi_fm_ereport_post() to use fm_dev_ereport_postv(). The
fm_dev_ereport_postv() interface can support ddi_, ndi_,
and abstracted callers. The abstracted callers may understand
pathinfo node operations and device identity.
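The 'va_list' building-block pattern can be sketched as below. The
mock_* functions are hypothetical and greatly simplified: they take
NULL-terminated (name, value) string pairs and format into a
buffer, whereas the real interfaces take other arguments and post
an ereport. Only the layering is meant to be illustrative: one
common va_list routine, with thin variadic wrappers on top.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Common building block: consumes a va_list prepared by a wrapper. */
int
mock_ereport_postv(char *buf, size_t buflen, const char *ereport_class,
    va_list ap)
{
	const char *name;
	size_t used;
	int n;

	assert(buf != NULL && ereport_class != NULL);
	n = snprintf(buf, buflen, "%s", ereport_class);
	if (n < 0 || (size_t)n >= buflen)
		return (-1);
	used = (size_t)n;
	while ((name = va_arg(ap, const char *)) != NULL) {
		const char *value = va_arg(ap, const char *);

		n = snprintf(buf + used, buflen - used, " %s=%s",
		    name, value);
		if (n < 0 || (size_t)n >= buflen - used)
			return (-1);
		used += (size_t)n;
	}
	return ((int)used);
}

/* ddi-style wrapper: forwards its variable arguments unchanged. */
int
mock_ddi_ereport_post(char *buf, size_t buflen, const char *ereport_class,
    ...)
{
	va_list ap;
	int rv;

	va_start(ap, ereport_class);
	rv = mock_ereport_postv(buf, buflen, ereport_class, ap);
	va_end(ap);
	return (rv);
}

/* Abstracted wrapper: also appends device identity it knows about. */
int
mock_scsi_ereport_post(char *buf, size_t buflen, const char *devid,
    const char *ereport_class, ...)
{
	va_list ap;
	int used, n;

	assert(devid != NULL);
	va_start(ap, ereport_class);
	used = mock_ereport_postv(buf, buflen, ereport_class, ap);
	va_end(ap);
	if (used < 0)
		return (-1);
	n = snprintf(buf + used, buflen - (size_t)used, " devid=%s", devid);
	if (n < 0 || (size_t)n >= buflen - (size_t)used)
		return (-1);
	return (used + n);
}
```

Callers terminate the pair list with (const char *)NULL so that the
sentinel has the pointer type that va_arg() expects.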
4.2.5 Eversholt
This fasttrack also adds a minor feature to the "Eversholt"
language used to describe fault trees [8,9]. The change is
required for disk FMA work [11] and will not cause any
compatibility issues. The stability of the Eversholt language
remains Sun Private and the release binding for the changes
described here is micro/patch.
On approval, a new version of the "Eversholt Language Manual"
will be available [9]. The change adds a new optional property
to Section 2.3.1.5 "Error Report Events" called
'discard_if_config_unknown', with the following additions to
the associated table and text:
Property                    Required or   Allowed
                            Optional      Types
discard_if_config_unknown   Optional      Integer
The 'discard_if_config_unknown' property, when given a
non-zero value, tells the run-time environment that a failure
to associate an event with the current configuration should
result in the event being silently discarded.
In addition to the Eversholt language change, the fmd eversholt
run-time environment is enhanced to support the new property,
and to implement ereport-configuration match based on devid.
These changes allow kernel generated ereports that include
devid information to map to the configuration topology based on
devid. An ereport defined with 'discard_if_config_unknown' that
fails to find a configuration mapping will silently be
discarded. This allows us to support structured error logs
(fmdump -e) on machine configurations where the full storage
topology is not yet represented.
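The run-time decision described above can be sketched as follows.
All names are hypothetical, devids are compared as plain strings
(the actual comparison used by the run-time environment is not
described in this case), and what happens to an unmapped ereport
that lacks the property is left as a separate outcome because the
case does not change it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum ereport_disposition {
	EREPORT_DIAGNOSE,	/* mapped to a configuration node */
	EREPORT_DISCARD,	/* unmapped and silently discarded */
	EREPORT_UNMAPPED	/* unmapped, existing handling applies */
};

/* Hypothetical: one storage node in the configuration topology. */
struct mock_topo_node {
	const char *devid;
};

/* Hypothetical: the parts of an ereport relevant to the decision. */
struct mock_ereport {
	const char *devid;		/* may be NULL */
	int discard_if_config_unknown;	/* value of the new property */
};

enum ereport_disposition
mock_match_ereport(const struct mock_ereport *e,
    const struct mock_topo_node *topo, size_t ntopo)
{
	assert(e != NULL && (topo != NULL || ntopo == 0));
	if (e->devid != NULL) {
		for (size_t i = 0; i < ntopo; i++) {
			if (topo[i].devid != NULL &&
			    strcmp(topo[i].devid, e->devid) == 0)
				return (EREPORT_DIAGNOSE);
		}
	}
	/* No configuration match: a non-zero property means discard. */
	return (e->discard_if_config_unknown != 0 ?
	    EREPORT_DISCARD : EREPORT_UNMAPPED);
}
```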
4.2.6 Prtconf output
Prtconf output is changed to show path_instance and path_string
(+ output below).
Paths from multipath bus adapters:
+ Path 3: /pci@8,600000/pci@1/SUNW,qlc@5/fp@0,0/ssd@w50020f230000826a,4
fp#6 (online)
...
+ Path 10: /pci@8,600000/pci@1/SUNW,qlc@4/fp@0,0/ssd@w50020f230000826a,4
fp#4 (online)
...
+ Path 17: /pci@8,700000/SUNW,emlxs@3,1/fp@0,0/ssd@w50020f230000826a,4
fp#2 (online)
...
+ Path 24: /pci@8,700000/SUNW,emlxs@3/fp@0,0/ssd@w50020f230000826a,4
fp#0 (online)
...
4.2.7 fmdump filter enhancement
The fmdump(1M) CLI introduced by [12] is enhanced to provide a
new '[-n name[.name]*[=value]]' output filter option. This
filter option works on both fault logs (fmdump) and error logs
(fmdump -e). It filters based on ereports having properties
with the specified name and, if a value is given, on that
property having the specified value. For string properties the
value can be a regular expression. Embedded nvlist properties
are filtered by using a name that crosses multiple levels, with
each level separated by a '.'. Examples:
# fmdump -e | wc -l
5
# fmdump -en detector.devid | wc -l
4
# fmdump -en \
'detector.devid=id1,sd@THITACHI_DK32EJ-36NC_____433H8282' | wc -l
2
# fmdump -en 'detector.devid=.*HITACHI.*' | wc -l
2
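The matching rule behind the -n option can be sketched in user
code as below. This is not the fmdump implementation: struct
mock_nvpair is a hypothetical stand-in for an nvlist, and the use
of POSIX extended regular expressions with an unanchored search is
an assumption, since the case does not say which regular
expression flavor or anchoring fmdump uses.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an nvlist member. */
struct mock_nvpair {
	const char *name;
	const char *str;		/* non-NULL for a string property */
	const struct mock_nvpair *list;	/* non-NULL for an embedded list */
	size_t nlist;
};

static const struct mock_nvpair *
find_member(const struct mock_nvpair *nv, size_t n, const char *name,
    size_t len)
{
	for (size_t i = 0; i < n; i++) {
		if (strlen(nv[i].name) == len &&
		    strncmp(nv[i].name, name, len) == 0)
			return (&nv[i]);
	}
	return (NULL);
}

/*
 * Return 1 if the property named by the '.'-separated name exists
 * and, when pattern is non-NULL, its string value matches pattern.
 * Return 0 otherwise.
 */
int
mock_nvfilter(const struct mock_nvpair *nv, size_t n, const char *dotted,
    const char *pattern)
{
	const struct mock_nvpair *p;
	regex_t re;
	int rv;

	assert(nv != NULL && dotted != NULL);
	for (;;) {
		size_t len = strcspn(dotted, ".");

		p = find_member(nv, n, dotted, len);
		if (p == NULL)
			return (0);
		if (dotted[len] == '\0')
			break;
		if (p->list == NULL)
			return (0);	/* cannot descend into a string */
		nv = p->list;
		n = p->nlist;
		dotted += len + 1;
	}
	if (pattern == NULL)
		return (1);
	if (p->str == NULL ||
	    regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (0);
	rv = (regexec(&re, p->str, 0, NULL, 0) == 0);
	regfree(&re);
	return (rv);
}
```

With an ereport that has an embedded "detector" list containing a
"devid" string, a name of "detector.devid" alone tests for the
presence of the property, and adding a pattern further restricts
the match to its value.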
4.4 Interface Tables
------------------------------------------------------------------------
INTERFACES BEING REMOVED      Old
Interface                     Level       Comments
------------------------------------------------------------------------
libdevinfo(3LIB):
  di_path_phci_path           Cons.Priv.  ARCed and defined in
                                          libdevinfo.h, but never
                                          implemented. It is better to
                                          match di_devfs_path()
                                          structure with new
                                          di_path_devfs_path() below.
  di_path_client_path         Cons.Priv.  same, see
                                          di_path_client_devfs_path()
                                          below.
  di_path_addr                Cons.Priv.  Currently unused. Mismatch
                                          with di_bus_addr() peer. See
                                          di_path_bus_addr() below.
------------------------------------------------------------------------
INTERFACES BEING RENAMED      Old Level   Comments
AND PROMOTED                  New Level
Interface
------------------------------------------------------------------------
libdevinfo(3LIB):
  (old)di_path_next_phci->    Cons.Priv.  Confusing name: caller
  (new)di_path_client_next_path           starts with a *client* dip
                              Evol.       (not a phci dip) and
                                          iterates through paths
                                          associated with *client*.
                                          Also, a client can have
                                          multiple paths associated
                                          with the same phci.
  (old)di_path_next_client->  Cons.Priv.  Confusing name: caller
  (new)di_path_phci_next_path             starts with a *phci* dip
                              Evol.       (not a client) and iterates
                                          through paths associated
                                          with phci.
------------------------------------------------------------------------
INTERFACES BEING PROMOTED
Interface                     Level       Comments
------------------------------------------------------------------------
libdevinfo(3LIB): (from PSARC/1999/647 [1]: Cons.Priv. -> Evol.):
  DINFOPATH                   Evol.       di_init() flag to snapshot
                                          path
                                          get . associated with path
  di_path_client_node         Evol.           .client
  di_path_phci_node           Evol.           .phci
                                          get path property as . array
  di_path_prop_bytes          Evol.                            .byte
  di_path_prop_int64s         Evol.                            .int64
  di_path_prop_ints           Evol.                            .integer
  di_path_prop_strings        Evol.                            .string
                                          look for property as . array
  di_path_prop_lookup_bytes   Evol.                            .byte
  di_path_prop_lookup_int64s  Evol.                            .int64
  di_path_prop_lookup_ints    Evol.                            .integer
  di_path_prop_lookup_strings Evol.                            .string
  di_path_prop_name           Evol.       get name of path property
  di_path_prop_type           Evol.       get type of path property
  di_path_prop_next           Evol.       walk path properties
  di_path_state               Evol.       get path state
  DI_PATH_STATE_FAULT         Evol.       failed, not currently in use
  DI_PATH_STATE_OFFLINE       Evol.       not available to rcv/xmit data
  DI_PATH_STATE_ONLINE        Evol.       available to rcv/xmit data
  DI_PATH_STATE_STANDBY       Evol.       up, but not currently in use
------------------------------------------------------------------------
INTERFACES BEING PROMOTED
Interface                     Level       Comments
------------------------------------------------------------------------
libdevinfo(3LIB): (PSARC/2000/310 [13]: Cons.Priv. -> Evol.):
                  (PSARC/2002/239 [14]: Cons.Priv. -> Evol.):
  di_devlink_init()           Evol.       Obtain a snapshot of the
                                          devlink database.
  di_devlink_handle_t         Evol.       opaque handle to snapshot.
  DI_MAKE_LINK                Evol.       Update /dev
  di_devlink_fini()           Evol.       Destroy snapshot.
  di_devlink_walk()           Evol.       Walk links in snapshot.
  di_devlink_t                Evol.       opaque handle to devlink.
  di_devlink_path()           Evol.       Get devlink path.
  di_devlink_content()        Evol.       Get devlink contents.
  di_devlink_type()           Evol.       Get devlink type.
  DI_PRIMARY_LINK             Evol.       devlink to /devices
  DI_SECONDARY_LINK           Evol.       devlink to devlink
  di_devlink_dup()            Evol.       Copy a devlink object
  di_devlink_free()           Evol.       Free a devlink object
------------------------------------------------------------------------
NEW INTERFACES
Interface                     Level       Comments
------------------------------------------------------------------------
libdevinfo(3LIB):
  di_path_node_name()         Evol.       di_path_t peer of di_node_t
                                          oriented di_node_name().
  di_path_bus_addr()          Evol.       di_path_t peer of di_node_t
                                          oriented di_bus_addr(). Also
                                          fixing CR6284426
                                          "di_path_addr should have
                                          its second argument
                                          removed".
  di_path_instance()          Evol.       di_path_t peer of di_node_t
                                          di_instance().
  di_path_devfs_path()        Evol.       di_path_t peer of di_node_t
                                          di_devfs_path().
  di_path_client_devfs_path() Evol.       di_path_t peer of di_node_t
                                          di_devfs_path().
  di_path_private_get()       Evol.       di_path_t peer of di_node_t
                                          di_node_private_get().
  di_path_private_set()       Evol.       di_path_t peer of di_node_t
                                          di_node_private_set().
  di_lookup_node()            Cons.Priv.  di_path_t peer of di_node_t
                                          di_lookup_node().
  di_lookup_path()            Cons.Priv.  di_path_t peer of di_node_t
                                          di_lookup_node().
  di_devfs_path_match()       Cons.Priv.  check to see if two /devices
                                          paths are the same, ignoring
                                          any generic.vs.non-generic
                                          node name mismatches.
uscsi(7I):
  .uscsi_path_instance        Cons.Priv.  new "struct uscsi_cmd"
                                          field, path_instance to send
                                          command down if
                                          USCSI_PATH_INSTANCE set.
  USCSI_PATH_INSTANCE         Cons.Priv.  new uscsi_flags bit, send
                                          command down specific path
  scsi_uscsi_initpkt()        Cons.Priv.  New uscsi setup interface
                                          for target driver
                                          implementing uscsi
                                          path_instance.
scsi_pkt(9S):
  .pkt_path_instance          Cons.Priv.  new scsi_pkt(9S) field
                                          holding path_instance.
  FLAG_PATH_INSTANCE          Cons.Priv.  new pkt_flags bit, send
                                          command down specific path.
  FLAG_PATH_INSTANCE_RPT      Cons.Priv.  new pkt_flags bit,
                                          path_instance reports path
                                          used.
  scsi_pkt_pathname()         Cons.Priv.  Given scsi_pkt and
                                          scsi_device, return /devices
                                          path_string used to send
                                          command.
mdi:
  .pi_path_instance           Cons.Priv.  pathinfo node's
                                          path_instance
  MDI_SELECT_PATH_INSTANCE    Cons.Priv.  New mdi_path_select()
                                          method.
  mdi_pi_get_path_instance()  Cons.Priv.  Return path_instance given
                                          pathinfo.
  mdi_pi_pathname()           Cons.Priv.  pathinfo peer of devinfo
                                          ddi_pathname().
  mdi_pi_pathname_by_instance()
                              Cons.Priv.  Return path_string given
                                          path_instance.
fma:
  ndi_fm_ereport_post()       Cons.Priv.  ndi form of
                                          ddi_fm_ereport_post(9F).
  fm_dev_ereport_postv()      Cons.Priv.  Common implementation code
                                          (and used by implementation
                                          of scsi_fm_ereport_post()).
eversholt:
  discard_if_config_unknown   Private     optional property of .esc
                                          'ereport' declaration.
prtconf(1M):
  prtconf -v (output)         Not.an.     show path_string
                              Interface
fmdump(1M):
  fmdump -n name[.name]*[=value]
                              Evol.       -n filter option.
  fmdump -n (output)          Not.an.
                              Interface
4.5 References
[1] Multiplexed I/O Framework
http://sac.sfbay/PSARC/1999/647
http://www.opensolaris.org/os/community/arc/caselog/1999/647
[2] mpxio path steering (MPAPI)
http://sac.sfbay/PSARC/2006/621
http://www.opensolaris.org/os/community/arc/caselog/2006/621
[3] new scsi_hba_tran entry points (scsi_pkt)
http://sac.eng.sun.com/PSARC/2005/680/mail
http://www.opensolaris.org/os/community/arc/caselog/2005/680
[4] scsa dma enhancement (scsi_pkt)
http://sac.eng.sun.com/PSARC/2006/240/mail
http://www.opensolaris.org/os/community/arc/caselog/2006/240
[5] Dev scheme specification - Section 8.4.3
http://fma.eng/documents/engineering/protocol_whtppr.pdf
[6] libdevinfo reimplementation
http://sac.sfbay/PSARC/1997/127/commit.materials/devinfo.pdf
[7] MDI/pHCI/libdevinfo Extensions for SNIA MPAPI support
http://sac.sfbay/PSARC/2005/646
[8] Eversholt Diagnosis Technology
http://sac.eng/PSARC/2003/428
[9] Eversholt Language Manual (Version 1.5 10/04/06)
http://eversholt.central/docs/language/
[10] Generic Topology for Internal Disks
http://sac.sfbay/PSARC/2007/388/mail
http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.016.DiskTopology
[11] Unified Disk FMA
http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.015.UnifiedDisk
[12] Solaris Fault Management Daemon
http://sac.sfbay/PSARC/2003/089/
[13] libdevinfo devlinks interfaces
http://sac.sfbay/PSARC/2000/310
[14] Devlink Creation Enhancements
http://sac.sfbay/PSARC/2002/239
4.6 Man page changes
The materials directory in the case directory has information
for the following new/changed man pages.
Deliver (Evolving):
di_devfs_path.3devinfo.diff
di_devlink_dup.3devinfo
di_devlink_init.3devinfo
di_devlink_path.3devinfo
di_devlink_walk.3devinfo
di_init.3devinfo.diff
di_lnode_private_set.3devinfo.diff
di_path_info.3devinfo
di_path_next.3devinfo
di_path_prop_access.3devinfo
di_path_prop_lookup.3devinfo
di_path_prop_next.3devinfo
fmdump.1m.diff
libdevinfo.3lib.diff
Eversholt.index.html.diff.txt
Information in Case directory (Cons.Priv.)
uscsi.7i.diff
6. Resources and Schedule
6.4. Steering Committee requested information
6.4.1. Consolidation C-team Name:
ON
6.5. ARC review type: FastTrack
6.6. ARC Exposure: open