I am sponsoring the following fasttrack for myself, requesting micro/patch
binding and a timeout of 2/13/2008.

-Chris


Template Version: @(#)sac_nextcase 1.64 07/13/07 SMI
This information is Copyright 2008 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Multiplexed I/O Enhancements to Support FMA

    1.2. Name of Document Author/Supplier:
         Author:  Chris Horne

    1.3  Date of This Document:
        01 February, 2008

4. Technical Description

   4.1  Problem

        This fasttrack covers new Multiplexed I/O [1] (MPXIO) related
        interfaces to support storage FMA efforts. FMA needs the
        following:

        A)  When a command associated with a scsi_pkt(9S) completes, it
            must contain information about what physical hardware path
            was involved in processing the request.

        B)  If an ereport is generated, it must contain a dev scheme
            FMRI representation of the physical hardware path that
            persists across reboot, and is independent of mpxio
            enable/disable.

        C)  If an ereport class is associated with a storage device
            acting as an error detector, the ereport must contain
            path-independent device identity information (devid). For
            such device-as-detector errors, a diagnostic engine
            (eversholt) is expected to map the ereport to the fmd(1M)
            libtopo storage topology [11] using the devid.

        D)  If an ereport class is associated with a transport failure,
            fma code must be able to model the configuration and have
            APIs available to further explore the fault boundaries
            related to specific paths to storage.

        Addressing these problems involves changes to the following
        areas of the ON code base: libdevinfo(3LIB), scsi_pkt(9S),
        uscsi(7I), mdi(9F), scsi_vhci(7D), fmd(1M),
        ddi_fm_ereport_post(9F), and fmdump(1M).

   4.2  Proposal

        The topology of a given storage configuration is the same
        independent of whether mpxio is enabled or disabled.
        Supporting a device under mpxio affects the interfaces used to
        discover, report, and explore the topology, but the fundamental
        topology itself does not change. With this in mind, it is
        important to be able to 'name' things independent of mpxio.
        For paths, this can be done by recognizing that an mdi pathinfo
        node, and its libdevinfo path node counterpart, can be
        represented by the same path string that would be used if the
        device was enumerated using a devinfo node.

        This proposal defines the string representation of a path to a
        pathinfo node, introduces the concept of "path_instance"
        associated with these paths, and provides new APIs to expose
        and utilize these two concepts.

        The proposal also addresses some inconsistencies in the
        libdevinfo(3LIB) path node interfaces, and seeks to promote the
        'path' node interfaces defined by [1] and 'devlink' libdevinfo
        interfaces defined by [6,13,14] from consolidation private to
        evolving.

        The proposal also covers a minor enhancement to the eversholt
        language and run-time environment as well as a new output
        filter option for fmdump(1M).

      4.2.1 Path String and Path Instance

        Each devinfo node in the device tree can be uniquely
        represented by a path_string, with a separate
        "/node_na...@unit_addr]" path_string component for each level
        in the tree. For each devinfo node that succeeds attach(9E),
        the I/O framework persists a unique "instance to path_string"
        mapping (/etc/path_to_inst file, includes driver name too).

        A pathinfo node is a peer of a devinfo node, but currently
        lacks the formal definition and supporting interfaces to make
        this obvious. Just like a devinfo node, a pathinfo node has a
        "node_name" (inherited from the mpxio 'client' node) and a
        "unit_address" (_bus_addr). Given this, just like a devinfo
        node, a pathinfo node can have a path_string representation and
        a unique path_instance associated with the path_string.

        For a given device, the path_string representation of a
        pathinfo node is identical in value to the devinfo path_string
        had the device *not* been enumerated under MPXIO.
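
        As a minimal sketch of the path_string form described above,
        the following joins one "/node_name@unit_addr" component per
        tree level. The struct and function names are hypothetical and
        are not part of this proposal; the point is only that the same
        composition rule applies whether the final component is a
        devinfo node or a pathinfo node.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of one tree level: node name plus unit address. */
struct path_component {
	const char *node_name;
	const char *unit_addr;
};

/*
 * Join components into "/name@addr/name@addr..." form.
 * Returns 0 on success, -1 if buf is too small.
 */
int
build_path_string(const struct path_component *pc, size_t n,
    char *buf, size_t buflen)
{
	size_t used = 0;

	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + used, buflen - used, "/%s@%s",
		    pc[i].node_name, pc[i].unit_addr);
		/* snprintf reports the length it wanted; detect truncation. */
		if (w < 0 || (size_t)w >= buflen - used)
			return (-1);
		used += (size_t)w;
	}
	return (0);
}
```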

        The path_instance mechanism proposed will persist the
        "path_instance to path_string" mapping across the
        destruction/recreation of a pathinfo node (and across
        detach(9E)/re-attach(9E) of any parent) - keeping the
        path_instance value valid beyond any locking scope. However,
        unlike a devinfo node instance, a path_instance does not
        persist across reboot. The path_instance of a path node is
        available in a di_init(3DEVINFO) snapshot, and can be used to
        direct device access down a specific path using the proposed
        uscsi(7I) extensions.
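
        The persistence property can be illustrated with a small
        user-level model: a table that hands out a path_instance the
        first time a path_string is seen and returns the same value
        when that path_string shows up again (for example after a
        pathinfo node is destroyed and recreated). The table layout,
        limits, and function names below are assumptions made for
        illustration; the in-kernel mechanism may look quite different.
        Zero is never handed out, on the assumption that zero is
        reserved to mean "no specific path".

```c
#include <assert.h>
#include <string.h>

#define	MAX_PATHS	64
#define	MAX_PATHLEN	256

/* Hypothetical in-memory table: slot i holds the path for instance i + 1. */
static char pi_table[MAX_PATHS][MAX_PATHLEN];
static int pi_count;

/*
 * Return the path_instance for path_string, assigning a new one the first
 * time the string is seen. Returns 0 (never a valid instance) on overflow.
 */
int
path_instance_hold(const char *path_string)
{
	assert(path_string != NULL);
	for (int i = 0; i < pi_count; i++) {
		if (strcmp(pi_table[i], path_string) == 0)
			return (i + 1);
	}
	if (pi_count == MAX_PATHS || strlen(path_string) >= MAX_PATHLEN)
		return (0);
	strcpy(pi_table[pi_count], path_string);
	return (++pi_count);
}

/* Map a path_instance back to its path_string, or NULL if unknown. */
const char *
path_instance_to_string(int path_instance)
{
	if (path_instance < 1 || path_instance > pi_count)
		return (NULL);
	return (pi_table[path_instance - 1]);
}
```

        Because the model keeps everything in memory, the mapping is
        lost when the process exits, which mirrors the statement that a
        path_instance does not persist across reboot.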

        The fact that the devinfo/pathinfo path-string representation
        remains the same independent of MPXIO enable/disable, and that
        it provides a physical hardware orientation, make it an ideal
        enhancement to the current dev scheme FMRI representation used
        in ereports.

        NOTE: While a pathinfo node is similar to a devinfo node
        relative to path-string representation, there are also many
        distinguishing features such as:  a driver does not bind to a
        pathinfo node, and a pathinfo node does not have minor_nodes
        associated with it.

        NOTE: Past issues around different unit-address representation
        for devinfo.vs.pathinfo have been resolved (4953227,6274205).

        NOTE: If needed, the path_instance mechanism could be extended
        in the future to apply to devinfo nodes too - providing a way
        to get the ddi_devpathname() of a device independent of the
        state of the devinfo node. One motivation for doing this would
        be if the 1K of stack space currently used by
        ddi_fm_ereport_post(9F) in the nosleep code path ever poses
        a stack-overflow problem. A devinfo path_instance would remove
        the need for that 1K (MAXPATHLEN) stack space. In addition the
        path_instance could be extended to keep both device path and
        device identity - so that different devices seen at the same
        path end up with a different device_instance. This would be
        useful for device paths that don't include identity information
        in the unit-address of the final path component. A change in
        the path_instance of a node might indicate the need to fence
        the access until the identity change is properly coordinated.

      4.2.2 Problems with scsi_pkt(9S)

        The scsi_pkt(9S) structure is the primary mechanism used for
        communication between a SCSA target driver and a SCSA HBA
        driver. This proposal extends the scsi_pkt(9S) structure to
        include path_instance.

        From the target driver perspective: at command issue time a
        non-zero pkt_path_instance requests a command to be sent down a
        specific path (zero indicates vHCI selects a path), and at
        pkt_comp() time the pkt_path_instance communicates which path
        was in fact selected. This means that the pkt_path_instance at
        pkt_comp() time can be used to generate the FMRI of the actual
        hardware path used. At command issue time the pkt_path_instance
        field allows the target driver to select a specific path: one
        of its own choosing, or the specific path indicated by a user
        level application using the uscsi(7I) extensions proposed.
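
        The issue-time and completion-time meaning of pkt_path_instance
        can be sketched with a small model. The structures below are
        hypothetical stand-ins, not the real scsi_pkt(9S) or mdi
        pathinfo structures, and the selection policy (first online
        path) is an assumption. The model also assumes that a request
        for a specific path that is not usable fails rather than
        falling back to another path; the proposal does not state
        which behavior applies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real scsi_pkt(9S) or mdi structures. */
struct model_path {
	int path_instance;	/* non-zero identifier */
	int online;		/* 1 if usable */
};

struct model_pkt {
	int pkt_path_instance;	/* in: 0 = any path; out: path used */
};

/*
 * Pick a path for pkt and record the choice in pkt_path_instance.
 * Returns the chosen path, or NULL if no suitable path exists.
 */
const struct model_path *
model_select_path(struct model_pkt *pkt, const struct model_path *paths,
    size_t npaths)
{
	assert(pkt != NULL);
	for (size_t i = 0; i < npaths; i++) {
		if (!paths[i].online)
			continue;
		/* A non-zero request only accepts the named path. */
		if (pkt->pkt_path_instance != 0 &&
		    pkt->pkt_path_instance != paths[i].path_instance)
			continue;
		/* Report the path actually used back to the caller. */
		pkt->pkt_path_instance = paths[i].path_instance;
		return (&paths[i]);
	}
	return (NULL);
}
```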

        The DDI does not allow a driver to allocate its own
        scsi_pkt(9S), and a driver should not have *any* compiled-in
        dependencies on "sizeof (struct scsi_pkt)": a driver that
        violates these rules limits SCSA's ability to evolve. The
        scsi_pkt allocation rules have been in place for many (>10)
        years; unfortunately, a significant number of drivers are still
        broken - making the scsi_pkt structure difficult to extend.

        As part of this work SCSA will be enhanced to detect HBA
        drivers with scsi_pkt(9S) allocation violations - printing a
        message like

          WARNING: mpt: violates DDI scsi_pkt(9S) allocation rules

        for each driver found in violation (one per boot).

        CRs were filed against some broken HBA drivers a long time ago
        (based on code inspection). Many of these CRs were just closed
        without ever addressing the problem (and some of the drivers
        may be EOLed at this point).

        http://monaco.sfbay.sun.com/detail.jsf?cr=5039931,5039932,5039934,5039935,5039936,5039937,5039938,5039941,5039942

        For initial nevada putback, the message above will be displayed
        (once per driver) for debug kernels. If we are unable to get
        enough drivers fixed, we may need to disable the messages
        completely.

        The best way of fixing scsi_pkt allocation violations is to
        change an HBA driver to use the tran_setup_pkt(9E) interfaces
        defined by [3,4]. If this proves difficult, we may need to
        implement a scsi_pkt_size(9F) peer of the buf(9S) biosize(9F)
        interface.

        To detect these violations, a new scsi_pkt_allocated_correctly()
        interface is provided. While HBA drivers are being fixed,
        access to pkt_path_instance must be conditioned by calls to
        scsi_pkt_allocated_correctly(). For maximum flexibility, HBA
        drivers that enumerate under scsi_vhci (fcp, iscsi, mpt, ibsrp)
        should be fixed first.
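
        The conditioning rule can be sketched as follows. This is a
        user-level model only: the struct is a hypothetical stand-in
        for scsi_pkt(9S), the "allocated_by_framework" flag stands in
        for whatever SCSA really checks, and
        model_pkt_allocated_correctly() is an assumed shape for the
        proposed interface, whose real signature is not given here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for scsi_pkt(9S); only the fields used here. */
struct model_pkt {
	int allocated_by_framework;	/* 1 if SCSA allocated the packet */
	int pkt_path_instance;		/* only safe to use if the above is 1 */
};

/* Stand-in for the proposed check; the real signature may differ. */
static int
model_pkt_allocated_correctly(const struct model_pkt *pkt)
{
	return (pkt != NULL && pkt->allocated_by_framework);
}

/*
 * Store path_instance only when the packet is known to carry the field.
 * Returns 1 if stored, 0 if skipped.
 */
int
model_set_path_instance(struct model_pkt *pkt, int path_instance)
{
	assert(path_instance >= 0);
	if (!model_pkt_allocated_correctly(pkt))
		return (0);
	pkt->pkt_path_instance = path_instance;
	return (1);
}
```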


      4.2.3 Uscsi enhancements allow path selection by path_instance

        This proposal enhances uscsi(7I) to support path_instance based
        path steering via a new uscsi_path_instance "struct uscsi_cmd"
        field and a new USCSI_PATH_INSTANCE uscsi_flags bit. To
        preserve the last remaining uscsi_cmd field (uscsi_reserved_5)
        for future expansion, the input-only uscsi_path_instance field
        overlays the current output-only uscsi_resid field. The common
        scsi_uscsi_alloc_and_copyin() interface (6451061) is enhanced
        to only allow USCSI_PATH_INSTANCE operation to the
        scsi_vhci(7D) HBA, and a new scsi_uscsi_initpkt() is defined to
        isolate scsi_pkt_allocated_correctly() use. This is a compatible
        enhancement to uscsi(7I): the uscsi_path_instance field is only
        considered valid if the new USCSI_PATH_INSTANCE bit is set.
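
        A minimal sketch of why the overlay is compatible: the input
        field is read only when the new flag bit is set, so existing
        callers that leave arbitrary data in the residual field are
        unaffected. The struct below is a hypothetical subset of
        "struct uscsi_cmd", the union is one possible way to express
        the overlay, and the flag value is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value, chosen for illustration only. */
#define	MODEL_USCSI_PATH_INSTANCE	0x8000

/* Hypothetical subset of "struct uscsi_cmd". */
struct model_uscsi_cmd {
	int uscsi_flags;
	union {
		size_t resid;		/* output: residual byte count */
		int path_instance;	/* input: path to send command down */
	} u;
};

/* Input side: returns 0 ("no specific path") unless the flag is set. */
int
model_requested_path(const struct model_uscsi_cmd *cmd)
{
	assert(cmd != NULL);
	if ((cmd->uscsi_flags & MODEL_USCSI_PATH_INSTANCE) == 0)
		return (0);
	return (cmd->u.path_instance);
}
```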

        The MPAPI implementation of path steering provided by [2] makes
        sense for MPAPI applications, but its existence does not
        preclude the implementation of other path steering mechanisms.
        The path steering mechanism provided by this case will be
        easier to use for di_init(3DEVINFO) snapshot consumers.


      4.2.4 ddi_fm_ereport_post/fm_dev_ereport_postv() interface enhancements

        The device tree is not represented by a single data structure
        with embedded type information - the way vnodes are. Instead,
        the device tree is composed of a number of different data
        structures linked together: devinfo nodes (directories VDIR),
        minor nodes (files VBLK/VCHR), and pathinfo nodes (a bit like a
        hardlink VLNK).

        The current ddi_fm_ereport_post(9F) interface is limited to
        just the concept of a devinfo node in conjunction with an
        fma_capable driver bound to the node. The proposal extends the
        set of dev scheme *_fm_ereport_post interfaces to cover the
        other types of nodes in the device tree and to better support
        nexus-child relationships.

        Nexus driver interfaces (ndi_*) are all private. When exposure
        is necessary, nexus concepts are abstracted via DDI
        interfaces. For example the scsi_device(9S) data structure is
        an abstraction of a leaf target-driver devinfo child below a
        SCSA HBA nexus driver.

        For fma, we need basic building-block ddi_fm_*() interfaces
        that can be leveraged when adding fma support to abstractions.
        This proposal delivers a private fm_dev_ereport_postv()
        'va_list' building-block interface, and converts
        ddi_fm_ereport_post() to use fm_dev_ereport_postv(). The
        fm_dev_ereport_postv() interface can support ddi_, ndi_, and
        and abstracted callers. The abstracted callers may understand
        pathinfo node operations and device identity.

      4.2.5 Eversholt

        This fasttrack also adds a minor feature to the "Eversholt"
        language used to describe fault trees [8,9]. The change is
        required for disk FMA work [11] and will not cause any
        compatibility issues. The stability of the Eversholt language
        remains Sun Private and the release binding for the changes
        described here is micro/patch.

        On approval, a new version of the "Eversholt Language Manual"
        will be available [9]. The change adds a new optional property
        to Section 2.3.1.5 "Error Report Events" called
        'discard_if_config_unknown', with the following additions to
        the associated table and text:

          Property                  Required or         Allowed
                                    Optional            Types

          discard_if_config_unknown Optional            Integer


          The 'discard_if_config_unknown' property, when given a
          non-zero value, tells the run-time environment that a failure
          to associate an event with the current configuration should
          result in the event being silently discarded.

        In addition to the Eversholt language change, the fmd eversholt
        run-time environment is enhanced to support the new property,
        and to implement ereport-configuration match based on devid.

        These changes allow kernel generated ereports that include
        devid information to map to the configuration topology based on
        devid. An ereport defined with 'discard_if_config_unknown' that
        fails to find a configuration mapping will silently be
        discarded. This allows us to support structured error logs
        (fmdump -e) on machine configurations where the full storage
        topology is not yet represented.
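
        The run-time decision described above can be sketched as a
        small function. The enum, the function name, and the use of
        plain string comparison for devid matching are assumptions made
        for illustration; the fmd eversholt module may compare devids
        differently and may treat unmatched ereports in other ways.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum ereport_disposition {
	EREPORT_DIAGNOSE,	/* matched the configuration */
	EREPORT_DISCARD,	/* unmatched, silently dropped */
	EREPORT_UNMATCHED	/* unmatched, handled as before */
};

/* Hypothetical: decide what to do with an ereport carrying a devid. */
enum ereport_disposition
model_dispose(const char *ereport_devid, const char *const *config_devids,
    size_t nconfig, int discard_if_config_unknown)
{
	assert(ereport_devid != NULL);
	for (size_t i = 0; i < nconfig; i++) {
		if (strcmp(ereport_devid, config_devids[i]) == 0)
			return (EREPORT_DIAGNOSE);
	}
	/* No configuration match: the property selects the outcome. */
	return (discard_if_config_unknown != 0 ?
	    EREPORT_DISCARD : EREPORT_UNMATCHED);
}
```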

      4.2.6 Prtconf output

        Prtconf output is changed to show path_instance and path_string
        (+ output below).

            Paths from multipath bus adapters:
          +   Path 3: /pci@8,600000/pci@1/SUNW,qlc@5/fp@0,0/ssd@w50020f230000826a,4
              fp#6 (online)
                ...
          +   Path 10: /pci@8,600000/pci@1/SUNW,qlc@4/fp@0,0/ssd@w50020f230000826a,4
              fp#4 (online)
                ...
          +   Path 17: /pci@8,700000/SUNW,emlxs@3,1/fp@0,0/ssd@w50020f230000826a,4
              fp#2 (online)
                ...
          +   Path 24: /pci@8,700000/SUNW,emlxs@3/fp@0,0/ssd@w50020f230000826a,4
              fp#0 (online)
                ...


      4.2.7 fmdump filter enhancement

        The fmdump(1M) CLI introduced by [12] is enhanced to provide a
        new '[-n name[.name]*[=value]]' output filter option. This
        filter option works on both fault logs (fmdump) and error logs
        (fmdump -e). It selects events that have a property with the
        specified name and, if a value is given, whose property has
        the specified value. For string properties the value can be a
        regular expression. Support for embedded nvlist property
        filtering is provided by using a name that crosses multiple
        levels, with each level separated by a '.'. Examples:

          # fmdump -e | wc -l
          5
          # fmdump -en detector.devid | wc -l
          4
          # fmdump -en \
            'detector.devid=id1,sd@THITACHI_DK32EJ-36NC_____433H8282' | wc -l
          2
          # fmdump -en 'detector.devid=.*HITACHI.*' | wc -l
          2
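
        The matching rule for the -n option can be sketched in C using
        POSIX regcomp()/regexec(). This is a model, not the fmdump
        implementation: the flattened property list is a hypothetical
        stand-in for nested nvlists (names joined with '.'), only
        string properties are handled, and the regular expression is
        applied unanchored with REG_EXTENDED, which is an assumption
        about fmdump behavior.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened property: nested nvlist names joined with '.'. */
struct model_prop {
	const char *name;	/* e.g. "detector.devid" */
	const char *value;	/* string value */
};

/*
 * Return 1 if the property list satisfies spec "name[=regex]", else 0.
 * A spec without "=regex" only requires that the property exist.
 */
int
model_filter_match(const char *spec, const struct model_prop *props,
    size_t nprops)
{
	const char *eq;
	size_t namelen;
	regex_t re;
	int matched = 0;

	assert(spec != NULL);
	eq = strchr(spec, '=');
	namelen = (eq != NULL) ? (size_t)(eq - spec) : strlen(spec);
	if (eq != NULL && regcomp(&re, eq + 1, REG_EXTENDED | REG_NOSUB) != 0)
		return (0);	/* treat a bad pattern as "no match" */
	for (size_t i = 0; i < nprops && !matched; i++) {
		if (strlen(props[i].name) != namelen ||
		    strncmp(props[i].name, spec, namelen) != 0)
			continue;
		matched = (eq == NULL) ||
		    regexec(&re, props[i].value, 0, NULL, 0) == 0;
	}
	if (eq != NULL)
		regfree(&re);
	return (matched);
}
```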

   4.4  Interface Tables

        ------------------------------------------------------------------------
        INTERFACES BEING REMOVED     Old
        Interface                    Level      Comments
        ------------------------------------------------------------------------
        libdevinfo(3LIB):
          di_path_phci_path          Cons.Priv. ARCed and defined in
                                                libdevinfo.h, but never
                                                implemented. It is
                                                better to match
                                                di_devfs_path()
                                                structure with new
                                                di_path_devfs_path()
                                                below

          di_path_client_path        Cons.Priv. same, see
                                                di_path_client_devfs_path()
                                                below.

          di_path_addr               Cons.Priv. Currently unused.
                                                Mismatch with di_bus_addr()
                                                peer. See di_path_bus_addr()
                                                below.

        ------------------------------------------------------------------------
        INTERFACES BEING RENAMED     Old Level  Comments
        AND PROMOTED                 New Level
        Interface                     
        ------------------------------------------------------------------------
        libdevinfo(3LIB):
          (old)di_path_next_phci->   Cons.Priv. Confusing name: caller
          (new)di_path_client_next_path         starts with a *client*
                                     Evol.      dip (not a phci dip)
                                                and iterates through
                                                paths associated with
                                                *client*. Also, a
                                                client can have
                                                multiple paths
                                                associated with the
                                                same phci.

          (old)di_path_next_client-> Cons.Priv. Confusing name: caller
          (new)di_path_phci_next_path           starts with a *phci*
                                     Evol.      dip (not a client) and
                                                iterates through paths
                                                associated with phci.

        ------------------------------------------------------------------------
        INTERFACES BEING PROMOTED 
        Interface                       Level   Comments
        ------------------------------------------------------------------------
        libdevinfo(3LIB): (from PSARC/1999/647 [1]: Cons.Priv. -> Evol.):

           DINFOPATH                    Evol.   di_init() flag to snapshot path

                                                get . associated with path
           di_path_client_node          Evol.   .client
           di_path_phci_node            Evol.   .phci

                                                get path property as . array
           di_path_prop_bytes           Evol.   .byte
           di_path_prop_int64s          Evol.   .int64
           di_path_prop_ints            Evol.   .integer
           di_path_prop_strings         Evol.   .string

                                                look for property as . array
           di_path_prop_lookup_bytes    Evol.   .byte
           di_path_prop_lookup_int64s   Evol.   .int64
           di_path_prop_lookup_ints     Evol.   .integer
           di_path_prop_lookup_strings  Evol.   .string

           di_path_prop_name            Evol.   get name of path property
           di_path_prop_type            Evol.   get type of path property

           di_path_prop_next            Evol.   walk path properties

           di_path_state                Evol.   get path state
           DI_PATH_STATE_FAULT          Evol.   failed, not currently in use
           DI_PATH_STATE_OFFLINE        Evol.   not available to rcv/xmit data
           DI_PATH_STATE_ONLINE         Evol.   available to rcv/xmit data
           DI_PATH_STATE_STANDBY        Evol.   up, but not currently in use

        ------------------------------------------------------------------------
        INTERFACES BEING PROMOTED 
        Interface                       Level   Comments
        ------------------------------------------------------------------------
        libdevinfo(3LIB): (PSARC/2000/310 [13]: Cons.Priv. -> Evol.):
                          (PSARC/2002/239 [14]: Cons.Priv. -> Evol.):

           di_devlink_init()            Evol.   Obtain a snapshot of
                                                the devlink database.
             di_devlink_handle_t        Evol.   opaque handle to snapshot.
             DI_MAKE_LINK               Evol.   Update /dev
           di_devlink_fini()            Evol.   Destroy snapshot.

           di_devlink_walk()            Evol.   Walk links in snapshot.
             di_devlink_t               Evol.   opaque handle to devlink.

           di_devlink_path()            Evol.   Get devlink path.
           di_devlink_content()         Evol.   Get devlink contents.
           di_devlink_type()            Evol.   Get devlink type.
             DI_PRIMARY_LINK            Evol.   devlink to /devices
             DI_SECONDARY_LINK          Evol.   devlink to devlink

           di_devlink_dup()             Evol.   Copy a devlink object
           di_devlink_free()            Evol.   Free a devlink object

        ------------------------------------------------------------------------
        NEW INTERFACES
        Interface                       Level   Comments
        ------------------------------------------------------------------------
        libdevinfo(3LIB):
           di_path_node_name()          Evol.   di_path_t peer of
                                                di_node_t oriented
                                                di_node_name().

           di_path_bus_addr()           Evol.   di_path_t peer of
                                                di_node_t oriented
                                                di_bus_addr().  Also fixing
                                                CR6284426 "di_path_addr
                                                should have its second
                                                argument removed".

           di_path_instance()           Evol.   di_path_t peer of
                                                di_node_t
                                                di_instance().

           di_path_devfs_path()         Evol.   di_path_t peer of
                                                di_node_t
                                                di_devfs_path().

           di_path_client_devfs_path()
                                        Evol.   di_path_t peer of
                                                di_node_t
                                                di_devfs_path().

           di_path_private_get()        Evol.   di_path_t peer of
                                                di_node_t
                                                di_node_private_get().

           di_path_private_set()        Evol.   di_path_t peer of
                                                di_node_t
                                                di_node_private_set().


           di_lookup_node()          Cons.Priv. look up the di_node_t
                                                for a /devices path in
                                                a snapshot.

           di_lookup_path()          Cons.Priv. di_path_t peer of
                                                di_node_t
                                                di_lookup_node().

           di_devfs_path_match()     Cons.Priv. check to see if two
                                                /devices paths are the
                                                same, ignoring any
                                                generic.vs.non-generic
                                                node name mismatches.

        uscsi(7I):
           .uscsi_path_instance      Cons.Priv. new "struct uscsi_cmd" field
                                                path_instance to send
                                                command down if
                                                USCSI_PATH_INSTANCE
                                                set.

           USCSI_PATH_INSTANCE       Cons.Priv. new uscsi_flags bit,
                                                send command down
                                                specific path

           scsi_uscsi_initpkt()      Cons.Priv. New uscsi setup
                                                interface for target
                                                driver implementing
                                                uscsi path_instance.

        scsi_pkt(9S):
           .pkt_path_instance        Cons.Priv. new scsi_pkt(9S) field holding
                                                path_instance.

           FLAG_PATH_INSTANCE        Cons.Priv. new pkt_flags bit, send
                                                command down specific
                                                path.

           FLAG_PATH_INSTANCE_RPT    Cons.Priv. new pkt_flags bit,
                                                path_instance reports
                                                path used.

           scsi_pkt_pathname()       Cons.Priv. Given scsi_pkt and
                                                scsi_device, return
                                                /devices path_string
                                                used to send command.

        mdi:
           .pi_path_instance         Cons.Priv. pathinfo node's path_instance

           MDI_SELECT_PATH_INSTANCE  Cons.Priv. New mdi_path_select()
                                                method.

           mdi_pi_get_path_instance()Cons.Priv. Return path_instance
                                                given pathinfo.

           mdi_pi_pathname()         Cons.Priv. pathinfo peer of
                                                devinfo
                                                ddi_pathname().

           mdi_pi_pathname_by_instance()
                                     Cons.Priv. Return path_string
                                                given path_instance.

        fma:

           ndi_fm_ereport_post()     Cons.Priv. ndi form of
                                                ddi_fm_ereport_post(9F).

           fm_dev_ereport_postv()    Cons.Priv. Common implementation
                                                code (and used by
                                                implementation of
                                                scsi_fm_ereport_post()).

        eversholt:

           discard_if_config_unknown Private    optional property of
                                                .esc 'ereport'
                                                declaration.


        prtconf(1M):

           prtconf -v (output)       Not.an.    show path_string
                                     Interface

        fmdump(1M):

           fmdump -n name[.name]*[=value]
                                        Evol.   -n filter option.

           fmdump -n (output)        Not.an.
                                     Interface


    4.5 References

        [1] Multiplexed I/O Framework
            http://sac.sfbay/PSARC/1999/647
            http://www.opensolaris.org/os/community/arc/caselog/1999/647

        [2] mpxio path steering (MPAPI)
            http://sac.sfbay/PSARC/2006/621
            http://www.opensolaris.org/os/community/arc/caselog/2006/621

        [3] new scsi_hba_tran entry points (scsi_pkt)
            http://sac.eng.sun.com/PSARC/2005/680/mail
            http://www.opensolaris.org/os/community/arc/caselog/2005/680

        [4] scsa dma enhancement (scsi_pkt)
            http://sac.eng.sun.com/PSARC/2006/240/mail
            http://www.opensolaris.org/os/community/arc/caselog/2006/240

        [5] Dev scheme specification - Section 8.4.3
            http://fma.eng/documents/engineering/protocol_whtppr.pdf

        [6] libdevinfo reimplementation
            http://sac.sfbay/PSARC/1997/127/commit.materials/devinfo.pdf

        [7] MDI/pHCI/libdevinfo Extensions for SNIA MPAPI support
            http://sac.sfbay/PSARC/2005/646

        [8] Eversholt Diagnosis Technology
            http://sac.eng/PSARC/2003/428

        [9] Eversholt Language Manual (Version 1.5 10/04/06)
            http://eversholt.central/docs/language/

        [10]Generic Topology for Internal Disks
            http://sac.sfbay/PSARC/2007/388/mail
            http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.016.DiskTopology

        [11]Unified Disk FMA
            http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.015.UnifiedDisk

        [12]Solaris Fault Management Daemon
            http://sac.sfbay/PSARC/2003/089/

        [13]libdevinfo devlinks interfaces
            http://sac.sfbay/PSARC/2000/310

        [14]Devlink Creation Enhancements
            http://sac.sfbay/PSARC/2002/239

    4.6 Man page changes

        The materials directory in the case directory has information
        for the following new/changed man pages.

        Deliver (Evolving):
            di_devfs_path.3devinfo.diff
            di_devlink_dup.3devinfo
            di_devlink_init.3devinfo
            di_devlink_path.3devinfo
            di_devlink_walk.3devinfo
            di_init.3devinfo.diff
            di_lnode_private_set.3devinfo.diff
            di_path_info.3devinfo
            di_path_next.3devinfo
            di_path_prop_access.3devinfo
            di_path_prop_lookup.3devinfo
            di_path_prop_next.3devinfo
            fmdump.1m.diff
            libdevinfo.3lib.diff

            Eversholt.index.html.diff.txt

        Information in Case directory (Cons.Priv.):
            uscsi.7i.diff


6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

