Template Version: @(#)sac_nextcase 1.69 02/15/10 SMI
This information is Copyright 2010 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         IPoIB Administration Enhancement
    1.2. Name of Document Author/Supplier:
         Author:  Sudhakar Dindukurti
    1.3. Date of This Document:
        10 March, 2010
4. Technical Description

4.1 Acronyms

HCA               : Host Channel Adapter
P_Key             : Partition Key
IPoIB             : IP over InfiniBand
GUID              : Globally Unique Identifier
IBTF              : InfiniBand Transport Framework
IBCM              : InfiniBand Communication Manager
SDP               : Sockets Direct Protocol
IBD               : current IPoIB Solaris driver name
ULP               : Upper Layer Protocol

4.2 Requirements/Motivation
---------------------------

4.2.1 Consistent IBD device node naming across nodes in a cluster

(Amber Road requirement, CR 6864899)

The Amber Road clustering software requires that the datalink name for a
specific partition be the same on every node of a cluster (assuming
identical h/w configuration). For example, if the datalink name on node 1
for P_Key 0x8001 on HCA 1, port 1 is ibnet0, then the clustering software
expects the datalink for P_Key 0x8001 on HCA 1, port 1 of node 2 to be
named ibnet0 as well. This is very difficult to achieve today. The IPoIB
driver constructs the device name automatically by appending the instance
number to the driver name (e.g. ibd0, ibd1), so the clustering software
has no control over the IPoIB device name space. In addition, IBTF does
not guarantee that the same names are assigned across multiple nodes.

4.2.2 Problem diagnosis issue

IPoIB (PSARC 2001/289, PSARC 2007/636 and PSARC 2009/593) is implemented
as the ibd(7D) driver in Solaris. When customers have problems bringing up
IBD devices on Solaris, they are advised to take a number of steps to
gather information about the IPoIB link, which is a usability issue.
Today, users need to run cfgadm(1M) to obtain the port GUID information,
run 'ls -l' on the /dev/ibd* nodes to determine the IB partition that an
IBD device belongs to, and so on. There are also instances when it is
necessary to obtain the P_Key corresponding to the IBD device and the HCA
port that it is bound to. We therefore want to simplify the process of
gathering IPoIB-related configuration information.

4.2.3 Update IBD driver to use Brussels framework (CR 6883212)

IPoIB tunables are managed today through the /etc/system file or ibd.conf.
For example, to modify the 'linkmode' of an IPoIB link, one needs to edit
ibd.conf and reboot the system. We want to replace this interface so that
users can manage these tunables using dladm(1M).

4.3 Proposal

A micro/patch binding is asserted for this proposal.

4.3.0 IPoIB administration with dladm(1M)
----------------------------------------

This case proposes a new IPoIB administration mechanism for InfiniBand
network datalinks using the dladm(1M) command. It also proposes to add a
consolidation-private, InfiniBand-specific library API to libdladm(3LIB).
The mechanism is very similar to the existing dladm(1M) administration
sub-commands for VLANs, VNICs, etc.

In the new model, two classes of IPoIB datalinks will exist:

     1. Datalinks representing the physical  IB ports, which will use the
        existing "phys" class that is used for  Ethernet, WiFi, and other
        physical media.  As  with all  "phys" class  objects,  the system
        will create these automatically.

     2. Datalinks  representing the  administratively  created partitions
        over "phys" IPoIB objects,  which  will  use a  new "part" class.
        Each IB partition datalink will be  associated with a P_Key  in a
        manner  analogous  to the  way  each   Ethernet  VLAN datalink is
        associated with a VLAN ID.

Note  that  unlike other  "phys"  class objects, IB "phys" objects cannot
send data (since  IB requires a  P_Key to  send data)  and thus cannot be
plumbed.

Example configuration used to explain the new IPoIB administration model:

               --------------------------------------
              |              IB switch               |
               --------------------------------------
                   |                         |
                   |                         |
                   |                         |
         Port1 |   | Port2           Port1   |   | Port2
               --------------------------------------
              |  HCA1                         HCA2   | 
              |              Node 1                  |
               --------------------------------------

Port 2 of HCA1 and Port 1 of HCA2 are connected to the IB switch, and the
SM is running on the switch. Each port is configured with two P_Keys
(0xffff and 0x8001).
 

4.3.1 Physical datalinks
-------------------------

One physical datalink will be created by default per port per HCA. These
physical links serve as administrative and observability data points. IB
partitions can be created over these IB physical datalinks in the same
way that VNICs are created on Ethernet physical links or link
aggregations. IB physical datalinks are not used for data transfers, so
plumbing them and assigning an IP address to them are not supported. In
the future, these physical datalinks can be used for 1) implementing
port-level statistics, 2) implementing port-level snoop, etc.

Example 1: Display the physical datalinks

# dladm show-phys
LINK     MEDIA        STATE      SPEED     DUPLEX     DEVICE
ibp0    InfiniBand     up        8000      unknown    ibp0
ibp1    InfiniBand    down       8000      unknown    ibp1
ibp2    InfiniBand    down       2000      unknown    ibp2
ibp3    InfiniBand     up        2000      unknown    ibp3

The state of the physical link corresponds directly to the state of the IB
HCA port. As you might expect, other generic sub-commands such as
rename-link, show-link, delete-phys, etc. also work on IB datalinks.

4.3.2 IB partition Objects
--------------------------

IB partition objects represent a new "part" class of datalink, and these
objects are managed using the new dladm(1M) sub-commands. All the new
sub-command interfaces are similar to the VLAN/VNIC dladm(1M)
sub-commands. IB partition datalinks can be created on top of IB physical
links, one per P_Key on the port. These links are used for data transfers.
The updated dladm(1M) man page describes the different commands available
for managing the IB partition links.
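
As a rough illustration of the "one partition link per P_Key per port"
rule, the following Python sketch builds dladm create-part command lines
for a set of ports. The p<pkey>.<link> naming pattern simply mirrors the
names used in the examples of this document; it is an assumed convention,
not something dladm(1M) enforces, and both helper functions are
hypothetical.

```python
def part_link_name(pkey, phys_link):
    """Build a name such as 'p8001.ibp0' (convention assumed from the examples)."""
    return "p%04x.%s" % (pkey, phys_link)


def create_part_commands(pkeys_by_link):
    """Return one 'dladm create-part' command per (physical link, P_Key) pair."""
    commands = []
    for phys_link in sorted(pkeys_by_link):
        for pkey in sorted(pkeys_by_link[phys_link]):
            name = part_link_name(pkey, phys_link)
            commands.append(
                "dladm create-part -l %s -P 0x%04x %s" % (phys_link, pkey, name))
    return commands


# Hypothetical input: the P_Keys configured on two physical links.
for command in create_part_commands({"ibp2": [0xFFFF, 0x8001],
                                     "ibp3": [0xFFFF, 0x8001]}):
    print(command)
```

Because the names are derived only from the P_Key and the physical link
name, running the same script on identically configured nodes would yield
the same datalink names, which is the property asked for in 4.2.1.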

4.3.2.1 create-part

      Creates a new IB partition object with the specified datalink name.

      Example 2: Create an IB partition link for the P_Key 0x8001 on
                 top of the ibp0 physical datalink

      # dladm create-part -l ibp0 -P 0x8001 p8001.ibp0

      The above command succeeds if the port is "up", the P_Key is present
      on the port, and IPoIB successfully completes its initialization. On
      subsequent reboots after a successful initial creation, the partition
      will be available even if the port is "down" or the P_Key is absent,
      though the datalink state will be marked as "down".

      Example 3: Create an IB partition link for the P_Key 0x9000 on
                 top of ibp2

      # dladm create-part -f -l ibp2 -P 0x9000 p9000.ibp2

      The force option "-f" allows the IB partition to be created even when
      the P_Key is not present or the port is down. In that case the link
      state will be marked as "down". It will be updated to "up" when the
      P_Key is added to the port and the port is activated.
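
      The creation rules described for Examples 2 and 3 can be summarized
      in a small Python sketch. It is only a model of the behaviour stated
      in this document, not the dladm(1M) implementation: the function name
      is hypothetical, and a failure of IPoIB initialization is not
      modelled.

```python
def create_part_allowed(port_up, pkey_present, force=False):
    """Return True if a first-time create-part request would be accepted.

    Without the force option, the port must be up and the P_Key must be
    present on the port. With the force option (-f), the partition is
    created regardless of those two conditions.
    """
    if force:
        return True
    return port_up and pkey_present


# Example 2: port up and P_Key present, no -f needed.
print(create_part_allowed(port_up=True, pkey_present=True))
# Example 3: P_Key absent, accepted only because -f is given.
print(create_part_allowed(port_up=True, pkey_present=False, force=True))
```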

      Example 4: Plumb and assign an IP address to IB partition p9000.ibp2

      # ifconfig p9000.ibp2 plumb 1.1.1.1 up

      # ifconfig -a

      p9000.ibp2: flags=1000843<UP,BROADCAST,RUNNING,
           MULTICAST,IPv4> mtu 2044 index 3
        inet 1.1.1.1 netmask ff000000 broadcast 1.255.255.255

      Example 5: Display the partition links using the show-link command

      # dladm show-link
      LINK        CLASS     MTU    STATE    BRIDGE       OVER
      p8001.ibp0  part      65520  unknown    --         ibp0
      p9000.ibp2  part      65520  down       --         ibp2

4.3.2.2 delete-part

      Deletes the specified IB partition object.

      Example 6: Delete the partition p8001.ibp0

      # dladm delete-part p8001.ibp0

      The above command deletes the partition.

      Example 7: Show the partition link information after
                 deleting p8001.ibp0

      # dladm show-part

      LINK          PKEY      OVER      STATE      FLAGS
      p9000.ibp2    9000      ibp2      down       f---

4.3.2.3 show-part

      Displays IB partition object information.

      Example 8: Display the IB partition link information (the output
                 below reflects the "part" operations in Examples 2, 3
                 and 4)

      # dladm show-part

      LINK          PKEY      OVER      STATE      FLAGS
      p8001.ibp0    8001      ibp0      unknown    ----
      p9000.ibp2    9000      ibp2      down       f---

      The state of an IB partition link will be "unknown" after the IB
      partition is created and before it is plumbed. Once the partition is
      plumbed, the link state will be set to "up" when the link is ready to
      use. The state of the link will be set to "down" if 1) the HCA port
      is down, 2) the P_Key is absent, or 3) the broadcast group is absent.
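
      The state rules above can be written down as a small decision
      function. The Python sketch below is only an illustration of those
      rules; the function and parameter names are hypothetical. It assumes
      that any of the three "down" conditions takes precedence over
      "unknown", which matches the forced-creation case in Example 3 but is
      not stated explicitly for the broadcast group.

```python
def part_link_state(plumbed, port_up, pkey_present, bcast_group_present):
    """Return the expected STATE column value for an IB partition link."""
    # Any of the three conditions listed in the text forces "down".
    if not (port_up and pkey_present and bcast_group_present):
        return "down"
    # Created but not yet plumbed: the state is not yet known.
    if not plumbed:
        return "unknown"
    # Plumbed and all prerequisites satisfied: the link is ready to use.
    return "up"


# p8001.ibp0 in Example 8: created on a healthy port, not plumbed.
print(part_link_state(False, True, True, True))
# p9000.ibp2 in Example 8: force-created, P_Key not present on the port.
print(part_link_state(True, True, False, True))
```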

4.3.2.4 show-ib

    Displays IB-specific information such as port number, port GUID, etc.

    Example 9: Show IB specific information

    # dladm show-ib
    LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
    ibp0         3BA000100CD7C   3BA000100CD7D   1    down   FFFF
    ibp1         3BA000100CD7C   3BA000100CD7E   2    down   FFFF
    ibp3         5AD0000033634   5AD0000033636   2    up     FFFF,8001
    ibp2         5AD0000033634   5AD0000033635   1    up     FFFF,8001

    The show-ib sub-command displays only the physical links, along with
    the port GUID, port number, HCA GUID, and the P_Keys present on the
    port at the time the command is run.
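
    To show how this output could replace the manual cfgadm(1M) and
    'ls -l' steps described in 4.2.2, the following Python sketch parses
    the sample output of Example 9 and lists the physical links that are
    up and carry a given P_Key. It assumes whitespace-separated columns
    exactly as printed in Example 9; the real output format may differ,
    and the helper names are hypothetical.

```python
SHOW_IB_OUTPUT = """\
LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
ibp0         3BA000100CD7C   3BA000100CD7D   1    down   FFFF
ibp1         3BA000100CD7C   3BA000100CD7E   2    down   FFFF
ibp3         5AD0000033634   5AD0000033636   2    up     FFFF,8001
ibp2         5AD0000033634   5AD0000033635   1    up     FFFF,8001
"""


def parse_show_ib(text):
    """Turn show-ib style text into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        row["PORT"] = int(row["PORT"])
        # P_Keys are printed as comma-separated hexadecimal values.
        row["PKEYS"] = [int(pkey, 16) for pkey in row["PKEYS"].split(",")]
        rows.append(row)
    return rows


def up_links_with_pkey(rows, pkey):
    """Return the links whose port is up and has the given P_Key."""
    return [row["LINK"] for row in rows
            if row["STATE"] == "up" and pkey in row["PKEYS"]]


print(up_links_with_pkey(parse_show_ib(SHOW_IB_OUTPUT), 0x8001))
```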

4.3.3 IB partition object Administration Library 

The libdladm(3LIB) library is currently used to implement datalink
administration for all the GLDv3 datalinks (VNIC, link aggregation,
wireless, IP tunnel, etc.). The library will be further enhanced to provide
administrative interfaces for InfiniBand partition objects. All the new
library extensions are similar to the VLAN/VNIC libdladm extensions. The
IB-specific functionality will be implemented in libdlib.c, and its
interface is provided via libdlib.h. dladm(1M) will use this library for
managing IB partitions. The IB partition administration library provides a
persistent repository that allows the IB partition configuration to be
preserved across reboots. It uses the existing dladm(1M) repository,
/etc/dladm/datalink.conf, to store the IB partition configuration.

The InfiniBand-specific extensions to the libdladm library are listed
below. For more details on each API, see the man pages in the PSARC
materials directory. The man pages are only for PSARC review (not intended
for public use).

4.3.3.1 dladm_part_create()
    Creates an IB partition object

4.3.3.2 dladm_part_delete()
    Deletes an IB partition object

4.3.3.3 dladm_part_info()
    Returns IB partition Object attributes

4.3.3.4 dladm_ib_info()
    Returns IB specific attributes such as port number, port guid,
    and HCA GUID.

4.3.3.5 dladm_part_up()
    Brings up one or all of the IB partition objects at every boot.

4.3.4 IBTF Extensions
---------------------

Some of the IB ULPs, such as SDP and IBCM, walk the device tree and read
the "port-pkey", "hca-guid", and "port-guid" properties from the ibd(7D)
device instance. In the new model, IB partitions are just objects (no
longer device nodes), so IB ULPs can no longer retrieve partition
attributes from the device node. The following new IBTF extensions provide
a mechanism to retrieve the partition attributes in the kernel. See the
ibt_get_part_attrs() man page in the PSARC case directory for more details
about these APIs. The man pages are only for PSARC review (not intended
for public use).

4.3.4.1 ibt_get_part_attrs()
      Returns the attributes of a requested IB partition 

4.3.4.2 ibt_get_all_part_attrs() 
      Returns the attributes of all the active IB partitions

4.3.4.3 ibt_free_part_attrs()
      Frees the memory for partition attribute structure allocated by
      ibt_get_all_part_attrs()

4.3.5 Man page changes 
----------------------

    Updated man pages - published
         dladm(1M)
         datadm(1M)
         dat.conf(4)
         ibp(7D)  (renamed from ibd(7D))
    
    New Man pages - internal only
         dladm_part_create(3dladm)
         dladm_part_delete(3dladm)
         dladm_part_info(3dladm)
         dladm_ib_info(3dladm)
         dladm_part_up(3dladm)

         ibt_get_part_attrs(9f)

4.3.6 Interface table
---------------------

 -----------------------------------------------------------------------
|    Interface name                      |  Commitment Level            |
 -----------------------------------------------------------------------
|  dladm(1M)  extensions                                                |
 -----------------------------------------------------------------------
|      create-part                       |                              |
|      delete-part                       | Committed                    |
|      show-ib                           |                              |
|      show-part                         |                              |
 -----------------------------------------------------------------------
|  libdladm extensions   (libdlib.h)                                    |
 -----------------------------------------------------------------------
|     dladm_part_create                  |                              |
|     dladm_part_delete                  |                              |
|     dladm_part_info                    |                              |
|     dladm_ib_info                      | ON Consolidation Private     |
|     dladm_part_up                      |                              |
|     dladm_part_attr_t                  |                              |
|     dladm_ib_attr_t                    |                              |
|     DLADM_IBPART_FORCE_CREATE          |                              |
 -----------------------------------------------------------------------
|  InfiniBand Specific Link Properties                                  |
 -----------------------------------------------------------------------
|     linkmode                           | ON Consolidation Private     |
 -----------------------------------------------------------------------
|  libdlmgmt.h                                                          |
 -----------------------------------------------------------------------
|     DATALINK_CLASS_IBPART              | ON Consolidation Private     |
 -----------------------------------------------------------------------
|  libdladm.h                                                           |
 -----------------------------------------------------------------------
|     DLADM_STATUS_INVALID_PORT_INSTANCE |                              |
|     DLADM_STATUS_PORT_IS_DOWN          |                              |
|     DLADM_STATUS_PKEY_NOT_PRESENT      |                              |
|     DLADM_STATUS_PARTITION_EXISTS      | ON Consolidation Private     |
|     DLADM_STATUS_INVALID_PKEY          |                              |
|     DLADM_STATUS_NO_HW_RESOURCE        |                              |
|     DLADM_STATUS_INVALID_PKEY_TBL_SIZE |                              |
 -----------------------------------------------------------------------
|  IBTF extensions                                                      |
 -----------------------------------------------------------------------
|      ibt_get_part_attrs()              |                              |
|      ibt_get_all_part_attrs()          | ON Consolidation Private     |
|      ibt_free_part_attrs()             |                              |
|      ibt_part_attr_t                   |                              |
 -----------------------------------------------------------------------
|   New status codes (ibt_status_t updates)                             |
 -----------------------------------------------------------------------
|      IBT_NO_SUCH_OBJECT                | ON Consolidation Private     |
 -----------------------------------------------------------------------

4.3.7 References
---------------------

PSARC 2001/289 IP over InfiniBand
PSARC 2007/636 IPoIB Conversion to GLDv3
PSARC 2009/593 IPoIB Connected Mode

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open
