Okay, here are some questions; some of these stem from my lack of IPoIB 
knowledge, so I hope you'll pardon me if I ask things that seem obvious 
or stupid.

1) Will the new "phys" objects being created here be "snoop"-able?  (I 
understand that they can't transmit data, but can they operate in some 
sort of promiscuous receive mode?)  (Oh wait, it seems that this might 
be answered in part by 4.3.1.   It looks like the answer is "not yet, 
but in the future".  Correct?)

2) From my experience with "hermon", hermon seems mostly to treat both 
ports as a single registered entity to the IB framework.  I'm presuming 
that this won't preclude IPoIB from being able to identify which port is 
which?  Are there any special considerations here for automatic path 
migration?

3) Are there any constraints on the format of the name used for 
"part-link" in dladm?  The examples seem to show a specific format, but 
can customers choose any name they want?

4) The way this is designed seems to depend on a notion of a physical 
port.  Does IPoIB have anything that is morally equivalent to Ethernet 
aggregations (trunking)?  I'm thinking about link redundancy as well as 
bandwidth multiplication.

     - Garrett

On 03/10/10 03:52 PM, Ted Kim wrote:
> Template Version: @(#)sac_nextcase 1.69 02/15/10 SMI
> This information is Copyright 2010 Sun Microsystems
> 1. Introduction
>      1.1. Project/Component Working Name:
>        IPoIB Administration Enhancement
>      1.2. Name of Document Author/Supplier:
>        Author:  Sudhakar Dindukurti
>      1.3  Date of This Document:
>       10 March, 2010
> 4. Technical Description
>
> 4.1 Acronyms
>
> HCA               : Host Channel Adaptor
> P_Key             : Partition Key
> IPoIB             : IP over InfiniBand
> GUID              : Global Unique Identifier
> IBTF              : InfiniBand Transport Framework
> IBCM              : InfiniBand Communication Manager
> SDP               : Sockets Direct Protocol
> IBD               : current IPoIB Solaris driver name
> ULP               : Upper level Protocol
>
> 4.2 Requirements/Motivation
> ---------------------------
>
> 4.2.1 Consistent IBD device node naming across nodes in a cluster
>
> (Amber Road Requirement (CR 6864899))
>
> The AmberRoad clustering software requires that the datalink name for a
> specific partition be the same on every node (given identical h/w
> configuration). For example, if the datalink name on node 1 for P_Key
> 0x8001 on HCA 1, port 1 is ibnet0, then the clustering software expects
> the datalink name for P_Key 0x8001 on HCA 1, port 1 to also be ibnet0 on
> node 2 in the cluster. This is very difficult to achieve today. The
> problem is that the IPoIB device name is constructed automatically by the
> IPoIB driver by appending the instance number to the driver name (e.g.
> ibd0, ibd1, etc.). So, the clustering software does not have any control
> over the IPoIB device name space. Also, the IBTF framework does not
> guarantee the same names across multiple nodes.
>
> 4.2.2 Problem diagnosis issue
>
> IPoIB (PSARC 2001/289, PSARC 2009/593 and PSARC 2007/636) is implemented
> as the ibd(7D) driver in Solaris. When customers face problems in
> bringing up the IBD devices on Solaris, they are advised to take a number
> of steps to gather information about the IPoIB link, which results in
> usability issues. Today, users need to run cfgadm(1M) to obtain the port
> GUID information, do 'ls -l' on the /dev/ibd* nodes to determine the IB
> partition that an IBD device belongs to, etc. There are also instances
> when it is necessary to obtain the P_Key corresponding to the IBD device
> and the HCA port that it is bound to. So, we want to improve the process
> of gathering IPoIB related configuration information.
>
> 4.2.3 Update IBD driver to use Brussels framework (CR 6883212)
>
> IPoIB tunables are managed today through the /etc/system file or ibd.conf.
> For example, to modify the 'linkmode' of the IPoIB link, one needs to edit
> ibd.conf and reboot the system. We want to replace this interface so that
> users can manage these tunables using dladm(1M).
>
> 4.3 Proposal
>
> A micro/patch binding is asserted for this proposal.
>
> 4.3.0 IPoIB administration with dladm(1M)
> ----------------------------------------
>
> This case proposes a new IPoIB administration mechanism for InfiniBand
> network datalinks using the dladm(1M) command. It also proposes adding
> a consolidation-private, InfiniBand-specific library API to
> libdladm(3LIB). The mechanism is very similar to the existing VLAN, VNIC,
> etc. administration dladm(1M) sub-commands.
>
> In the new model, two classes of IPoIB datalinks will exist:
>
>       1. Datalinks representing the physical  IB ports, which will use the
>          existing "phys" class that is used for  Ethernet, WiFi, and other
>          physical media.  As  with all  "phys" class  objects,  the system
>          will create these automatically.
>
>       2. Datalinks  representing the  administratively  created partitions
>          over "phys" IPoIB objects,  which  will  use a  new "part" class.
>          Each IB partition datalink will be  associated with a P_Key  in a
>          manner  analogous  to the  way  each   Ethernet  VLAN datalink is
>          associated with a VLAN ID.
>
> Note  that  unlike other  "phys"  class objects, IB "phys" objects cannot
> send data (since  IB requires a  P_Key to  send data)  and thus cannot be
> plumbed.
>
> Example configuration used to  explain the new IPoIB administration model
>
>                 --------------------------------------
>                |              IB switch               |
>                 --------------------------------------
>                     |                         |
>                     |                         |
>                     |                         |
>           Port1 |   | Port2           Port1   |   | Port2
>                 --------------------------------------
>                |  HCA1                         HCA2   |
>                |              Node 1                  |
>                 --------------------------------------
>
> Port 2 of HCA1 and Port 1 of HCA2 are connected to the IB switch, and the
> SM is running on the switch. Each port is configured with two P_Keys
> (0xffff & 0x8001).
>
>
> 4.3.1 Physical datalinks
> -------------------------
>
> One physical datalink will be created by default per port per HCA. These
> physical links serve as administrative & observability data points. These
> IB physical datalinks allow creating IB partitions over them, similar
> to creating VNICs on Ethernet physical links or link aggregations. IB
> physical datalinks are not used for data transfers, so plumbing and
> assigning an IB address are not supported on these links. In the future,
> these physical datalinks can be used for 1) implementing port-level
> statistics, 2) implementing port-level snoop, etc.
>
> Example 1. # dladm show-phys
> LINK     MEDIA        STATE      SPEED     DUPLEX     DEVICE
> ibp0    InfiniBand     up        8000      unknown    ibp0
> ibp1    InfiniBand    down       8000      unknown    ibp1
> ibp2    InfiniBand    down       2000      unknown    ibp2
> ibp3    InfiniBand     up        2000      unknown    ibp3
>
> The state of the physical link directly corresponds to the state of the IB
> HCA port. As you might expect, other generic sub-commands such as
> rename-link, show-link, delete-phys, etc. also work on IB datalinks.
>
> 4.3.2 IB partition Objects
> --------------------------
>
> IB partition objects represent a new "part" class of datalink, and these
> objects are managed using the new dladm(1M) sub-commands. All the new
> sub-command interfaces are similar to the VLAN/VNIC dladm(1M) sub-commands.
> IB partition datalinks can be created on top of IB physical links, one
> per P_Key on the port. These links are used for data transfers.
> The updated dladm(1M) man page describes the different commands available
> for managing the IB partition links.
>
> 4.3.2.1 create-part
>
>        Creates a new IB partition Object with the specified datalink name.
>
>        Example 2: Create an IB partition link for the P_Key 0x8001 on
>                   top of the ibp0 physical datalink
>
>        # dladm create-part -l ibp0 -P 0x8001 p8001.ibp0
>
>        The above command succeeds if the port is "up", the P_Key is present
>        on the port, and IPoIB successfully completes its initialization. On
>        subsequent reboots after successful initial creation, the partition
>        will be available even if the port is "down" or the P_Key is absent,
>        though the datalink state will be marked as "down".
>
>        Example 3:  Create an IB partition link for the P_Key 0x9000 on the
>                    top of ibp2
>
>        # dladm create-part -f  -l ibp2 -P 0x9000 p9000.ibp2
>
>        The force option "-f" allows the IB partition to be created even
>        when the P_Key is not present or the port is down. The link state
>        will be marked as "down", and will be updated to "up" when the P_Key
>        is added to the port and the port is activated.
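
The creation rules from Examples 2 and 3 can be sketched as a small decision function. This is only an illustration in Python, not the libdladm implementation: the function name and its boolean parameters are hypothetical, the pairing of each failure with a status name from the interface table in 4.3.6 is inferred from the names alone, and the order in which the checks run is an assumption. It also assumes IPoIB initialization itself succeeds.

```python
def create_part_status(port_up, pkey_present, force=False):
    """Model the outcome of a create-part request (hypothetical sketch).

    Returns the name of a status code. The mapping of conditions to
    DLADM_STATUS_* names is an assumption based on the names in 4.3.6.
    """
    if force:
        # With -f the partition is created regardless; its link state
        # is simply marked "down" until the port and P_Key are available.
        return "DLADM_STATUS_OK"
    if not port_up:
        # Assumed check order: port state first, then the P_Key table.
        return "DLADM_STATUS_PORT_IS_DOWN"
    if not pkey_present:
        return "DLADM_STATUS_PKEY_NOT_PRESENT"
    return "DLADM_STATUS_OK"
```

For instance, `create_part_status(False, True)` models Example 2 failing on a down port, while adding `force=True` models the `-f` case in Example 3.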
>
>        Example 4: Plumb and assign an IP address to IB partition p9000.ibp2
>
>        # ifconfig p9000.ibp2 plumb 1.1.1.1 up
>
>        # ifconfig -a
>
>        p9000.ibp2: flags=1000843<UP,BROADCAST,RUNNING,
>             MULTICAST,IPv4>  mtu 2044 index 3
>          inet 1.1.1.1 netmask ff000000 broadcast 1.255.255.255
>
>       Example 5: Display the partition links using show-link command
>
>       # dladm show-link
>       LINK        CLASS     MTU    STATE    BRIDGE       OVER
>       p8001.ibp0  part      65520  unknown    --         ibp0
>       p9000.ibp2  part      65520  down       --         ibp2
>
> 4.3.2.2 delete-part
>
>        Deletes a specified IB partition object
>
>        Example 6: Delete the partition p8001.ibp0
>
>        # dladm  delete-part      p8001.ibp0
>
>        The above command deletes the partition.
>
>        Example 7: Show the partition link information after
>                   deleting p8001.ibp0
>
>        # dladm show-part
>
>        LINK        PKEY      OVER     STATE      FLAGS
>       p9000.ibp2   9000      ibp2      down       f---
>
> 4.3.2.3 show-part
>
>       Displays IB partition object information
>
>       Example 8: Display the IB partition links information (below output
>                  is for the "part" operations in Examples 2, 3 & 4)
>
>       # dladm show-part
>
>       LINK           PKEY      OVER       STATE      FLAGS
>     p8001.ibp0       8001      ibp0      unknown     ----
>     p9000.ibp2       9000      ibp2      down        f---
>
>     The state of the IB partition link will be "unknown" after the IB
>     partition is created and before it is plumbed. Once the partition is
>     plumbed, the link state will be set to "up" when the link is ready to
>     use. The state of the link will be set to "down" if 1) the HCA port is
>     down, 2) the P_Key is absent, or 3) the broadcast group is absent.
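
The state rules above can be written down as a short function. This is a sketch under stated assumptions rather than the driver's actual logic: the function and parameter names are hypothetical, and the precedence between "down" and "unknown" for an unplumbed partition is a guess based on the show-part examples (the text does not say which wins).

```python
def part_link_state(plumbed, port_up, pkey_present, bcast_group_present):
    """Derive the reported state of an IB partition link (hypothetical).

    Assumption: a down port or missing P_Key is reported as "down" even
    before plumbing; the broadcast group is only considered once plumbed.
    """
    if not port_up or not pkey_present:
        return "down"
    if not plumbed:
        # Created but not yet plumbed.
        return "unknown"
    if not bcast_group_present:
        return "down"
    # Plumbed, and every prerequisite is in place: ready to use.
    return "up"
```

With these assumptions, p8001.ibp0 in Example 8 corresponds to `part_link_state(False, True, True, True)`, which yields "unknown".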
>
> 4.3.2.4 show-ib
>
>      Displays IB specific information such as port#, port guid, etc.
>
>      Example 9: Show IB specific information
>
>      # dladm show-ib
>      LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
>      ibp0         3BA000100CD7C   3BA000100CD7D   1    down   FFFF
>      ibp1         3BA000100CD7C   3BA000100CD7E   2    down   FFFF
>      ibp3         5AD0000033634   5AD0000033636   2    up     FFFF,8001
>      ibp2         5AD0000033634   5AD0000033635   1    up     FFFF,8001
>
>      The show-ib command displays only the physical links, with the port
>      GUID, port#, HCA GUID, and P_Keys present on the port at the time of
>      running the command.
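
As a rough illustration of how the show-ib output could replace the cfgadm and 'ls -l' steps described in 4.2.2, the sketch below parses the human-readable table from Example 9 and lists the links whose port is up and carries a given P_Key. The helper names are hypothetical, the column layout is taken from Example 9 only, and whether show-ib offers a parseable output mode is not stated here, so plain whitespace splitting is assumed.

```python
# Sample text copied from Example 9; real output may differ.
SHOW_IB_SAMPLE = """\
LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
ibp0         3BA000100CD7C   3BA000100CD7D   1    down   FFFF
ibp1         3BA000100CD7C   3BA000100CD7E   2    down   FFFF
ibp3         5AD0000033634   5AD0000033636   2    up     FFFF,8001
ibp2         5AD0000033634   5AD0000033635   1    up     FFFF,8001
"""


def parse_show_ib(text):
    """Turn the show-ib table into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        # PKEYS is a comma-separated list of hex values without a 0x prefix.
        row["PKEYS"] = [int(k, 16) for k in row["PKEYS"].split(",")]
        rows.append(row)
    return rows


def up_links_with_pkey(rows, pkey):
    """Return the names of links whose port is up and has the given P_Key."""
    return [r["LINK"] for r in rows
            if r["STATE"] == "up" and pkey in r["PKEYS"]]
```

Calling `up_links_with_pkey(parse_show_ib(SHOW_IB_SAMPLE), 0x8001)` selects the two ports on the second HCA, which are the candidates for `create-part -P 0x8001` without `-f`.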
>
> 4.3.3 IB partition object Administration Library
>
> The  libdladm(3LIB)  library  is  currently  used  to  implement  datalink
> administration  for  all  the  GLDv3  datalinks  (VNIC, link  aggregation,
> wireless, IP Tunnel, etc.). The library will be further enhanced to provide
> administrative  interfaces of  InfiniBand  partition  objects. All the new
> library  extensions  are  similar  to  VLAN/VNIC  libdladm  extensions. IB
> specific functionality will be implemented by libdlib.c  and its interface
> is provided via  libdlib.h.  dladm(1M) will use this  library for managing
> IB   partitions.  The  IB  partition   administration  library  provides a
> persistent repository which allows IB partition configuration to be stored
> across  reboots.  It uses the existing dladm(1M)  /etc/dladm/datalink.conf
> repository to store the IB partition configuration.
>
> The list of InfiniBand specific extensions to libdladm library is given below.
> For  more  details  of  each  API,  see  man  pages in the  PSARC materials
> directory. The man pages are only for PSARC review (not intended for public
> use).
>
> 4.3.3.1 dladm_part_create()
>      Creates an IB partition object
>
> 4.3.3.2 dladm_part_delete()
>      Deletes an IB partition object
>
> 4.3.3.3 dladm_part_info()
>      Returns IB partition Object attributes
>
> 4.3.3.4 dladm_ib_info()
>      Returns IB specific attributes such as port number, port guid,
>      and HCA GUID.
>
> 4.3.3.5 dladm_part_up()
>      Brings up one or all the IB partition objects during every boot.
>
> 4.3.4 IBTF Extensions
> ---------------------
>
> Some of the IB ULPs, such as SDP, IBCM, etc., walk the device tree and read
> the "port-pkey", "hca-guid", and "port-guid" properties from the ibd(7D)
> device instance. With the new model, IB partitions are just objects (no
> longer device nodes), so IB ULPs can no longer retrieve partition
> attributes from the device node. The following new IBTF extensions provide
> a mechanism to retrieve the partition attributes in the kernel. See the
> ibt_get_part_attrs() man page in the PSARC case directory for more details
> about the following API. The man pages are only for PSARC review (not
> intended for public use).
>
> 4.3.4.1 ibt_get_part_attrs()
>        Returns the attributes of a requested IB partition
>
> 4.3.4.2 ibt_get_all_part_attrs()
>        Returns the attributes of all the active IB partitions
>
> 4.3.4.3 ibt_free_part_attrs()
>        Frees the memory for partition attribute structure allocated by
>        ibt_get_all_part_attrs()
>
> 4.3.5 Man page changes
> ----------------------
>
>      Updated man pages - published
>           dladm(1M)
>           datadm(1M)
>           dat.conf(4)
>           ibp(7D)  (ibd(7D) name changed)
>
>      New Man pages - internal only
>           dladm_part_create(3dladm)
>           dladm_part_delete(3dladm)
>           dladm_part_info(3dladm)
>           dladm_ib_info(3dladm)
>           dladm_part_up(3dladm)
>
>           ibt_get_part_attrs(9f)
>
> 4.3.6 Interface table
> ---------------------
>
>   -----------------------------------------------------------------------
> |    Interface name                      |  Commitment Level            |
>   -----------------------------------------------------------------------
> |  dladm(1M)  extensions                                                |
>   -----------------------------------------------------------------------
> |      create-part                       |                              |
> |      delete-part                       | committed                    |
> |      show-ib                           |                              |
> |      show-part                         |                              |
>   -----------------------------------------------------------------------
> |  libdladm extensions   (libdlib.h)                                    |
>   -----------------------------------------------------------------------
> |     dladm_part_create                  |                              |
> |     dladm_part_delete                  |                              |
> |     dladm_part_show                    |                              |
> |     dladm_show_ib                      | ON Consolidation Private     |
> |     dladm_part_up                      |                              |
> |     dladm_part_attr_t                  |                              |
> |     dladm_ib_attr_t                    |                              |
> |     DLADM_IBPART_FORCE_CREATE          |                              |
>   -----------------------------------------------------------------------
> |  InfiniBand Specific Link Properties                                  |
>   -----------------------------------------------------------------------
> |     linkmode                           | ON Consolidation Private     |
>   -----------------------------------------------------------------------
> |  libdlmgmt.h                                                          |
>   -----------------------------------------------------------------------
> |     DATALINK_CLASS_IBPART              | ON Consolidation Private     |
>   -----------------------------------------------------------------------
> |  libdladm.h                                                           |
>   -----------------------------------------------------------------------
> |     DLADM_STATUS_INVALID_PORT_INSTANCE |                              |
> |     DLADM_STATUS_PORT_IS_DOWN          |                              |
> |     DLADM_STATUS_PKEY_NOT_PRESENT      |                              |
> |     DLADM_STATUS_PARTITION_EXISTS      | ON Consolidation Private     |
> |     DLADM_STATUS_INVALID_PKEY          |                              |
> |     DLADM_STATUS_NO_HW_RESOURCE        |                              |
> |     DLADM_STATUS_INVALID_PKEY_TBL_SIZE |                              |
>   -----------------------------------------------------------------------
> |  IBTF extensions                                                      |
>   -----------------------------------------------------------------------
> |      ibt_get_part_attrs()              |                              |
> |      ibt_get_all_part_attrs()          | ON Consolidation Private     |
> |      ibt_free_part_attrs()             |                              |
> |      ibt_part_attr_t                   |                              |
>   -----------------------------------------------------------------------
> |   New status codes (ibt_status_t updates)                             |
>   -----------------------------------------------------------------------
> |      IBT_NO_SUCH_OBJECT                | ON Consolidation Private     |
>   -----------------------------------------------------------------------
>
> 4.3.7 References
> ---------------------
>
> PSARC 2001/289 IP over InfiniBand
> PSARC 2007/636 IPoIB Conversion to GLDv3
> PSARC 2009/593 IPoIB Connected Mode
>
> 6. Resources and Schedule
>      6.4. Steering Committee requested information
>       6.4.1. Consolidation C-team Name:
>               ON
>      6.5. ARC review type: FastTrack
>      6.6. ARC Exposure: open
>
>    
