Template Version: @(#)sac_nextcase 1.69 02/15/10 SMI
This information is Copyright 2010 Sun Microsystems

1. Introduction
    1.1. Project/Component Working Name:
         IPoIB Administration Enhancement
    1.2. Name of Document Author/Supplier:
         Author:  Sudhakar Dindukurti
    1.3  Date of This Document:
         10 March, 2010

4. Technical Description

4.1 Acronyms

    HCA   : Host Channel Adapter
    P_Key : Partition Key
    IPoIB : IP over InfiniBand
    GUID  : Globally Unique Identifier
    IBTF  : InfiniBand Transport Framework
    IBCM  : InfiniBand Communication Manager
    SDP   : Sockets Direct Protocol
    IBD   : Current IPoIB Solaris driver name
    ULP   : Upper Level Protocol

4.2 Requirements/Motivation
---------------------------

4.2.1 Consistent IBD device node naming across nodes in a cluster
      (Amber Road Requirement (CR 6864899))

    The Amber Road clustering software requires that the datalink name
    for a specific partition be the same on all nodes with an identical
    hardware configuration. For example, if the datalink name on node 1
    for P_Key 0x8001 on HCA 1, port 1 is ibnet0, then the clustering
    software expects the datalink name for P_Key 0x8001 on HCA 1, port 1
    to be ibnet0 on node 2 of the cluster as well.

    This is very difficult to achieve today. The IPoIB device name is
    constructed automatically by the IPoIB driver by appending the
    instance number to the driver name (e.g. ibd0, ibd1, etc.), so the
    clustering software has no control over the IPoIB device name space.
    Also, the IBTF does not guarantee the same device names across
    multiple nodes.

4.2.2 Problem diagnosis issue

    IPoIB (PSARC 2001/289, PSARC 2009/593 and PSARC 2007/636) is
    implemented as the ibd(7D) driver in Solaris. When customers have
    problems bringing up IBD devices on Solaris, they are advised to take
    a number of steps to gather information about the IPoIB link, which
    is a usability problem. Today, users need to run cfgadm(1M) to obtain
    the port GUID information, run 'ls -l' on the /dev/ibd* nodes to
    determine the IB partition that an IBD device belongs to, and so on.
    There are also instances when it is necessary to obtain the P_Key
    corresponding to the IBD device and the HCA port that it is bound to.
    So, we want to improve the process of gathering IPoIB related
    configuration information.

4.2.3 Update IBD driver to use Brussels framework (CR 6883212)

    IPoIB tunables are managed today through the /etc/system file or
    ibd.conf. For example, to modify the 'linkmode' of an IPoIB link, one
    needs to edit ibd.conf and reboot the system. We want to replace this
    interface so that users can manage these tunables using dladm(1M).

4.3 Proposal

    A micro/patch binding is asserted for this proposal.

4.3.0 IPoIB administration with dladm(1M)
----------------------------------------

    This case proposes a new IPoIB administration mechanism for
    InfiniBand network datalinks using the dladm(1M) command. It also
    proposes to add a consolidation private, InfiniBand specific library
    API to libdladm(3LIB). The mechanism is very similar to the existing
    dladm(1M) sub-commands for VLAN, VNIC, etc. administration.

    In the new model, two classes of IPoIB datalinks will exist:

    1. Datalinks representing the physical IB ports, which will use the
       existing "phys" class that is used for Ethernet, WiFi, and other
       physical media. As with all "phys" class objects, the system will
       create these automatically.

    2. Datalinks representing the administratively created partitions
       over "phys" IPoIB objects, which will use a new "part" class. Each
       IB partition datalink will be associated with a P_Key in a manner
       analogous to the way each Ethernet VLAN datalink is associated
       with a VLAN ID.

    Note that unlike other "phys" class objects, IB "phys" objects cannot
    send data (since IB requires a P_Key to send data) and thus cannot be
    plumbed.

    Example configuration used to explain the new IPoIB administration
    model

       --------------------------------------
       |             IB switch              |
       --------------------------------------
                  |              |
                  |              |
                  |              |
        Port1 |   | Port2  Port1 |   | Port2
       --------------------------------------
       |      HCA1              HCA2        |
       |               Node 1               |
       --------------------------------------

    Port 2 of HCA1 and Port 1 of HCA2 are connected to the IB switch, and
    the SM is running on the switch. Each port is configured with two
    P_Keys (0xffff & 0x8001).

4.3.1 Physical datalinks
-------------------------

    One physical datalink will be created by default per port per HCA.
    These physical links serve as administrative and observability data
    points. IB partitions can be created over these IB physical
    datalinks, similar to creating VNICs on Ethernet physical links or
    link aggregations. IB physical datalinks are not used for data
    transfers, so plumbing and assigning an IP address are not supported
    on these links. In the future, these physical datalinks can be used
    for
    1) implementing port level statistics
    2) implementing port level snoop, etc.

    Example 1.

    # dladm show-phys
    LINK   MEDIA        STATE   SPEED   DUPLEX    DEVICE
    ibp0   InfiniBand   up      8000    unknown   ibp0
    ibp1   InfiniBand   down    8000    unknown   ibp1
    ibp2   InfiniBand   down    2000    unknown   ibp2
    ibp3   InfiniBand   up      2000    unknown   ibp3

    The state of the physical link corresponds directly to the state of
    the IB HCA port.

    As you might expect, other generic sub-commands such as rename-link,
    show-link, delete-phys, etc. also work on IB datalinks.

4.3.2 IB partition Objects
--------------------------

    IB partition objects represent a new "part" class of datalink, and
    these objects are managed using the new dladm(1M) sub-commands. All
    the new sub-command interfaces are similar to the VLAN/VNIC dladm(1M)
    sub-commands. IB partition datalinks can be created on top of IB
    physical links, one per P_Key on the port. These links are used for
    data transfers. The updated dladm(1M) man page describes the
    different commands available for managing the IB partition links.

4.3.2.1 create-part

    Creates a new IB partition object with the specified datalink name.

    Example 2: Create an IB partition link for the P_Key 0x8001 on top of
    the ibp0 physical datalink

    # dladm create-part -l ibp0 -P 0x8001 p8001.ibp0

    The above command succeeds if the port is "up", the P_Key is present
    on the port and IPoIB successfully completes its initialization.
    On subsequent reboots after successful initial creation, the
    partition will be available even if the port is "down" or the P_Key
    is absent, though the datalink state will be marked as "down".

    Example 3: Create an IB partition link for the P_Key 0x9000 on top of
    ibp2

    # dladm create-part -f -l ibp2 -P 0x9000 p9000.ibp2

    The force option "-f" allows the IB partition to be created even when
    the P_Key is not present or the port is down. The link state will be
    marked as "down". It will be updated to "up" when the P_Key is added
    to the port and the port is activated.

    Example 4: Plumb and assign an IP address to IB partition p9000.ibp2

    # ifconfig p9000.ibp2 plumb 1.1.1.1 up
    # ifconfig -a
    p9000.ibp2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 2044 index 3
            inet 1.1.1.1 netmask ff000000 broadcast 1.255.255.255

    Example 5: Display the partition links using the show-link command

    # dladm show-link
    LINK         CLASS   MTU     STATE     BRIDGE   OVER
    p8001.ibp0   part    65520   unknown   --       ibp0
    p9000.ibp2   part    65520   down      --       ibp2

4.3.2.2 delete-part

    Deletes the specified IB partition object.

    Example 6: Delete the partition p8001.ibp0

    # dladm delete-part p8001.ibp0

    The above command deletes the partition.

    Example 7: Show the partition link information after deleting
    p8001.ibp0

    # dladm show-part
    LINK         PKEY   OVER   STATE   FLAGS
    p9000.ibp2   9000   ibp2   down    f---

4.3.2.3 show-part

    Displays IB partition object information.

    Example 8: Display the IB partition link information (the output
    below is for the "part" operations in Examples 2, 3 & 4)

    # dladm show-part
    LINK         PKEY   OVER   STATE     FLAGS
    p8001.ibp0   8001   ibp0   unknown   ----
    p9000.ibp2   9000   ibp2   down      f---

    The state of the IB partition link will be "unknown" after the IB
    partition is created and before it is plumbed. Once the partition is
    plumbed, the link state will be set to "up" when the link is ready to
    use. The state of the link will be set to "down" if
    1) the HCA port is down
    2) the P_Key is absent or
    3) the broadcast group is absent.
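
    The state rules above can be summarized as a small decision function.
    The following Python sketch is illustrative only and is not part of
    the proposed interfaces: the function name and its boolean inputs are
    hypothetical, and the order in which the conditions are checked is an
    assumption chosen to agree with Example 8, where the force-created
    partition over a "down" port is reported as "down".

```python
def part_link_state(plumbed, port_up, pkey_present, bcast_group_present):
    """Model of the IB partition link state reported by show-part.

    All four parameters are hypothetical booleans used for illustration.
    """
    # Any one of the three listed conditions forces the state to "down".
    # Checking these before the plumb state is an assumption of this sketch.
    if not (port_up and pkey_present and bcast_group_present):
        return "down"
    # Created but not yet plumbed: the state is reported as "unknown".
    if not plumbed:
        return "unknown"
    # Plumbed and nothing missing: treated here as "ready to use".
    return "up"


# Partition created over an "up" port that has not been plumbed yet.
print(part_link_state(plumbed=False, port_up=True,
                      pkey_present=True, bcast_group_present=True))

# Partition force-created over a port that is down.
print(part_link_state(plumbed=True, port_up=False,
                      pkey_present=False, bcast_group_present=False))
```

    In this model "ready to use" is reduced to the three listed
    conditions all being satisfied; the real driver may take other
    factors into account.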

4.3.2.4 show-ib

    Displays IB specific information such as port number, port GUID, etc.

    Example 9: Show IB specific information

    # dladm show-ib
    LINK   HCAGUID         PORTGUID        PORT   STATE   PKEYS
    ibp0   3BA000100CD7C   3BA000100CD7D   1      down    FFFF
    ibp1   3BA000100CD7C   3BA000100CD7E   2      down    FFFF
    ibp3   5AD0000033634   5AD0000033636   2      up      FFFF,8001
    ibp2   5AD0000033634   5AD0000033635   1      up      FFFF,8001

    The show-ib command displays physical links only. For each link it
    shows the HCA GUID, port GUID, port number, port state and the P_Keys
    present on the port at the time the command is run.

4.3.3 IB partition object Administration Library

    The libdladm(3LIB) library is currently used to implement datalink
    administration for all the GLDv3 datalinks (VNIC, link aggregation,
    wireless, IP Tunnel, etc.). The library will be further enhanced to
    provide administrative interfaces for InfiniBand partition objects.
    All the new library extensions are similar to the VLAN/VNIC libdladm
    extensions. IB specific functionality will be implemented in
    libdlib.c and its interface is provided via libdlib.h. dladm(1M) will
    use this library for managing IB partitions.

    The IB partition administration library provides a persistent
    repository which allows the IB partition configuration to be
    preserved across reboots. It uses the existing dladm(1M)
    /etc/dladm/datalink.conf repository to store the IB partition
    configuration.

    The list of InfiniBand specific extensions to the libdladm library is
    given below. For more details of each API, see the man pages in the
    PSARC materials directory. The man pages are only for PSARC review
    (not intended for public use).

    4.3.3.1 dladm_part_create()
            Creates an IB partition object

    4.3.3.2 dladm_part_delete()
            Deletes an IB partition object

    4.3.3.3 dladm_part_info()
            Returns IB partition object attributes

    4.3.3.4 dladm_ib_info()
            Returns IB specific attributes such as port number, port
            GUID, and HCA GUID.

    4.3.3.5 dladm_part_up()
            Brings up one or all of the IB partition objects during
            every boot.
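
    To illustrate how the show-ib output in Example 9 relates to the
    create-part rules in 4.3.2.1, the following Python sketch parses the
    sample output and reports whether a given P_Key on a given link
    would be expected to need the "-f" option. It is a minimal sketch,
    not an implementation of dladm(1M): the helper names are
    hypothetical, the column layout is assumed to be exactly that of
    Example 9, and the third success condition (IPoIB completing its
    initialization) cannot be checked from this output.

```python
# Sample text copied from Example 9 (column layout assumed to be stable).
SHOW_IB_SAMPLE = """\
LINK   HCAGUID         PORTGUID        PORT   STATE   PKEYS
ibp0   3BA000100CD7C   3BA000100CD7D   1      down    FFFF
ibp1   3BA000100CD7C   3BA000100CD7E   2      down    FFFF
ibp3   5AD0000033634   5AD0000033636   2      up      FFFF,8001
ibp2   5AD0000033634   5AD0000033635   1      up      FFFF,8001
"""


def parse_show_ib(text):
    """Parse show-ib style output into a dict keyed by link name."""
    ports = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        link, hca_guid, port_guid, port, state, pkeys = line.split()
        ports[link] = {
            "hca_guid": hca_guid,
            "port_guid": port_guid,
            "port": int(port),
            "state": state,
            # P_Keys are printed as comma-separated hexadecimal values.
            "pkeys": {int(p, 16) for p in pkeys.split(",")},
        }
    return ports


def needs_force(ports, link, pkey):
    """True if the port is not "up" or the P_Key is absent on the port."""
    info = ports[link]
    return info["state"] != "up" or pkey not in info["pkeys"]


ports = parse_show_ib(SHOW_IB_SAMPLE)
print(needs_force(ports, "ibp3", 0x8001))  # port up, P_Key present
print(needs_force(ports, "ibp2", 0x9000))  # P_Key not on the port
print(needs_force(ports, "ibp0", 0xFFFF))  # port is down
```

    A script of this kind only reflects the port at the time show-ib was
    run; the port state or P_Key table may change before create-part is
    issued.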

4.3.4 IBTF Extensions
---------------------

    Some of the IB ULPs, such as SDP, IBCM, etc., walk the device tree
    and read the "port-pkey", "hca-guid", and "port-guid" properties from
    the ibd(7D) device instance. With the new model, IB partitions are
    just objects (no longer device nodes), so IB ULPs can no longer
    retrieve partition attributes from the device node. The following new
    IBTF extensions provide a mechanism to retrieve the partition
    attributes in the kernel.

    See the ibt_get_part_attrs() man page in the PSARC case directory for
    more details about the following API. The man pages are only for
    PSARC review (not intended for public use).

    4.3.4.1 ibt_get_part_attrs()
            Returns the attributes of a requested IB partition

    4.3.4.2 ibt_get_all_part_attrs()
            Returns the attributes of all the active IB partitions

    4.3.4.3 ibt_free_part_attrs()
            Frees the memory for the partition attribute structure
            allocated by ibt_get_all_part_attrs()

4.3.5 Man page changes
----------------------

    Updated man pages - published
        dladm(1M)
        datadm(1M)
        dat.conf(4)
        ibp(7D) (ibd(7D) name changed)

    New man pages - internal only
        dladm_part_create(3dladm)
        dladm_part_delete(3dladm)
        dladm_part_info(3dladm)
        dladm_ib_info(3dladm)
        dladm_part_up(3dladm)
        ibt_get_part_attrs(9F)

4.3.6 Interface table
---------------------

-----------------------------------------------------------------------
| Interface name                         | Commitment Level           |
-----------------------------------------------------------------------
| dladm(1M) extensions                                                |
-----------------------------------------------------------------------
| create-part                            |                            |
| delete-part                            | Committed                  |
| show-ib                                |                            |
| show-part                              |                            |
-----------------------------------------------------------------------
| libdladm extensions (libdlib.h)                                     |
-----------------------------------------------------------------------
| dladm_part_create                      |                            |
| dladm_part_delete                      |                            |
| dladm_part_info                        |                            |
| dladm_ib_info                          | ON Consolidation Private   |
| dladm_part_up                          |                            |
| dladm_part_attr_t                      |                            |
| dladm_ib_attr_t                        |                            |
| DLADM_IBPART_FORCE_CREATE              |                            |
-----------------------------------------------------------------------
| InfiniBand Specific Link Properties                                 |
-----------------------------------------------------------------------
| linkmode                               | ON Consolidation Private   |
-----------------------------------------------------------------------
| libdlmgmt.h                                                         |
-----------------------------------------------------------------------
| DATALINK_CLASS_IBPART                  | ON Consolidation Private   |
-----------------------------------------------------------------------
| libdladm.h                                                          |
-----------------------------------------------------------------------
| DLADM_STATUS_INVALID_PORT_INSTANCE     |                            |
| DLADM_STATUS_PORT_IS_DOWN              |                            |
| DLADM_STATUS_PKEY_NOT_PRESENT          |                            |
| DLADM_STATUS_PARTITION_EXISTS          | ON Consolidation Private   |
| DLADM_STATUS_INVALID_PKEY              |                            |
| DLADM_STATUS_NO_HW_RESOURCE            |                            |
| DLADM_STATUS_INVALID_PKEY_TBL_SIZE     |                            |
-----------------------------------------------------------------------
| IBTF extensions                                                     |
-----------------------------------------------------------------------
| ibt_get_part_attrs()                   |                            |
| ibt_get_all_part_attrs()               | ON Consolidation Private   |
| ibt_free_part_attrs()                  |                            |
| ibt_part_attr_t                        |                            |
-----------------------------------------------------------------------
| New status codes (ibt_status_t updates)                             |
-----------------------------------------------------------------------
| IBT_NO_SUCH_OBJECT                     | ON Consolidation Private   |
-----------------------------------------------------------------------

4.3.7 References
---------------------

    PSARC 2001/289 IP over InfiniBand
    PSARC 2007/636 IPoIB Conversion to GLDv3
    PSARC 2009/593 IPoIB Connected Mode

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open