Okay, here are some questions; some of these stem from my lack of IPoIB knowledge, so I hope you'll pardon me if I ask things that seem obvious or stupid.
1) Will the new "phys" objects being created here be "snoop"-able? (I understand that they can't transmit data, but can they operate in some sort of promiscuous receive mode?) (Oh wait, it seems that this might be answered in part by 4.3.1. It looks like the answer is "not yet, but in the future". Correct?)

2) From my experience with "hermon", hermon seems mostly to treat both ports as a single registered entity to the IB framework. I'm presuming that this won't preclude IPoIB from being able to identify which port is which? Are there any special considerations here for automatic path migration?

3) Are there any constraints on the format of the name used for "part-link" in dladm? The examples seem to show a specific format, but can customers choose any name they want?

4) The way this is designed seems to depend on a notion of a physical port. Does IPoIB have anything that is morally equivalent to Ethernet aggregations (trunking)? I'm thinking also about link redundancy as well as bandwidth multiplication.

- Garrett

On 03/10/10 03:52 PM, Ted Kim wrote:
> Template Version: @(#)sac_nextcase 1.69 02/15/10 SMI
> This information is Copyright 2010 Sun Microsystems
> 1. Introduction
>     1.1. Project/Component Working Name:
>          IPoIB Administration Enhancement
>     1.2. Name of Document Author/Supplier:
>          Author: Sudhakar Dindukurti
>     1.3  Date of This Document:
>          10 March, 2010
> 4. Technical Description
>
> 4.1 Acronyms
>
>     HCA   : Host Channel Adapter
>     P_Key : Partition Key
>     IPoIB : IP over InfiniBand
>     GUID  : Globally Unique Identifier
>     IBTF  : InfiniBand Transport Framework
>     IBCM  : InfiniBand Communication Manager
>     SDP   : Sockets Direct Protocol
>     IBD   : current IPoIB Solaris driver name
>     ULP   : Upper Level Protocol
>
> 4.2 Requirements/Motivation
> ---------------------------
>
> 4.2.1 Consistent IBD device node naming across nodes in a cluster
>
> (Amber Road Requirement (CR 6864899))
>
> AmberRoad clustering software requires the datalink name for a specific
> partition on multiple nodes (identical h/w configuration) to be the same.
> For example, if the datalink name on node 1 for P_Key 0x8001 on HCA 1 and
> port 1 is ibnet0, then the clustering software expects that the datalink
> name for P_Key 0x8001 on HCA 1 and port 1 is also ibnet0 on node 2 in the
> cluster. This is very difficult to achieve today. The problem is that the
> IPoIB device name is constructed automatically by the IPoIB driver by
> appending the instance number to the driver name (e.g. ibd0, ibd1, etc).
> So, the clustering software does not have any control over the IPoIB
> device name space. Also, the IBTF framework does not guarantee the same
> names across multiple nodes.
>
> 4.2.2 Problem diagnosis issue
>
> IPoIB (PSARC 2001/289, PSARC 2009/593 and PSARC 2007/636) is implemented
> as the ibd(7D) driver in Solaris. When customers face problems in
> bringing up the IBD devices on Solaris, they are advised to take a number
> of steps to gather information about the IPoIB link, which results in
> usability issues. Today, users need to run cfgadm(1M) to obtain the port
> GUID information, do 'ls -l' on the /dev/ibd* nodes to determine the IB
> partition that an IBD device belongs to, etc. There are also instances
> when it is necessary to obtain the P_Key corresponding to the IBD device
> and the HCA port that it is bound to.
> So, we want to improve the process of
> gathering IPoIB related configuration information.
>
> 4.2.3 Update IBD driver to use Brussels framework (CR 6883212)
>
> IPoIB tunables are managed today through the /etc/system file or ibd.conf.
> For example, to modify the 'linkmode' of the IPoIB link, one needs to edit
> ibd.conf and reboot the system. We want to replace this interface so that
> users can manage these tunables using dladm(1M).
>
> 4.3 Proposal
>
> A micro/patch binding is asserted for this proposal.
>
> 4.3.0 IPoIB administration with dladm(1M)
> ----------------------------------------
>
> This case proposes a new IPoIB administration mechanism for InfiniBand
> network datalinks using the dladm(1M) command. Also, it proposes to add
> a consolidation private InfiniBand specific library API to libdladm(3LIB).
> The mechanism is very similar to the existing VLAN, VNIC, etc.
> administration dladm(1M) sub-commands.
>
> In the new model, two classes of IPoIB datalinks will exist:
>
>   1. Datalinks representing the physical IB ports, which will use the
>      existing "phys" class that is used for Ethernet, WiFi, and other
>      physical media. As with all "phys" class objects, the system
>      will create these automatically.
>
>   2. Datalinks representing the administratively created partitions
>      over "phys" IPoIB objects, which will use a new "part" class.
>      Each IB partition datalink will be associated with a P_Key in a
>      manner analogous to the way each Ethernet VLAN datalink is
>      associated with a VLAN ID.
>
> Note that unlike other "phys" class objects, IB "phys" objects cannot
> send data (since IB requires a P_Key to send data) and thus cannot be
> plumbed.
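
[Illustration of the two-class model described above. This is only a sketch: the class names PhysLink and PartLink, the can_plumb property, and the sample link names are hypothetical and are not part of the proposal or of libdladm. The 16-bit range check reflects the size of an InfiniBand P_Key.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysLink:
    """Datalink for a physical IB port ("phys" class), created by the system."""
    name: str

    @property
    def can_plumb(self) -> bool:
        # IB requires a P_Key to send data, so a "phys" link is never plumbed.
        return False


@dataclass(frozen=True)
class PartLink:
    """Administratively created partition ("part" class) over a phys link."""
    name: str
    over: PhysLink
    pkey: int  # plays the role a VLAN ID plays for an Ethernet VLAN datalink

    def __post_init__(self) -> None:
        if not 0 <= self.pkey <= 0xFFFF:
            raise ValueError(f"P_Key out of 16-bit range: {self.pkey:#x}")

    @property
    def can_plumb(self) -> bool:
        return True


# Hypothetical sample objects, named in the style of the proposal's examples.
ibp0 = PhysLink("ibp0")
part = PartLink("p8001.ibp0", over=ibp0, pkey=0x8001)
print(part.over.name, hex(part.pkey), ibp0.can_plumb, part.can_plumb)
```
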
>
> Example configuration used to explain the new IPoIB administration model
>
>     --------------------------------------
>     |             IB switch              |
>     --------------------------------------
>                |              |
>                |              |
>                |              |
>      Port1 |   | Port2  Port1 |   | Port2
>     --------------------------------------
>     |       HCA1               HCA2      |
>     |               Node 1               |
>     --------------------------------------
>
> Port 2 of HCA1 and Port 1 of HCA2 are connected to the IB switch, and
> the SM is running on the switch. Each port is configured with two
> P_Keys (0xffff & 0x8001).
>
>
> 4.3.1 Physical datalinks
> -------------------------
>
> One physical datalink will be created by default per port per HCA. These
> physical links serve as administrative & observability data points. These
> IB physical datalinks allow creating IB partitions over them, similar
> to creating VNICs on Ethernet physical links or link aggregations. IB
> physical datalinks are not used for data transfers. So, plumbing and
> assigning an IB address are not supported on these links. In future, these
> physical datalinks can be used for 1) implementing port level
> statistics 2) implementing port level snoop, etc.
>
> Example 1. # dladm show-phys
>     LINK  MEDIA       STATE  SPEED  DUPLEX   DEVICE
>     ibp0  InfiniBand  up     8000   unknown  ibp0
>     ibp1  InfiniBand  down   8000   unknown  ibp1
>     ibp2  InfiniBand  down   2000   unknown  ibp2
>     ibp3  InfiniBand  up     2000   unknown  ibp3
>
> The state of the physical link directly corresponds to the state of the
> IB HCA port. As you might expect, other generic sub-commands such as
> rename-link, show-link, delete-phys, etc. also work on IB datalinks.
>
> 4.3.2 IB partition Objects
> --------------------------
>
> IB partition objects represent a new "part" class of datalink and these
> objects are managed using the new dladm(1M) sub-commands. All the new
> sub-command interfaces are similar to VLAN/VNIC dladm(1M) sub-commands. IB
> partition datalinks can be created on top of IB physical links, one
> per P_Key on the port.
> These links are used for data transfers.
> The updated dladm(1M) man page describes the different commands available
> for managing the IB partition links.
>
> 4.3.2.1 create-part
>
>   Creates a new IB partition object with the specified datalink name.
>
>   Example 2: Create an IB partition link for the P_Key 0x8001 on
>              top of the ibp0 physical datalink
>
>     # dladm create-part -l ibp0 -P 0x8001 p8001.ibp0
>
>   The above command succeeds if the port is "up", the P_Key is present on
>   the port and IPoIB successfully completes its initialization. On
>   subsequent reboots after successful initial creation, the partition
>   will be available even if the port is "down" or the P_Key is absent,
>   though the datalink state will be marked as "down".
>
>   Example 3: Create an IB partition link for the P_Key 0x9000 on
>              top of ibp2
>
>     # dladm create-part -f -l ibp2 -P 0x9000 p9000.ibp2
>
>   The force option "-f" allows creating the IB partition even when
>   the P_Key is not present or the port is down. The link state will be
>   marked as down. The link state will be updated to "up" when the P_Key
>   is added to the port and the port is activated.
>
>   Example 4: Plumb and assign an IP address to IB partition p9000.ibp2
>
>     # ifconfig p9000.ibp2 plumb up
>
>     # ifconfig -a
>
>     p9000.ibp2: flags=1000843<UP,BROADCAST,RUNNING,
>         MULTICAST,IPv4> mtu 2044 index 3
>         inet 1.1.1.1 netmask ff000000 broadcast 1.255.255.255
>
>   Example 5: Display the partition links using the show-link command
>
>     # dladm show-link
>     LINK        CLASS  MTU    STATE    BRIDGE  OVER
>     p8001.ibp0  part   65520  unknown  --      ibp0
>     p9000.ibp2  part   65520  down     --      ibp2
>
> 4.3.2.2 delete-part
>
>   Deletes a specified IB partition object
>
>   Example 6: Delete the partition p8001.ibp0
>
>     # dladm delete-part p8001.ibp0
>
>   The above command deletes the partition.
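
[Illustration of the create-part and delete-part invocations in Examples 2, 3 and 6. A minimal sketch that only builds the argument lists and does not run dladm. The helper names are hypothetical. The p<PKEY>.<link> naming is simply the pattern the examples use; whether that format is required is not stated in the proposal.]

```python
from typing import List


def part_link_name(link: str, pkey: int) -> str:
    """Name in the style of the proposal's examples, e.g. p8001.ibp0.

    Hypothetical helper: the proposal does not say this format is mandatory.
    """
    return f"p{pkey:04x}.{link}"


def create_part_argv(link: str, pkey: int, force: bool = False) -> List[str]:
    """Build the argv for 'dladm create-part' as written in Examples 2 and 3."""
    argv = ["dladm", "create-part"]
    if force:
        # -f: create even if the P_Key is absent or the port is down.
        argv.append("-f")
    argv += ["-l", link, "-P", f"0x{pkey:04x}", part_link_name(link, pkey)]
    return argv


def delete_part_argv(part_link: str) -> List[str]:
    """Build the argv for 'dladm delete-part' as written in Example 6."""
    return ["dladm", "delete-part", part_link]


print(" ".join(create_part_argv("ibp0", 0x8001)))
print(" ".join(create_part_argv("ibp2", 0x9000, force=True)))
print(" ".join(delete_part_argv("p8001.ibp0")))
```

The three printed lines reproduce the command lines of Examples 2, 3 and 6.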
>
>   Example 7: Show the partition link information after
>              deleting p8001.ibp0
>
>     # dladm show-part
>
>     LINK        PKEY  OVER  STATE  FLAGS
>     p9000.ibp2  9000  ibp2  down   f---
>
> 4.3.2.3 show-part
>
>   Displays IB partition object information
>
>   Example 8: Display the IB partition links information (below output
>              is for the "part" operations in Example 2, 3 & 4)
>
>     # dladm show-part
>
>     LINK        PKEY  OVER  STATE    FLAGS
>     p8001.ibp0  8001  ibp0  unknown  ----
>     p9000.ibp2  9000  ibp2  down     f---
>
>   The state of the IB partition link will be "unknown" after the IB
>   partition is created and before it is plumbed. Once the partition is
>   plumbed, the link state will be set to "up" when the link is ready to
>   use. The state of the link will be set to "down" if 1) the HCA port is
>   down, 2) the P_Key is absent, or 3) the broadcast group is absent.
>
> 4.3.2.4 show-ib
>
>   Displays IB specific information such as port#, port GUID, etc.
>
>   Example 9: Show IB specific information
>
>     # dladm show-ib
>     LINK  HCAGUID        PORTGUID       PORT  STATE  PKEYS
>     ibp0  3BA000100CD7C  3BA000100CD7D  1     down   FFFF
>     ibp1  3BA000100CD7C  3BA000100CD7E  2     down   FFFF
>     ibp3  5AD0000033634  5AD0000033636  2     up     FFFF,8001
>     ibp2  5AD0000033634  5AD0000033635  1     up     FFFF,8001
>
>   The show-ib command displays only the physical links, port GUID, port#,
>   HCA GUID, and P_Keys present on the port at the time of running the
>   command.
>
> 4.3.3 IB partition object Administration Library
>
> The libdladm(3LIB) library is currently used to implement datalink
> administration for all the GLDv3 datalinks (VNIC, link aggregation,
> wireless, IP Tunnel, etc.). The library will be further enhanced to provide
> administrative interfaces for InfiniBand partition objects. All the new
> library extensions are similar to VLAN/VNIC libdladm extensions. IB
> specific functionality will be implemented by libdlib.c and its interface
> is provided via libdlib.h. dladm(1M) will use this library for managing
> IB partitions.
> The IB partition administration library provides a
> persistent repository which allows IB partition configuration to be stored
> across reboots. It uses the existing dladm(1M) /etc/dladm/datalink.conf
> repository to store the IB partition configuration.
>
> The list of InfiniBand specific extensions to the libdladm library is given
> below. For more details of each API, see the man pages in the PSARC
> materials directory. The man pages are only for PSARC review (not intended
> for public use).
>
> 4.3.3.1 dladm_part_create()
>         Creates an IB partition object
>
> 4.3.3.2 dladm_part_delete()
>         Deletes an IB partition object
>
> 4.3.3.3 dladm_part_info()
>         Returns IB partition object attributes
>
> 4.3.3.4 dladm_ib_info()
>         Returns IB specific attributes such as port number, port GUID,
>         and HCA GUID.
>
> 4.3.3.5 dladm_part_up()
>         Brings up one or all the IB partition objects during every boot.
>
> 4.3.4 IBTF Extensions
> ---------------------
>
> Some of the IB ULPs such as SDP, IBCM, etc. walk the device tree and read
> the "port-pkey", "hca-guid", and "port-guid" properties from the ibd(7D)
> device instance. With the new model, IB partitions are just objects (no
> longer device nodes). So, IB ULPs can no longer retrieve partition
> attributes from the device node. The following new IBTF extensions provide
> a mechanism to retrieve the partition attributes in the kernel. See the
> ibt_get_part_attrs() man page in the PSARC case directory for more details
> about the following API. The man pages are only for PSARC review (not
> intended for public use).
>
> 4.3.4.1 ibt_get_part_attrs()
>         Returns the attributes of a requested IB partition
>
> 4.3.4.2 ibt_get_all_part_attrs()
>         Returns the attributes of all the active IB partitions
>
> 4.3.4.3 ibt_free_part_attrs()
>         Frees the memory for partition attribute structure allocated by
>         ibt_get_all_part_attrs()
>
> 4.3.5 Man page changes
> ----------------------
>
> Updated man pages - published
>     dladm(1M)
>     datadm(1M)
>     dat.conf(4)
>     ibp(7D) (ibd(7d) name changed)
>
> New Man pages - internal only
>     dladm_part_create(3dladm)
>     dladm_part_delete(3dladm)
>     dladm_part_info(3dladm)
>     dladm_ib_info(3dladm)
>     dladm_part_up(3dladm)
>
>     ibt_get_part_attrs(9f)
>
> 4.3.6 Interface table
> ---------------------
>
> -----------------------------------------------------------------------
> | Interface name                       | Commitment Level             |
> -----------------------------------------------------------------------
> | dladm(1M) extensions                                                |
> -----------------------------------------------------------------------
> | create-part                          |                              |
> | delete-part                          | Committed                    |
> | show-ib                              |                              |
> | show-part                            |                              |
> -----------------------------------------------------------------------
> | libdladm extensions (libdlib.h)                                     |
> -----------------------------------------------------------------------
> | dladm_part_create                    |                              |
> | dladm_part_delete                    |                              |
> | dladm_part_show                      |                              |
> | dladm_show_ib                        | ON Consolidation Private     |
> | dladm_part_up                        |                              |
> | dladm_part_attr_t                    |                              |
> | dladm_ib_attr_t                      |                              |
> | DLADM_IBPART_FORCE_CREATE            |                              |
> -----------------------------------------------------------------------
> | InfiniBand Specific Link Properties                                 |
> -----------------------------------------------------------------------
> | linkmode                             | ON Consolidation Private     |
> -----------------------------------------------------------------------
> | libdlmgmt.h                                                         |
> -----------------------------------------------------------------------
> | DATALINK_CLASS_IBPART                | ON Consolidation Private     |
> -----------------------------------------------------------------------
> | libdladm.h                                                          |
> -----------------------------------------------------------------------
> | DLADM_STATUS_INVALID_PORT_INSTANCE   |                              |
> | DLADM_STATUS_PORT_IS_DOWN            |                              |
> | DLADM_STATUS_PKEY_NOT_PRESENT        |                              |
> | DLADM_STATUS_PARTITION_EXISTS        | ON Consolidation Private     |
> | DLADM_STATUS_INVALID_PKEY            |                              |
> | DLADM_STATUS_NO_HW_RESOURCE          |                              |
> | DLADM_STATUS_INVALID_PKEY_TBL_SIZE   |                              |
> -----------------------------------------------------------------------
> | IBTF extensions                                                     |
> -----------------------------------------------------------------------
> | ibt_get_part_attrs()                 |                              |
> | ibt_get_all_part_attrs()             | ON Consolidation Private     |
> | ibt_free_part_attrs()                |                              |
> | ibt_part_attr_t                      |                              |
> -----------------------------------------------------------------------
> | New status codes (ibt_status_t updates)                             |
> -----------------------------------------------------------------------
> | IBT_NO_SUCH_OBJECT                   | ON Consolidation Private     |
> -----------------------------------------------------------------------
>
> 4.3.7 References
> ---------------------
>
> PSARC 2001/289 IP over InfiniBand
> PSARC 2007/636 IPoIB Conversion to GLDv3
> PSARC 2009/593 IPoIB Connected Mode
>
> 6. Resources and Schedule
>     6.4. Steering Committee requested information
>         6.4.1. Consolidation C-team Name:
>                ON
>     6.5. ARC review type: FastTrack
>     6.6. ARC Exposure: open
>
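
[Illustration for anyone scripting against the show-part listing from Example 8 in 4.3.2.3. A minimal sketch that parses the sample text quoted in the proposal rather than calling dladm. The function name parse_show_part and the "forced" field are hypothetical. Reading the "f" flag as "created with -f" is an assumption based on the examples, since the proposal does not define the FLAGS column.]

```python
from typing import Dict, List, Union

# Sample listing copied from Example 8 of the proposal.
SAMPLE = """\
LINK        PKEY  OVER  STATE    FLAGS
p8001.ibp0  8001  ibp0  unknown  ----
p9000.ibp2  9000  ibp2  down     f---
"""

Row = Dict[str, Union[str, int, bool]]


def parse_show_part(text: str) -> List[Row]:
    """Parse whitespace-separated show-part output into one dict per link."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].lower().split()
    rows: List[Row] = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError(f"unexpected column count in line: {ln!r}")
        row: Row = dict(zip(header, fields))
        # The PKEY column is printed in hex without a 0x prefix.
        row["pkey"] = int(str(row["pkey"]), 16)
        # Assumption: 'f' marks a partition created with the force option.
        row["forced"] = "f" in str(row["flags"])
        rows.append(row)
    return rows


for entry in parse_show_part(SAMPLE):
    print(entry["link"], hex(int(entry["pkey"])), entry["state"], entry["forced"])
```
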