Thanks for the answers. +1 on the case, with the added comment (for the record) that it might be a good idea to research the possibility of providing trunking/aggregation-like features over IB links. I'd also like to see snoop and statistics support on the phys link in the near future; I hope that is actively planned for a follow-up, and not just a hypothetical "we can do it if we want to" sort of thing.

    - Garrett

On 03/12/10 09:45 AM, Sudhakar Dindukurti wrote:
> Garrett,
>
> On 03/12/10 08:40, Garrett D'Amore wrote:
>> Okay, here are some questions; some of these stem from my lack of
>> IPoIB knowledge, so I hope you'll pardon me if I ask things that seem
>> obvious or stupid.
>>
>> 1) Will the new "phys" objects being created here be "snoop"-able?
>> (I understand that they can't transmit data, but can they operate in
>> some sort of promiscuous receive mode?) (Oh wait, it seems that this
>> might be answered in part by 4.3.1. It looks like the answer is
>> "not yet, but in the future". Correct?)
>>
> snoop support on the "phys" link is not supported yet but can be done
> in the future.
>
>> 2) From my experience with "hermon", hermon seems mostly to treat
>> both ports as a single registered entity to the IB framework. I'm
>> presuming that this won't preclude IPoIB from being able to identify
>> which port is which?
> Yes. dladm show-ib displays the port information associated with the
> IB phys class datalinks.
>> Are there any special considerations here for automatic path migration?
> IPoIB instances can only be created one per P_Key per port per IB HCA.
> Automatic path migration (APM) feature is supported from one port to
> another port on the same HCA. So, APM is not applicable for IPoIB.
> This case does not change any of this behavior from what exists today.
>>
>> 3) Are there any constraints on the format of the name used for
>> "part-link" in dladm? The examples seem to show a specific format,
>> but can customers choose any name they want?
> Yes. The customer can choose any name they want as long as the name
> format adheres to the existing "datalink" name.
>>
>> 4) The way this is designed seems to depend on a notion of a physical
>> port. Does IPoIB have anything that is morally equivalent to
>> ethernet aggregations (trunking)? I'm thinking also about link
>> redundancy as well as bandwidth multiplication.
>>
> No.
> This case does not add/change the existing IPoIB behavior on
> these features.
>
> regards,
> Sudhakar
>
>> - Garrett
>>
>> On 03/10/10 03:52 PM, Ted Kim wrote:
>>> Template Version: @(#)sac_nextcase 1.69 02/15/10 SMI
>>> This information is Copyright 2010 Sun Microsystems
>>> 1. Introduction
>>>     1.1. Project/Component Working Name:
>>>          IPoIB Administration Enhancement
>>>     1.2. Name of Document Author/Supplier:
>>>          Author: Sudhakar Dindukurti
>>>     1.3  Date of This Document:
>>>          10 March, 2010
>>> 4. Technical Description
>>>
>>> 4.1 Acronyms
>>>
>>>     HCA   : Host Channel Adapter
>>>     P_Key : Partition Key
>>>     IPoIB : IP over InfiniBand
>>>     GUID  : Globally Unique Identifier
>>>     IBTF  : InfiniBand Transport Framework
>>>     IBCM  : InfiniBand Communication Manager
>>>     SDP   : Sockets Direct Protocol
>>>     IBD   : current IPoIB Solaris driver name
>>>     ULP   : Upper Level Protocol
>>>
>>> 4.2 Requirements/Motivation
>>> ---------------------------
>>>
>>> 4.2.1 Consistent IBD device node naming across nodes in a cluster
>>>
>>>     (Amber Road Requirement (CR 6864899))
>>>
>>>     AmberRoad clustering software requires the datalink name for a
>>>     specific partition on multiple nodes (identical h/w configuration)
>>>     to be the same. For example, if the datalink name on node 1 for
>>>     P_Key 0x8001 on HCA 1 and port 1 is ibnet0, then the clustering
>>>     software expects that the datalink name for P_Key 0x8001 on HCA 1
>>>     and port 1 is also ibnet0 on node 2 in the cluster. This is very
>>>     difficult to achieve today. The problem is that the IPoIB device
>>>     name is constructed automatically by the IPoIB driver by appending
>>>     the instance number to the driver name (e.g. ibd0, ibd1, etc). So,
>>>     the clustering software does not have any control over the IPoIB
>>>     device name space. Also, the IBTF framework does not guarantee the
>>>     same names across multiple nodes.
>>>
>>> 4.2.2 Problem diagnosis issue
>>>
>>>     IPoIB (PSARC 2001/289, PSARC 2009/593 and PSARC 2007/636) is
>>>     implemented as the ibd(7D) driver in Solaris. When customers face
>>>     problems in bringing up the IBD devices on Solaris, they are advised
>>>     to take a number of steps to gather information about the IPoIB
>>>     link, which results in usability issues. Today, users need to run
>>>     cfgadm(1M) to obtain the port GUID information, do 'ls -l' on the
>>>     /dev/ibd* nodes to determine the IB partition that an IBD device
>>>     belongs to, etc. There are also instances when it is necessary to
>>>     obtain the P_Key corresponding to the IBD device and the HCA port
>>>     that it is bound to. So, we want to improve the process of
>>>     gathering IPoIB related configuration information.
>>>
>>> 4.2.3 Update IBD driver to use Brussels framework (CR 6883212)
>>>
>>>     IPoIB tunables are managed today through the /etc/system file or
>>>     ibd.conf. For example, to modify the 'linkmode' of the IPoIB link,
>>>     one needs to edit ibd.conf and reboot the system. We want to replace
>>>     this interface so that users can manage these tunables using
>>>     dladm(1M).
>>>
>>> 4.3 Proposal
>>>
>>>     A micro/patch binding is asserted for this proposal.
>>>
>>> 4.3.0 IPoIB administration with dladm(1M)
>>> ----------------------------------------
>>>
>>>     This case proposes a new IPoIB administration mechanism for
>>>     InfiniBand network datalinks using the dladm(1M) command. Also, it
>>>     proposes to add a consolidation private InfiniBand specific library
>>>     API to libdladm(3LIB). The mechanism is very similar to the existing
>>>     VLAN, VNIC, etc. administration dladm(1M) sub-commands.
>>>
>>>     In the new model, two classes of IPoIB datalinks will exist:
>>>
>>>     1. Datalinks representing the physical IB ports, which will use the
>>>        existing "phys" class that is used for Ethernet, WiFi, and other
>>>        physical media. As with all "phys" class objects, the system
>>>        will create these automatically.
>>>
>>>     2. Datalinks representing the administratively created partitions
>>>        over "phys" IPoIB objects, which will use a new "part" class.
>>>        Each IB partition datalink will be associated with a P_Key in a
>>>        manner analogous to the way each Ethernet VLAN datalink is
>>>        associated with a VLAN ID.
>>>
>>>     Note that unlike other "phys" class objects, IB "phys" objects
>>>     cannot send data (since IB requires a P_Key to send data) and thus
>>>     cannot be plumbed.
>>>
>>>     Example configuration used to explain the new IPoIB administration
>>>     model
>>>
>>>     --------------------------------------
>>>     |             IB switch              |
>>>     --------------------------------------
>>>                   |             |
>>>                   |             |
>>>                   |             |
>>>       Port1 |     | Port2 Port1 |     | Port2
>>>     --------------------------------------
>>>     |      HCA1                 HCA2     |
>>>     |               Node 1               |
>>>     --------------------------------------
>>>
>>>     Port 2 of HCA1 and Port 1 of HCA2 are connected to the IB switch,
>>>     and the SM is running on the switch. Each port is configured with
>>>     two P_Keys (0xffff & 0x8001).
>>>
>>>
>>> 4.3.1 Physical datalinks
>>> -------------------------
>>>
>>>     One physical datalink will be created by default per port per HCA.
>>>     These physical links serve as administrative & observability data
>>>     points. These IB physical datalinks allow creating IB partitions
>>>     over them, similar to creating VNICs on Ethernet physical links or
>>>     link aggregations. IB physical datalinks are not used for data
>>>     transfers. So, plumbing and assigning an IB address are not
>>>     supported on these links. In the future, these physical datalinks
>>>     can be used for 1) implementing port level statistics, 2)
>>>     implementing port level snoop, etc.
>>>
>>>     Example 1.
>>>       # dladm show-phys
>>>       LINK  MEDIA       STATE  SPEED  DUPLEX   DEVICE
>>>       ibp0  InfiniBand  up     8000   unknown  ibp0
>>>       ibp1  InfiniBand  down   8000   unknown  ibp1
>>>       ibp2  InfiniBand  down   2000   unknown  ibp2
>>>       ibp3  InfiniBand  up     2000   unknown  ibp3
>>>
>>>     The state of the physical link directly corresponds to the state of
>>>     the IB HCA port. As you might expect, other generic sub-commands
>>>     such as rename-link, show-link, delete-phys, etc. also work on IB
>>>     datalinks.
>>>
>>> 4.3.2 IB partition Objects
>>> --------------------------
>>>
>>>     IB partition objects represent a new "part" class of datalink, and
>>>     these objects are managed using the new dladm(1M) sub-commands. All
>>>     the new sub-command interfaces are similar to the VLAN/VNIC
>>>     dladm(1M) sub-commands. IB partition datalinks can be created on
>>>     top of IB physical links, one per P_Key on the port. These links
>>>     are used for data transfers. The updated dladm(1M) man page
>>>     describes the different commands available for managing the IB
>>>     partition links.
>>>
>>>     4.3.2.1 create-part
>>>
>>>       Creates a new IB partition object with the specified datalink
>>>       name.
>>>
>>>       Example 2: Create an IB partition link for the P_Key 0x8001 on
>>>       top of the ibp0 physical datalink
>>>
>>>       # dladm create-part -l ibp0 -P 0x8001 p8001.ibp0
>>>
>>>       The above command succeeds if the port is "up", the P_Key is
>>>       present on the port and IPoIB successfully completes its
>>>       initialization. On subsequent reboots after successful initial
>>>       creation, the partition will be available even if the port is
>>>       "down" or the P_Key is absent, though the datalink state will be
>>>       marked as "down".
>>>
>>>       Example 3: Create an IB partition link for the P_Key 0x9000 on
>>>       top of ibp2
>>>
>>>       # dladm create-part -f -l ibp2 -P 0x9000 p9000.ibp2
>>>
>>>       The force option "-f" allows creating the IB partition even when
>>>       the P_Key is not present or the port is down. The link state
>>>       will be marked as down. The link state will be updated to "up"
>>>       when the P_Key is added to the port and the port is activated.
>>>
>>>       Example 4: Plumb and assign an IP address to IB partition
>>>       p9000.ibp2
>>>
>>>       # ifconfig p9000.ibp2 plumb up
>>>
>>>       # ifconfig -a
>>>
>>>       p9000.ibp2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 2044 index 3
>>>               inet 1.1.1.1 netmask ff000000 broadcast 1.255.255.255
>>>
>>>       Example 5: Display the partition links using the show-link command
>>>
>>>       # dladm show-link
>>>       LINK        CLASS  MTU    STATE    BRIDGE  OVER
>>>       p8001.ibp0  part   65520  unknown  --      ibp0
>>>       p9000.ibp2  part   65520  down     --      ibp2
>>>
>>>     4.3.2.2 delete-part
>>>
>>>       Deletes a specified IB partition object
>>>
>>>       Example 6: Delete the partition p8001.ibp0
>>>
>>>       # dladm delete-part p8001.ibp0
>>>
>>>       The above command deletes the partition.
>>>
>>>       Example 7: Show the partition link information after deleting
>>>       p8001.ibp0
>>>
>>>       # dladm show-part
>>>
>>>       LINK        PKEY  OVER  STATE  FLAGS
>>>       p9000.ibp2  9000  ibp2  down   f---
>>>
>>>     4.3.2.3 show-part
>>>
>>>       Displays IB partition object information
>>>
>>>       Example 8: Display the IB partition links information (the output
>>>       below is for the "part" operations in Examples 2, 3 & 4)
>>>
>>>       # dladm show-part
>>>
>>>       LINK        PKEY  OVER  STATE    FLAGS
>>>       p8001.ibp0  8001  ibp0  unknown  ----
>>>       p9000.ibp2  9000  ibp2  down     f---
>>>
>>>       The state of the IB partition link will be "unknown" after the IB
>>>       partition is created and before it is plumbed. Once the partition
>>>       is plumbed, the link state will be set to "up" when the link is
>>>       ready to use. The state of the link will be set to "down" if
>>>       1) the HCA port is down, 2) the P_Key is absent, or 3) the
>>>       broadcast group is absent.
>>>
>>>     4.3.2.4 show-ib
>>>
>>>       Displays IB specific information such as port#, port GUID, etc.
>>>
>>>       Example 9: Show IB specific information
>>>
>>>       # dladm show-ib
>>>       LINK  HCAGUID        PORTGUID       PORT  STATE  PKEYS
>>>       ibp0  3BA000100CD7C  3BA000100CD7D  1     down   FFFF
>>>       ibp1  3BA000100CD7C  3BA000100CD7E  2     down   FFFF
>>>       ibp3  5AD0000033634  5AD0000033636  2     up     FFFF,8001
>>>       ibp2  5AD0000033634  5AD0000033635  1     up     FFFF,8001
>>>
>>>       The show-ib command displays only the physical links, port GUID,
>>>       port#, HCA GUID, and P_Keys present on the port at the time of
>>>       running the command.
>>>
>>> 4.3.3 IB partition object Administration Library
>>>
>>>     The libdladm(3LIB) library is currently used to implement datalink
>>>     administration for all the GLDv3 datalinks (VNIC, link aggregation,
>>>     wireless, IP Tunnel, etc.). The library will be further enhanced to
>>>     provide administrative interfaces for InfiniBand partition objects.
>>>     All the new library extensions are similar to the VLAN/VNIC
>>>     libdladm extensions. IB specific functionality will be implemented
>>>     by libdlib.c and its interface is provided via libdlib.h. dladm(1M)
>>>     will use this library for managing IB partitions. The IB partition
>>>     administration library provides a persistent repository which
>>>     allows IB partition configuration to be stored across reboots. It
>>>     uses the existing dladm(1M) /etc/dladm/datalink.conf repository to
>>>     store the IB partition configuration.
>>>
>>>     The list of InfiniBand specific extensions to the libdladm library
>>>     is given below. For more details of each API, see the man pages in
>>>     the PSARC materials directory. The man pages are only for PSARC
>>>     review (not intended for public use).
>>>
>>>     4.3.3.1 dladm_part_create()
>>>       Creates an IB partition object
>>>
>>>     4.3.3.2 dladm_part_delete()
>>>       Deletes an IB partition object
>>>
>>>     4.3.3.3 dladm_part_info()
>>>       Returns IB partition object attributes
>>>
>>>     4.3.3.4 dladm_ib_info()
>>>       Returns IB specific attributes such as port number, port GUID,
>>>       and HCA GUID.
>>>
>>>     4.3.3.5 dladm_part_up()
>>>       Brings up one or all the IB partition objects during every boot.
>>>
>>> 4.3.4 IBTF Extensions
>>> ---------------------
>>>
>>>     Some of the IB ULPs such as SDP, IBCM, etc. walk the device tree
>>>     and read the "port-pkey", "hca-guid", and "port-guid" properties
>>>     from the ibd(7D) device instance. With the new model, IB partitions
>>>     are just objects (no longer device nodes). So, IB ULPs can no
>>>     longer retrieve partition attributes from the device node. The
>>>     following new IBTF extensions provide a mechanism to retrieve the
>>>     partition attributes in the kernel. See the ibt_get_part_attrs()
>>>     man page in the PSARC case directory for more details about the
>>>     following API. The man pages are only for PSARC review (not
>>>     intended for public use).
>>>
>>>     4.3.4.1 ibt_get_part_attrs()
>>>       Returns the attributes of a requested IB partition
>>>
>>>     4.3.4.2 ibt_get_all_part_attrs()
>>>       Returns the attributes of all the active IB partitions
>>>
>>>     4.3.4.3 ibt_free_part_attrs()
>>>       Frees the memory for the partition attribute structure allocated
>>>       by ibt_get_all_part_attrs()
>>>
>>> 4.3.5 Man page changes
>>> ----------------------
>>>
>>>     Updated man pages - published
>>>       dladm(1M)
>>>       datadm(1M)
>>>       dat.conf(4)
>>>       ibp(7D) (ibd(7d) name changed)
>>>
>>>     New man pages - internal only
>>>       dladm_part_create(3dladm)
>>>       dladm_part_delete(3dladm)
>>>       dladm_part_info(3dladm)
>>>       dladm_ib_info(3dladm)
>>>       dladm_part_up(3dladm)
>>>
>>>       ibt_get_part_attrs(9f)
>>>
>>> 4.3.6 Interface table
>>> ---------------------
>>>
>>> -----------------------------------------------------------------------
>>> | Interface name                        | Commitment Level            |
>>> -----------------------------------------------------------------------
>>> | dladm(1M) extensions                                                |
>>> -----------------------------------------------------------------------
>>> | create-part                           |                             |
>>> | delete-part                           | committed                   |
>>> | show-ib                               |                             |
>>> | show-part                             |                             |
>>> -----------------------------------------------------------------------
>>> | libdladm extensions (libdlib.h)                                     |
>>> -----------------------------------------------------------------------
>>> | dladm_part_create                     |                             |
>>> | dladm_part_delete                     |                             |
>>> | dladm_part_show                       |                             |
>>> | dladm_show_ib                         | ON Consolidation Private    |
>>> | dladm_part_up                         |                             |
>>> | dladm_part_attr_t                     |                             |
>>> | dladm_ib_attr_t                       |                             |
>>> | DLADM_IBPART_FORCE_CREATE             |                             |
>>> -----------------------------------------------------------------------
>>> | InfiniBand Specific Link Properties                                 |
>>> -----------------------------------------------------------------------
>>> | linkmode                              | ON Consolidation Private    |
>>> -----------------------------------------------------------------------
>>> | libdlmgmt.h                                                         |
>>> -----------------------------------------------------------------------
>>> | DATALINK_CLASS_IBPART                 | ON Consolidation Private    |
>>> -----------------------------------------------------------------------
>>> | libdladm.h                                                          |
>>> -----------------------------------------------------------------------
>>> | DLADM_STATUS_INVALID_PORT_INSTANCE    |                             |
>>> | DLADM_STATUS_PORT_IS_DOWN             |                             |
>>> | DLADM_STATUS_PKEY_NOT_PRESENT         |                             |
>>> | DLADM_STATUS_PARTITION_EXISTS         | ON Consolidation Private    |
>>> | DLADM_STATUS_INVALID_PKEY             |                             |
>>> | DLADM_STATUS_NO_HW_RESOURCE           |                             |
>>> | DLADM_STATUS_INVALID_PKEY_TBL_SIZE    |                             |
>>> -----------------------------------------------------------------------
>>> | IBTF extensions                                                     |
>>> -----------------------------------------------------------------------
>>> | ibt_get_part_attrs()                  |                             |
>>> | ibt_get_all_part_attrs()              | ON Consolidation Private    |
>>> | ibt_free_part_attrs()                 |                             |
>>> | ibt_part_attr_t                       |                             |
>>> -----------------------------------------------------------------------
>>> | New status codes (ibt_status_t updates)                             |
>>> -----------------------------------------------------------------------
>>> | IBT_NO_SUCH_OBJECT                    | ON Consolidation Private    |
>>> -----------------------------------------------------------------------
>>>
>>> 4.3.7 References
>>> ----------------
>>>
>>>     PSARC 2001/289 IP over InfiniBand
>>>     PSARC 2007/636 IPoIB Conversion to GLDv3
>>>     PSARC 2009/593 IPoIB Connected Mode
>>>
>>> 6. Resources and Schedule
>>>     6.4. Steering Committee requested information
>>>          6.4.1. Consolidation C-team Name:
>>>                 ON
>>>     6.5. ARC review type: FastTrack
>>>     6.6. ARC Exposure: open
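
For anyone who ends up scripting against the show-part output in Example 8 of the quoted case, here is a rough sketch of how it might be parsed. It is only an illustration: the sample text is copied from that example (the exact column spacing is assumed), the function name parse_show_part is made up, and reading a leading "f" in FLAGS as "created with -f" is an inference from Examples 3 and 8, not something the case states.

```python
# Sample text copied from Example 8 of the quoted case; column spacing is assumed.
SAMPLE_SHOW_PART = """\
LINK        PKEY  OVER  STATE    FLAGS
p8001.ibp0  8001  ibp0  unknown  ----
p9000.ibp2  9000  ibp2  down     f---
"""


def parse_show_part(text):
    """Parse whitespace-separated `dladm show-part` style output into dicts.

    Hypothetical helper: the real command may offer a parseable output
    mode, which would be preferable to splitting on whitespace.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError(f"unexpected row: {line!r}")
        row = dict(zip(header, fields))
        # The PKEY column is printed as bare hex digits in the example.
        row["PKEY"] = int(row["PKEY"], 16)
        # Assumption: a leading 'f' in FLAGS marks a partition created with -f.
        row["FORCED"] = row["FLAGS"].startswith("f")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for part in parse_show_part(SAMPLE_SHOW_PART):
        print(
            f"{part['LINK']}: pkey=0x{part['PKEY']:04x} over={part['OVER']} "
            f"state={part['STATE']} forced={part['FORCED']}"
        )
```

Converting PKEY to an integer makes it easy to compare against the value passed to create-part -P, which is given with a 0x prefix in the examples.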