Thanks for the answers.  +1 on the case, with the added comment (for the 
record) that it might be a good idea to research the possibility of 
providing trunking/aggregation-like features over IB links.  I'd also 
like to see snoop and statistics support on the phys link in the near 
future; I hope that is actively planned for a follow-up, and not just a 
hypothetical "we can do it if we want to" sort of thing.

     - Garrett

On 03/12/10 09:45 AM, Sudhakar Dindukurti wrote:
> Garrett,
>
> On 03/12/10 08:40, Garrett D'Amore wrote:
>> Okay, here are some questions; some of these stem from my lack of 
>> IPoIB knowledge, so I hope you'll pardon me if I ask things that seem 
>> obvious or stupid.
>>
>> 1) Will the new "phys" objects being created here be "snoop"-able?  
>> (I understand that they can't transmit data, but can they operate in 
>> some sort of promiscuous receive mode?)  (Oh wait, it seems that this 
>> might be answered in part by 4.3.1.   It looks like the answer is 
>> "not yet, but in the future".  Correct?)
>>
> Snoop on the "phys" link is not supported yet, but it can be added 
> in the future.
>
>> 2) From my experience with "hermon", hermon seems mostly to treat 
>> both ports as a single registered entity to the IB framework.  I'm 
>> presuming that this won't preclude IPoIB from being able to identify 
>> which port is which?
> Yes. dladm show-ib displays the port information associated with the 
> IB phys class datalinks.
>> Are there any special considerations here for automatic path migration?
> Only one IPoIB instance can be created per P_Key per port per IB HCA. 
> The automatic path migration (APM) feature is supported from one port 
> to another port on the same HCA, so APM is not applicable to IPoIB. 
> This case does not change any of this behavior from what exists today.
>>
>> 3) Are there any constraints on the format of the name used for 
>> "part-link" in dladm?  The examples seem to show a specific format, 
>> but can customers choose any name they want?
> Yes. The customer can choose any name they want as long as it 
> adheres to the existing "datalink" name format.
>>
>> 4) The way this is designed seems to depend on a notion of a physical 
>> port.  Does IPoIB have anything that is morally equivalent to 
>> ethernet aggregations (trunking)?  I'm thinking also about link 
>> redundancy as well as bandwidth multiplication.
>>
> No.  This case does not add to or change the existing IPoIB behavior 
> for these features.
>
> regards,
> Sudhakar
>
>>     - Garrett
>>
>> On 03/10/10 03:52 PM, Ted Kim wrote:
>>> Template Version: @(#)sac_nextcase 1.69 02/15/10 SMI
>>> This information is Copyright 2010 Sun Microsystems
>>> 1. Introduction
>>>      1.1. Project/Component Working Name:
>>>      IPoIB Administration Enhancement
>>>      1.2. Name of Document Author/Supplier:
>>>      Author:  Sudhakar Dindukurti
>>>      1.3  Date of This Document:
>>>     10 March, 2010
>>> 4. Technical Description
>>>
>>> 4.1 Acronyms
>>>
>>> HCA               : Host Channel Adaptor
>>> P_Key             : Partition Key
>>> IPoIB             : IP over InfiniBand
>>> GUID              : Global Unique Identifier
>>> IBTF              : InfiniBand Transport Framework
>>> IBCM              : InfiniBand Communication Manager
>>> SDP               : Sockets Direct Protocol
>>> IBD               : current IPoIB Solaris driver name
>>> ULP               : Upper level Protocol
>>>
>>> 4.2 Requirements/Motivation
>>> ---------------------------
>>>
>>> 4.2.1 Consistent IBD device node naming across nodes in a cluster
>>>
>>> (Amber Road Requirement (CR 6864899))
>>>
>>> AmberRoad clustering software requires that the datalink name for a
>>> specific partition be the same on multiple nodes (with identical h/w
>>> configuration). For example, if the datalink name on node 1 for P_Key
>>> 0x8001 on HCA 1 and port 1 is ibnet0, then the clustering software
>>> expects that the datalink name for P_Key 0x8001 on HCA 1 and port 1 is
>>> also ibnet0 on node 2 in the cluster. This is very difficult to
>>> achieve today. The problem is that the IPoIB device name is
>>> constructed automatically by the IPoIB driver by appending the
>>> instance number to the driver name (e.g. ibd0, ibd1, etc.). So, the
>>> clustering software does not have any control over the IPoIB device
>>> name space. Also, the IBTF framework does not guarantee the same names
>>> across multiple nodes.
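
The naming requirement in 4.2.1 amounts to a consistency check across nodes. The Python sketch below is only an illustration of that check, not part of the case: the function name `naming_mismatches`, the (HCA, port, P_Key) tuple layout and the node names are all hypothetical, and the per-node link maps are assumed to have been collected by some other means.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (HCA index, port number, P_Key) -> datalink name on one node.
PartKey = Tuple[int, int, int]
LinkMap = Dict[PartKey, str]


def naming_mismatches(nodes: Dict[str, LinkMap]) -> List[Tuple[PartKey, Dict[str, str]]]:
    """Return every (hca, port, pkey) whose datalink name differs between nodes."""
    keys = set()
    for links in nodes.values():
        keys.update(links)

    mismatches = []
    for key in sorted(keys):
        # A node that lacks the partition entirely also counts as a mismatch.
        names = {node: links.get(key, "<missing>") for node, links in nodes.items()}
        if len(set(names.values())) > 1:
            mismatches.append((key, names))
    return mismatches


if __name__ == "__main__":
    cluster = {
        "node1": {(1, 1, 0x8001): "ibnet0"},
        "node2": {(1, 1, 0x8001): "ibnet1"},  # instance-derived name drifted
    }
    for key, names in naming_mismatches(cluster):
        print(key, names)
```

With administratively chosen partition link names, the check should come back empty on identically configured nodes.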
>>>
>>> 4.2.2 Problem diagnosis issue
>>>
>>> IPoIB (PSARC 2001/289, PSARC 2009/593 and PSARC 2007/636) is
>>> implemented as the ibd(7D) driver in Solaris. When customers face
>>> problems in bringing up the IBD devices on Solaris, they are advised
>>> to take a number of steps to gather information about the IPoIB link,
>>> which results in usability issues. Today, users need to run cfgadm(1M)
>>> to obtain the port GUID information, do 'ls -l' on the /dev/ibd* nodes
>>> to determine the IB partition that an IBD device belongs to, etc.
>>> There are also instances when it is necessary to obtain the P_Key
>>> corresponding to the IBD device and the HCA port that it is bound to.
>>> So, we want to improve the process of gathering IPoIB related
>>> configuration information.
>>>
>>> 4.2.3 Update IBD driver to use Brussels framework (CR 6883212)
>>>
>>> IPoIB tunables are managed today through the /etc/system file or
>>> ibd.conf. For example, to modify the 'linkmode' of the IPoIB link, one
>>> needs to edit ibd.conf and reboot the system. We want to replace this
>>> interface so that users can manage these tunables using dladm(1M).
>>>
>>> 4.3 Proposal
>>>
>>> A micro/patch binding is asserted for this proposal.
>>>
>>> 4.3.0 IPoIB administration with dladm(1M)
>>> ----------------------------------------
>>>
>>> This case proposes a new IPoIB administration mechanism for InfiniBand
>>> network datalinks using the dladm(1M) command. It also proposes to add
>>> a consolidation private InfiniBand specific library API to
>>> libdladm(3LIB). The mechanism is very similar to the existing VLAN,
>>> VNIC, etc. administration dladm(1M) sub-commands.
>>>
>>> In the new model, two classes of IPoIB datalinks will exist:
>>>
>>>       1. Datalinks representing the physical IB ports, which will use
>>>          the existing "phys" class that is used for Ethernet, WiFi, and
>>>          other physical media. As with all "phys" class objects, the
>>>          system will create these automatically.
>>>
>>>       2. Datalinks representing the administratively created partitions
>>>          over "phys" IPoIB objects, which will use a new "part" class.
>>>          Each IB partition datalink will be associated with a P_Key in
>>>          a manner analogous to the way each Ethernet VLAN datalink is
>>>          associated with a VLAN ID.
>>>
>>> Note that unlike other "phys" class objects, IB "phys" objects cannot
>>> send data (since IB requires a P_Key to send data) and thus cannot be
>>> plumbed.
>>>
>>> Example configuration used to explain the new IPoIB administration
>>> model:
>>>
>>>                 --------------------------------------
>>>                |              IB switch               |
>>>                 --------------------------------------
>>>                     |                         |
>>>                     |                         |
>>>                     |                         |
>>>           Port1 |   | Port2           Port1   |   | Port2
>>>                 --------------------------------------
>>>                |  HCA1                         HCA2   |
>>>                |              Node 1                  |
>>>                 --------------------------------------
>>>
>>> Port 2 of HCA1 and Port 1 of HCA2 are connected to the IB switch, and
>>> the subnet manager (SM) is running on the switch. Each port is
>>> configured with two P_Keys (0xffff & 0x8001).
>>>
>>>
>>> 4.3.1 Physical datalinks
>>> -------------------------
>>>
>>> One physical datalink will be created by default per port per HCA.
>>> These physical links serve as administrative & observability data
>>> points. These IB physical datalinks allow creating IB partitions over
>>> them, similar to creating VNICs on Ethernet physical links or link
>>> aggregations. IB physical datalinks are not used for data transfers,
>>> so plumbing them and assigning an IB address are not supported on
>>> these links. In the future, these physical datalinks can be used for
>>> 1) implementing port level statistics, 2) implementing port level
>>> snoop, etc.
>>>
>>> Example 1. # dladm show-phys
>>> LINK     MEDIA        STATE      SPEED     DUPLEX     DEVICE
>>> ibp0    InfiniBand     up        8000      unknown    ibp0
>>> ibp1    InfiniBand    down       8000      unknown    ibp1
>>> ibp2    InfiniBand    down       2000      unknown    ibp2
>>> ibp3    InfiniBand     up        2000      unknown    ibp3
>>>
>>> The state of the physical link directly corresponds to the state of
>>> the IB HCA port. As you might expect, other generic sub-commands such
>>> as rename-link, show-link, delete-phys, etc. also work on IB
>>> datalinks.
>>>
>>> 4.3.2 IB partition Objects
>>> --------------------------
>>>
>>> IB partition objects represent a new "part" class of datalink, and
>>> these objects are managed using the new dladm(1M) sub-commands. All
>>> the new sub-command interfaces are similar to the VLAN/VNIC dladm(1M)
>>> sub-commands. IB partition datalinks can be created on top of IB
>>> physical links, one per P_Key on the port. These links are used for
>>> data transfers. The updated dladm(1M) man page describes the different
>>> commands available for managing the IB partition links.
>>>
>>> 4.3.2.1 create-part
>>>
>>>        Creates a new IB partition object with the specified datalink
>>>        name.
>>>
>>>        Example 2: Create an IB partition link for the P_Key 0x8001 on
>>>                   top of the ibp0 physical datalink
>>>
>>>        # dladm create-part -l ibp0 -P 0x8001 p8001.ibp0
>>>
>>>        The above command succeeds if the port is "up", the P_Key is
>>>        present on the port and IPoIB successfully completes its
>>>        initialization. On subsequent reboots after successful initial
>>>        creation, the partition will be available even if the port is
>>>        "down" or the P_Key is absent, though the datalink state will be
>>>        marked as "down".
>>>
>>>        Example 3: Create an IB partition link for the P_Key 0x9000 on
>>>                   top of ibp2
>>>
>>>        # dladm create-part -f  -l ibp2 -P 0x9000 p9000.ibp2
>>>
>>>        The force option "-f" allows the IB partition to be created even
>>>        when the P_Key is not present or the port is down. The link state
>>>        will be marked as "down". The link state will be updated to "up"
>>>        when the P_Key is added to the port and the port is activated.
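
The create-part rules in Examples 2 and 3 can be summarized as a small decision function. This Python sketch is a model of the text only, not the dladm implementation: `create_part_outcome` is a hypothetical name, and the sketch assumes IPoIB initialization succeeds whenever the port is up and the P_Key is present.

```python
from typing import Optional, Tuple


def create_part_outcome(port_up: bool, pkey_present: bool,
                        force: bool = False) -> Tuple[bool, Optional[str]]:
    """Model whether create-part succeeds and the initial link state.

    Returns (created, state). The state is None when creation is refused.
    Assumption: IPoIB initialization always succeeds when the port is up
    and the P_Key is present.
    """
    if port_up and pkey_present:
        # Created normally; the link stays "unknown" until it is plumbed.
        return True, "unknown"
    if force:
        # -f creates the partition anyway, but the link is marked "down".
        return True, "down"
    return False, None


if __name__ == "__main__":
    print(create_part_outcome(port_up=True, pkey_present=True))
    print(create_part_outcome(port_up=False, pkey_present=True))
    print(create_part_outcome(port_up=True, pkey_present=False, force=True))
```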
>>>
>>>        Example 4: Plumb and assign an IP address to IB partition
>>>                   p9000.ibp2
>>>
>>>        # ifconfig p9000.ibp2 plumb up
>>>
>>>        # ifconfig -a
>>>
>>>        p9000.ibp2: flags=1000843<UP,BROADCAST,RUNNING,
>>>             MULTICAST,IPv4>  mtu 2044 index 3
>>>          inet 1.1.1.1 netmask ff000000 broadcast 1.255.255.255
>>>
>>>       Example 5: Display the partition links using show-link command
>>>
>>>       # dladm show-link
>>>       LINK        CLASS     MTU    STATE    BRIDGE       OVER
>>>       p8001.ibp0  part      65520  unknown    --         ibp0
>>>       p9000.ibp2  part      65520  down       --         ibp2
>>>
>>> 4.3.2.2 delete-part
>>>
>>>        Deletes a specified IB partition object
>>>
>>>        Example 6: Delete the partition p8001.ibp0
>>>
>>>        # dladm delete-part p8001.ibp0
>>>
>>>        The above command deletes the partition.
>>>
>>>        Example 7: Show the partition link information after
>>>                   deleting p8001.ibp0
>>>
>>>        # dladm show-part
>>>
>>>        LINK        PKEY      OVER     STATE      FLAGS
>>>       p9000.ibp2   9000      ibp2      down       f---
>>>
>>> 4.3.2.3 show-part
>>>
>>>       Displays IB partition object information
>>>
>>>       Example 8: Display the IB partition link information (the output
>>>                  below is for the "part" operations in Examples 2, 3 &
>>>                  4)
>>>
>>>       # dladm show-part
>>>
>>>       LINK           PKEY      OVER       STATE      FLAGS
>>>     p8001.ibp0       8001      ibp0      unknown     ----
>>>     p9000.ibp2       9000      ibp2      down        f---
>>>
>>>     The state of the IB partition link will be "unknown" after the IB
>>>     partition is created and before it is plumbed. Once the partition is
>>>     plumbed, the link state will be set to "up" when the link is ready to
>>>     use. The state of the link will be set to "down" if 1) the HCA port
>>>     is down, 2) the P_Key is absent, or 3) the broadcast group is absent.
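
The state rules above can be written down as a single function. This is a Python sketch of the prose only; `part_link_state` is a hypothetical name. The text does not say which rule wins when an unplumbed partition also has a "down" condition, so the sketch assumes "down" takes precedence, which matches the force-created partition in Example 3 being marked down at creation.

```python
def part_link_state(plumbed: bool, port_up: bool, pkey_present: bool,
                    bcast_group_present: bool) -> str:
    """Model the STATE column of show-part for one IB partition link."""
    # Any of the three "down" conditions overrides everything else
    # (assumed precedence, see the note above).
    if not (port_up and pkey_present and bcast_group_present):
        return "down"
    # Created but not yet plumbed.
    if not plumbed:
        return "unknown"
    # Plumbed and ready to use.
    return "up"


if __name__ == "__main__":
    print(part_link_state(plumbed=False, port_up=True,
                          pkey_present=True, bcast_group_present=True))
    print(part_link_state(plumbed=True, port_up=True,
                          pkey_present=False, bcast_group_present=True))
```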
>>>
>>> 4.3.2.4 show-ib
>>>
>>>      Displays IB specific information such as port#, port guid, etc.
>>>
>>>      Example 9: Show IB specific information
>>>
>>>      # dladm show-ib
>>>      LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
>>>      ibp0         3BA000100CD7C   3BA000100CD7D   1    down   FFFF
>>>      ibp1         3BA000100CD7C   3BA000100CD7E   2    down   FFFF
>>>      ibp3         5AD0000033634   5AD0000033636   2    up     FFFF,8001
>>>      ibp2         5AD0000033634   5AD0000033635   1    up     FFFF,8001
>>>
>>>      The show-ib command displays only the physical links, port GUID,
>>>      port#, HCA GUID, and P_Keys present on the port at the time of
>>>      running the command.
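
Since show-ib gathers the port, state and P_Key information in one place, a script can pick the physical links on which a given partition could be created. The Python sketch below parses the sample output from Example 9. It assumes the real output keeps the same whitespace-separated columns and a comma-separated PKEYS field with no embedded spaces; `parse_show_ib` and `links_with_pkey` are hypothetical helper names.

```python
from typing import Dict, List

# Sample text copied from Example 9.
SHOW_IB_OUTPUT = """\
LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
ibp0         3BA000100CD7C   3BA000100CD7D   1    down   FFFF
ibp1         3BA000100CD7C   3BA000100CD7E   2    down   FFFF
ibp3         5AD0000033634   5AD0000033636   2    up     FFFF,8001
ibp2         5AD0000033634   5AD0000033635   1    up     FFFF,8001
"""


def parse_show_ib(text: str) -> List[Dict[str, object]]:
    """Turn show-ib style output into one dict per physical link."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row: Dict[str, object] = dict(zip(header, line.split()))
        row["PORT"] = int(str(row["PORT"]))
        # P_Keys are printed in hex without a 0x prefix.
        row["PKEYS"] = [int(p, 16) for p in str(row["PKEYS"]).split(",")]
        rows.append(row)
    return rows


def links_with_pkey(rows: List[Dict[str, object]], pkey: int,
                    state: str = "up") -> List[str]:
    """Names of physical links in the given state that carry the P_Key."""
    return [str(r["LINK"]) for r in rows
            if r["STATE"] == state and pkey in r["PKEYS"]]


if __name__ == "__main__":
    rows = parse_show_ib(SHOW_IB_OUTPUT)
    print(links_with_pkey(rows, 0x8001))
```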
>>>
>>> 4.3.3 IB partition object Administration Library
>>>
>>> The libdladm(3LIB) library is currently used to implement datalink
>>> administration for all the GLDv3 datalinks (VNIC, link aggregation,
>>> wireless, IP Tunnel, etc.). The library will be further enhanced to
>>> provide administrative interfaces for InfiniBand partition objects.
>>> All the new library extensions are similar to the VLAN/VNIC libdladm
>>> extensions. IB specific functionality will be implemented by libdlib.c
>>> and its interface is provided via libdlib.h. dladm(1M) will use this
>>> library for managing IB partitions. The IB partition administration
>>> library provides a persistent repository which allows IB partition
>>> configuration to be stored across reboots. It uses the existing
>>> dladm(1M) /etc/dladm/datalink.conf repository to store the IB
>>> partition configuration.
>>>
>>> The list of InfiniBand specific extensions to the libdladm library is
>>> given below. For more details of each API, see the man pages in the
>>> PSARC materials directory. The man pages are only for PSARC review
>>> (not intended for public use).
>>>
>>> 4.3.3.1 dladm_part_create()
>>>      Creates an IB partition object
>>>
>>> 4.3.3.2 dladm_part_delete()
>>>      Deletes an IB partition object
>>>
>>> 4.3.3.3 dladm_part_info()
>>>      Returns IB partition object attributes
>>>
>>> 4.3.3.4 dladm_ib_info()
>>>      Returns IB specific attributes such as port number, port guid,
>>>      and HCA GUID.
>>>
>>> 4.3.3.5 dladm_part_up()
>>>      Brings up one or all the IB partition objects during every boot.
>>>
>>> 4.3.4 IBTF Extensions
>>> ---------------------
>>>
>>> Some of the IB ULPs such as SDP, IBCM, etc. walk the device tree and
>>> read the "port-pkey", "hca-guid", and "port-guid" properties from the
>>> ibd(7D) device instance. With the new model, IB partitions are just
>>> objects (no longer device nodes). So, IB ULPs can no longer retrieve
>>> partition attributes from the device node. The following new IBTF
>>> extensions provide a mechanism to retrieve the partition attributes in
>>> the kernel. See the ibt_get_part_attrs() man page in the PSARC case
>>> directory for more details about the following APIs. The man pages
>>> are only for PSARC review (not intended for public use).
>>>
>>> 4.3.4.1 ibt_get_part_attrs()
>>>        Returns the attributes of a requested IB partition
>>>
>>> 4.3.4.2 ibt_get_all_part_attrs()
>>>        Returns the attributes of all the active IB partitions
>>>
>>> 4.3.4.3 ibt_free_part_attrs()
>>>        Frees the memory for partition attribute structure allocated by
>>>        ibt_get_all_part_attrs()
>>>
>>> 4.3.5 Man page changes
>>> ----------------------
>>>
>>>      Updated man pages - published
>>>           dladm(1M)
>>>           datadm(1M)
>>>           dat.conf(4)
>>>           ibp(7D)  (renamed from ibd(7D))
>>>
>>>      New Man pages - internal only
>>>           dladm_part_create(3dladm)
>>>           dladm_part_delete(3dladm)
>>>           dladm_part_info(3dladm)
>>>           dladm_ib_info(3dladm)
>>>           dladm_part_up(3dladm)
>>>
>>>           ibt_get_part_attrs(9f)
>>>
>>> 4.3.6 Interface table
>>> ---------------------
>>>
>>>  -----------------------------------------------------------------------
>>> |    Interface name                      |  Commitment Level            |
>>>  -----------------------------------------------------------------------
>>> |  dladm(1M) extensions                                                 |
>>>  -----------------------------------------------------------------------
>>> |      create-part                       |                              |
>>> |      delete-part                       | Committed                    |
>>> |      show-ib                           |                              |
>>> |      show-part                         |                              |
>>>  -----------------------------------------------------------------------
>>> |  libdladm extensions (libdlib.h)                                      |
>>>  -----------------------------------------------------------------------
>>> |     dladm_part_create                  |                              |
>>> |     dladm_part_delete                  |                              |
>>> |     dladm_part_show                    |                              |
>>> |     dladm_show_ib                      | ON Consolidation Private     |
>>> |     dladm_part_up                      |                              |
>>> |     dladm_part_attr_t                  |                              |
>>> |     dladm_ib_attr_t                    |                              |
>>> |     DLADM_IBPART_FORCE_CREATE          |                              |
>>>  -----------------------------------------------------------------------
>>> |  InfiniBand Specific Link Properties                                  |
>>>  -----------------------------------------------------------------------
>>> |     linkmode                           | ON Consolidation Private     |
>>>  -----------------------------------------------------------------------
>>> |  libdlmgmt.h                                                          |
>>>  -----------------------------------------------------------------------
>>> |     DATALINK_CLASS_IBPART              | ON Consolidation Private     |
>>>  -----------------------------------------------------------------------
>>> |  libdladm.h                                                           |
>>>  -----------------------------------------------------------------------
>>> |     DLADM_STATUS_INVALID_PORT_INSTANCE |                              |
>>> |     DLADM_STATUS_PORT_IS_DOWN          |                              |
>>> |     DLADM_STATUS_PKEY_NOT_PRESENT      |                              |
>>> |     DLADM_STATUS_PARTITION_EXISTS      | ON Consolidation Private     |
>>> |     DLADM_STATUS_INVALID_PKEY          |                              |
>>> |     DLADM_STATUS_NO_HW_RESOURCE        |                              |
>>> |     DLADM_STATUS_INVALID_PKEY_TBL_SIZE |                              |
>>>  -----------------------------------------------------------------------
>>> |  IBTF extensions                                                      |
>>>  -----------------------------------------------------------------------
>>> |      ibt_get_part_attrs()              |                              |
>>> |      ibt_get_all_part_attrs()          | ON Consolidation Private     |
>>> |      ibt_free_part_attrs()             |                              |
>>> |      ibt_part_attr_t                   |                              |
>>>  -----------------------------------------------------------------------
>>> |   New status codes (ibt_status_t updates)                             |
>>>  -----------------------------------------------------------------------
>>> |      IBT_NO_SUCH_OBJECT                | ON Consolidation Private     |
>>>  -----------------------------------------------------------------------
>>>
>>> 4.3.7 References
>>> ---------------------
>>>
>>> PSARC 2001/289 IP over InfiniBand
>>> PSARC 2007/636 IPoIB Conversion to GLDv3
>>> PSARC 2009/593 IPoIB Connected Mode
>>>
>>> 6. Resources and Schedule
>>>      6.4. Steering Committee requested information
>>>         6.4.1. Consolidation C-team Name:
>>>         ON
>>>      6.5. ARC review type: FastTrack
>>>      6.6. ARC Exposure: open
>>>
>>
>>
>
>
