Hello Hal,

On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:

> Hi again,
> 
> On 12/14/2012 10:17 AM, Jens Domke wrote:
>> Hello Hal,
>> 
>> thank you for the fast response. I will try to clarify some points.
>> 
>>>> d) OpenMPI runs are executed with "--mca 
>>>> btl_openib_ib_path_record_service_level 1"
>>> 
>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>> there should be no need to set this. The proper SL for querying the SA
>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>> (and other QoS based routing algorithms), it calculates that and the SM
>>> pushes this into each port. That should be used. It's possible that SL1
>>> is not a valid SL for port <-> SA querying using DFSSSP.
>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not 
>> specify the SL for querying the PathRecords.
>> It just enables the functionality. And the ompi processes use the 
>> PortInfo.SMSL to send the request.
>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA 
>> received the requests.  
>>> 
>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>> 
>>>> As far as I understand the whole system:
>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to 
>>>> the OpenSM
>>>> 2. the SA receives the request on QP1
>>> 
>>> There is the SL in the query itself. This should be the SMSL that the SM
>>> set for that port.
>> Hmm, there you might have a point. I think I saw that the query itself had 
>> SL=0 specified.
>> In fact OpenMPI sets everything to 0 except for slid and dlid.
>>> 
>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a 
>>>> special service level for the slid/dlid path
>>> 
>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>> than the one the query used and is the one returned inside the
>>> PathRecord attribute/data.
>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is 
>> running on a port which is also used for MPI comm.
> 
> With DFSSSP are all SLs same from source port to get to any destination ?
No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == 
SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
> 
>>> 
>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in 
>>>> libvendor/osm_vendor_ibumad.c
>>> 
>>> By the response reversibility rule, I think this is returned on the SL
>>> of the original query but haven't verified this in the code base yet.
>> Ok, I was not aware of that rule. But if this is true, then the SA should 
>> also be able to send via SL>0.
> 
> I double-checked and indeed the SA response does use the SL that the
> incoming request was received on.
> 
>>> 
>>>> The osm_vendor_send() function builds the MAD packet with the following 
>>>> attributes:
>>>>       /* GS classes */
>>>>       umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>                         p_mad_addr->addr_type.gsi.remote_qp,
>>>>                         p_mad_addr->addr_type.gsi.service_level,
>>>>                         IB_QP1_WELL_KNOWN_Q_KEY);
>>>> So the SL is the same as the one used by the OMPI process. 
>>>> The Q_Key matches the Q_Key on the OMPI process, and remote_qp and 
>>>> dest_lid are correct, too.
>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and 
>>>> this send does not work (except for SL=0).
>>> 
>>> By not working, what do you mean ? Do you mean it's not received at the
>>> requester with no message in the OpenSM log or not received at the
>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>> the original request (forcing it to SL 1). That could cause it not to be
>>> received at the SM or the response not to make it back to the requester
>>> from the SA if the SL used is not "reversible".
>> By "not working" I mean, that the MPI process does not receive any response 
>> from the SA.
>> I get messages from the MPI process like the following:
>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
>>  No response from SA after 20 retries
>> The log of OpenSM shows that the SA received the PathRequest query, dumps 
>> the query into the log, and sends the reply back.
>> And I think I saw some messages in the log about "…1 outstanding MAD…".
>>> 
>>>> If I look into the MAD before it is sent, then it looks like this:
>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, 
>>>> timeout_ms=0, retries=3)
>>>>   at src/umad.c:791
>>>> 791             if (umaddebug > 1)
>>>> (gdb) p *mad
>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, 
>>>> addr = {qpn = 1325427712, qkey = 384, 
>>>>   lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', 
>>>> gid_index = 0 '\000', 
>>>>   hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 
>>>> times>, flow_label = 0, 
>>>>   pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 
>>>> 0x7fffe8012530 "\002"}
>>> 
>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>> OpenSM side ? SL is 6 rather than 1 here.
>> This is the response on the OpenSM side (inside the umad_send function, 
>> right before it is written to the device with write(fd, …)).
>> SL=6 indicates that the MPI process was sending the request on SL 6.
> 
> What is SMSL for the requester ? Was it SL 6 ?
Yes, it was SL 6.
Here is the content of a similar packet that was received by the SA. I captured 
it with ibdump on the port where OpenSM was running:
======================================================================================
No.     Time        Source                Destination           Protocol   Length Info
    785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)

Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
    Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
    Epoch Time: 1355389784.437633332 seconds
    [Time delta from previous captured frame: 4.332020528 seconds]
    [Time delta from previous displayed frame: 4.332020528 seconds]
    [Time since reference or first frame: 14.352168681 seconds]
    Frame Number: 785
    Frame Length: 290 bytes (2320 bits)
    Capture Length: 290 bytes (2320 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: erf:infiniband]
Extensible Record Format
    [ERF Header]
        Timestamp: 0x50c99b587008bcf2
        [Header type]
            .001 0101 = type: INFINIBAND (21)
            0... .... = Extension header present: 0
        0000 0100 = flags: 4
            .... ..00 = capture interface: 0
            .... .1.. = varying record length: 1
            .... 0... = truncated: 0
            ...0 .... = rx error: 0
            ..0. .... = ds error: 0
            00.. .... = reserved: 0
        record length: 306
        loss counter: 0
        wire length: 290
InfiniBand
    Local Route Header
        0110 .... = Virtual Lane: 0x06
        .... 0000 = Link Version: 0
        0110 .... = Service Level: 6
        .... 00.. = Reserved (2 bits): 0
        .... ..10 = Link Next Header: 0x02
        Destination Local ID: 19
        0000 0... .... .... = Reserved (5 bits): 0
        .... .000 0100 1000 = Packet Length: 72
        Source Local ID: 16
    Base Transport Header
        Opcode: 100
        1... .... = Solicited Event: True
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved (8 bits): 0
        Destination Queue Pair: 0x000001
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    DETH - Datagram Extended Transport Header
        Queue Key: 2147549184
        Reserved (8 bits): 0
        Source Queue Pair: 0x00380050
    MAD Header - Common Management Datagram
        Base Version: 0x01
        Management Class: 0x03
        Class Version: 0x02
        Method: Get() (0x01)
        Status: 0x0000
        Class Specific: 0x0000
        Transaction ID: 0x0010000f38005000
        Attribute ID: 0x0035
        Reserved: 0x0000
        Attribute Modifier: 0x00000000
        MAD Data Payload: 000000000000000000000000000000000000000000000000...
     Illegal RMPP Type (0)! 
        RMPP Type: 0x00
        RMPP Type: 0x00
        0000 .... = R Resp Time: 0x00
        .... 0000 = RMPP Flags: Unknown (0x00)
        RMPP Status:  (Normal) (0x00)
        RMPP Data 1: 0x00000000
        RMPP Data 2: 0x00000000
    SMASubnAdmGet(PathRecord)
        SM_Key (Verification Key): 0x0000000000000000
        Attribute Offset: 0x0000
        Reserved: 0x0000
        Component Mask: 0x0000003000000000
        Attribute (PathRecord)
            PathRecord
                DGID: :: (::)
                SGID: ::0.15.0.16 (::0.15.0.16)
                DLID: 0x0000
                SLID: 0x0000
                0... .... = RawTraffic: 0x00
                .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
                HopLimit: 0x00
                TClass: 0x00
                0... .... = Reversible: 0x00
                .000 0000 = NumbPath: 0x00
                P_Key: 0x0000
                .... .... .... 0000 = SL: 0x0000
                00.. .... = MTUSelector: 0x00
                ..00 0000 = MTU: 0x00
                00.. .... = RateSelector: 0x00
                ..00 0000 = Rate: 0x00
                00.. .... = PacketLifeTimeSelector: 0x00
                ..00 0000 = PacketLifeTime: 0x00
                Preference: 0x00
    Variant CRC: 0xad4e
======================================================================================
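A note for comparing this capture with the gdb dump of the reply quoted 
earlier in this thread: the qpn, qkey and lid fields of the umad address are 
kept in network byte order, so gdb on this little-endian x86_64 host prints 
them byte-swapped. Below is a minimal sketch in Python that converts the 
printed values back. The helper names are made up for illustration and are 
not part of libibumad.

```python
import struct


def be16_from_host(value):
    """Reinterpret a 16-bit value printed on a little-endian host as big-endian."""
    return struct.unpack(">H", struct.pack("<H", value))[0]


def be32_from_host(value):
    """Reinterpret a 32-bit value printed on a little-endian host as big-endian."""
    return struct.unpack(">I", struct.pack("<I", value))[0]


# Values exactly as gdb printed them in the umad_send() breakpoint.
gdb_addr = {"qpn": 1325427712, "qkey": 384, "lid": 4096}

decoded = {
    "qpn": be32_from_host(gdb_addr["qpn"]),
    "qkey": be32_from_host(gdb_addr["qkey"]),
    "lid": be16_from_host(gdb_addr["lid"]),
}

for name, value in decoded.items():
    print(f"{name:5s} = {value} (0x{value:x})")
```

Decoded this way, qkey becomes 0x80010000 (2147549184), which is the Queue 
Key shown in the capture, and lid becomes 16, which is the Source Local ID of 
the request and therefore the expected destination of the reply. The qpn 
decodes to 0x6c004f; it need not match the Source Queue Pair in the capture, 
since that is only a similar packet.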
> 
> One would need to walk the SLToVLMappingTables from requester (OMPI
> port) to SA and back to see whether SL6 would even have a chance of
> working (not dropping) aside from whether it's really the correct SL to use.
All SL2VL tables look the same. I checked the output of OpenSM.
        SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
        VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
This is as expected, because I set the QoS in the OpenSM config as follows:
        qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
This was set for "default", "CA" and "Switch external ports". I have not 
touched the config for "Switch Port 0" and "Router ports"; they remain at: 
qos_[sw0 | rtr]_sl2vl (null)
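
To make the mapping easier to check, here is a minimal Python sketch that 
parses a qos_sl2vl string like the one above and lists the SLs that would be 
dropped for a given number of operational data VLs. The function names and 
the data_vls parameter are hypothetical, and the drop rule (VL15, or a VL 
that is not operational on the port) is applied here as an assumption about 
the ports involved, not as something read from OpenSM.

```python
def parse_sl2vl(value):
    """Parse a comma-separated qos_sl2vl string into a 16-entry SL -> VL list."""
    vls = [int(tok) for tok in value.split(",")]
    if len(vls) != 16:
        raise ValueError(f"expected 16 entries, got {len(vls)}")
    return vls


def dropped_sls(sl2vl, data_vls):
    """Return the SLs mapped to VL15 or to a VL outside the assumed data VLs."""
    return [sl for sl, vl in enumerate(sl2vl) if vl == 15 or vl >= data_vls]


sl2vl = parse_sl2vl("0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7")

print("SL 6 -> VL", sl2vl[6])
print("dropped with 8 data VLs:", dropped_sls(sl2vl, data_vls=8))
print("dropped with 4 data VLs:", dropped_sls(sl2vl, data_vls=4))
```

With 8 data VLs nothing is dropped and SL 6 maps to VL 6. If a port on the 
path only had 4 operational data VLs, the same table would put SL 6 on a VL 
that is not available there, so the operational VL count per hop is worth 
checking.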

Regards
Jens

> 
> -- Hal
> 
>>> 
>>>> The output of OpenMPI or OpenSM's log file don't show any useful 
>>>> information for this problem, even with higher debug levels.
>>> 
>>> So nothing interesting logged relative to the PathRecord queries ?
>> In the OpenSM log, only that it was received, what the request looks like, 
>> and that it was sent back.
>> And a few "outstanding MADs" a few lines later in the log.
>>> 
>>>> So, right now I'm stuck, and have no idea if there is an error in the 
>>>> kernel driver, the HCA firmware or something completely different. Or if 
>>>> umad_send basically does not support SL>0.
>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) 
>>>> call to 0.
>>> 
>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>> that's how SMSL is set by DFSSSP.
>> No, the SMSL set by DFSSSP is different from 0; I have checked this. In our 
>> case (OpenSM running on a compute node), it sets the same SL that is used 
>> for MPI<->MPI traffic, to ensure deadlock freedom.
>> 
>> Regards
>> Jens
>> 
>> --------------------------------
>> Dipl.-Math. Jens Domke
>> Researcher - Tokyo Institute of Technology
>> Satoshi MATSUOKA Laboratory
>> Global Scientific Information and Computing Center
>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>> Tokyo, 152-8550, JAPAN
>> Tel/Fax: +81-3-5734-3876
>> E-Mail: [email protected]
>> --------------------------------
>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

