Hello Hal,

Thank you for the fast response. I will try to clarify some points.

>>  d) OpenMPI runs are executed with "--mca 
>> btl_openib_ib_path_record_service_level 1"
> 
> I'm not familiar with what DFSSSP does to figure out SLs exactly but
> there should be no need to set this. The proper SL for querying the SA
> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
> (and other QoS based routing algorithms), it calculates that and the SM
> pushes this into each port. That should be used. It's possible that SL1
> is not a valid SL for port <-> SA querying using DFSSSP.
The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify 
the SL used for querying the PathRecords; it only enables the functionality. 
The OMPI processes use PortInfo.SMSL to send the request.
For the "port -> SA" requests, every SL with 0<=SL<=7 was used in the test, and 
the SA received the requests.
> 
>>  e) kernel 2.6.32-220.13.1.el6.x86_64
>> 
>> As far as I understand the whole system:
>>  1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to 
>> the OpenSM
>>  2. the SA receives the request on QP1
> 
> There is the SL in the query itself. This should be the SMSL that the SM
> set for that port.
Hmm, there you might have a point. I think I saw that the query itself had SL=0 
specified.
In fact, OpenMPI sets everything to 0 except for slid and dlid.
> 
>>  3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a 
>> special service level for the slid/dlid path
> 
> This is a (potentially) different SL (for MPI<->MPI port communication)
> than the one the query used and is the one returned inside the
> PathRecord attribute/data.
Yes, it can be different, but DFSSSP sets the same SL, because the SM is 
running on a port which is also used for MPI comm.
> 
>>  4. SA sends the PathRecord back to the OMPI process via umad_send in 
>> libvendor/osm_vendor_ibumad.c
> 
> By the response reversibility rule, I think this is returned on the SL
> of the original query but haven't verified this in the code base yet.
Ok, I was not aware of that rule. But if this is true, then the SA should also 
be able to send via SL>0.
> 
>> The osm_vendor_send() function builds the MAD packet with the following 
>> attributes:
>>        /* GS classes */
>>        umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>                          p_mad_addr->addr_type.gsi.remote_qp,
>>                          p_mad_addr->addr_type.gsi.service_level,
>>                          IB_QP1_WELL_KNOWN_Q_KEY);
>> So, the SL is the same like the one which was used by the OMPI process. The 
>> Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is 
>> correct, too.
>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and 
>> this send does not work (except for SL=0).
> 
> By not working, what do you mean ? Do you mean it's not received at the
> requester with no message in the OpenSM log or not received at the
> OpenSM or something else ? It could be due to the wrong SL being used in
> the original request (forcing it to SL 1). That could cause it not to be
> received at the SM or the response not to make it back to the requester
> from the SA if the SL used is not "reversible".
By "not working" I mean that the MPI process does not receive any response 
from the SA.
I get messages from the MPI process like the following:
[rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] 
No response from SA after 20 retries
The OpenSM log shows that the SA received the PathRecord query, dumps the 
query into the log, and sends the reply back.
I think I also saw some messages in the log about "…1 outstanding MAD…".
> 
>> If I look into the MAD before it is send, then it looks like this:
>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, 
>> timeout_ms=0, retries=3)
>>    at src/umad.c:791
>> 791             if (umaddebug > 1)
>> (gdb) p *mad
>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, 
>> addr = {qpn = 1325427712, qkey = 384, 
>>    lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', 
>> gid_index = 0 '\000', 
>>    hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 
>> times>, flow_label = 0, 
>>    pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 
>> "\002"}
> 
> Is this the PathRecord query on the OpenMPI side or the response on the
> OpenSM side ? SL is 6 rather than 1 here.
This is the response on the OpenSM side (inside the umad_send function, right 
before it is written to the device with write(fd, …)).
SL=6 indicates that the MPI process sent the request on SL 6.
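A note on reading the other fields in that dump: as far as I understand 
libibumad, umad_set_addr_net() stores qpn, qkey and lid in network byte order, 
so gdb on a little-endian host (x86_64 here) prints them byte-swapped. The 
following is only a small sketch for decoding the printed values. The helper 
names swap32, swap16 and print_decoded_addr are made up for this mail and are 
not part of libibumad or OpenSM; the Q_Key constant is the numeric value behind 
IB_QP1_WELL_KNOWN_Q_KEY.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Numeric value behind IB_QP1_WELL_KNOWN_Q_KEY, in host byte order. */
#define QP1_WELL_KNOWN_Q_KEY 0x80010000u

/* Reverse the byte order of a 32-bit value (hypothetical helper). */
static inline uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Reverse the byte order of a 16-bit value (hypothetical helper). */
static inline uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Decode the raw integers as gdb printed them on a little-endian host.
 * Assumption: the fields were stored in network byte order.
 */
static inline void print_decoded_addr(uint32_t raw_qpn, uint32_t raw_qkey,
                                      uint16_t raw_lid)
{
    uint32_t qpn  = swap32(raw_qpn);
    uint32_t qkey = swap32(raw_qkey);
    uint16_t lid  = swap16(raw_lid);

    /* QPNs are 24 bits wide; anything larger suggests a wrong byte order. */
    assert((qpn >> 24) == 0);

    printf("qpn=0x%06x qkey=0x%08x lid=%u (QP1 well-known Q_Key: %s)\n",
           (unsigned)qpn, (unsigned)qkey, (unsigned)lid,
           qkey == QP1_WELL_KNOWN_Q_KEY ? "yes" : "no");
}
```

Calling print_decoded_addr(1325427712u, 384u, 4096) with the values from the 
dump gives qpn 0x6c004f, qkey 0x80010000 and lid 16. Under the byte-order 
assumption above, the Q_Key in the reply is therefore the well-known QP1 Q_Key, 
and only the SL differs from the case that works.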
> 
>> The output of OpenMPI or OpenSM's log file don't show any useful information 
>> for this problem, even with higher debug levels.
> 
> So nothing interesting logged relative to the PathRecord queries ?
In the OpenSM log, only that the request was received, what it looks like, and 
that the reply was sent back.
A few lines later, the log also shows a few "outstanding MADs".
> 
>> So, right now I'm stuck, and have no idea if there is an error in the kernel 
>> driver, the HCA firmware or something completely different. Or if umad_send 
>> basically does not support SL>0.
>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) 
>> call to 0.
> 
> So SL 0 works between all nodes and SA for querying/responses. Wonder if
> that's how SMSL is set by DFSSSP.
No, the SMSL set by DFSSSP is different from 0; I have checked this. In our 
case (OpenSM running on a compute node), it sets the same SL that is used for 
MPI<->MPI traffic, to ensure deadlock freedom.

Regards
Jens

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: [email protected]
--------------------------------