Hello Hal,
I have checked the smpquery and saquery commands today.
The smpquery SL2VL and PI commands for the opensm port work fine, and I get the
expected results:
======================================================
# SL2VL table: Lid 19
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
======================================================
# Port info: Lid 19 port 0
Mkey:............................<not displayed>
GidPrefix:.......................0xfe80000000000000
Lid:.............................19
SMLid:...........................19
CapMask:.........................0x251086a
IsSM
IsTrapSupported
IsAutomaticMigrationSupported
IsSLMappingSupported
IsSystemImageGUIDsupported
IsCommunicatonManagementSupported
IsVendorClassSupported
IsCapabilityMaskNoticeSupported
IsClientRegistrationSupported
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................1
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
NeighborMTU:.....................2048
SMSL:............................0
VLCap:...........................VL0-7
InitType:........................0x00
VLHighLimit:.....................0
VLArbHighCap:....................8
VLArbLowCap:.....................8
InitReply:.......................0x00
MtuCap:..........................2048
VLStallCount:....................0
HoqLife:.........................31
OperVLs:.........................VL0-7
PartEnforceInb:..................0
PartEnforceOutb:.................0
FilterRawInb:....................0
FilterRawOutb:...................0
MkeyViolations:..................0
PkeyViolations:..................0
QkeyViolations:..................0
GuidCap:.........................32
ClientReregister:................0
McastPkeyTrapSuppressionEnabled:.0
SubnetTimeout:...................18
RespTimeVal:.....................16
LocalPhysErr:....................8
OverrunErr:......................8
MaxCreditHint:...................0
RoundTrip:.......................0
CapabilityMask2:.................0x0000
LinkSpeedExtActive:..............No Extended Speed
LinkSpeedExtSupported:...........0
LinkSpeedExtEnabled:.............0
======================================================
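As a cross-check of the CapMask value above, here is a minimal Python sketch that decodes 0x251086a into flag names. The bit positions are my assumption, based on the PortInfo CapabilityMask layout in the IB spec, and only the flags printed by smpquery above are listed. The names decode_capmask and unknown_bits are made up for this illustration.

```python
# Sketch only: the bit positions are an assumption taken from the PortInfo
# CapabilityMask layout; only the flags printed in the output above are listed.
CAPMASK_FLAGS = {
    1: "IsSM",
    3: "IsTrapSupported",
    5: "IsAutomaticMigrationSupported",
    6: "IsSLMappingSupported",
    11: "IsSystemImageGUIDsupported",
    16: "IsCommunicatonManagementSupported",  # spelling copied from the tool output
    20: "IsVendorClassSupported",
    22: "IsCapabilityMaskNoticeSupported",
    25: "IsClientRegistrationSupported",
}


def decode_capmask(mask):
    """Return the names of all known flags set in mask, lowest bit first."""
    return [name for bit, name in sorted(CAPMASK_FLAGS.items()) if mask & (1 << bit)]


def unknown_bits(mask):
    """Return set bit positions that the table above does not cover."""
    return [bit for bit in range(32) if mask & (1 << bit) and bit not in CAPMASK_FLAGS]


if __name__ == "__main__":
    for flag in decode_capmask(0x251086A):
        print(flag)
    print("bits not covered by the table:", unknown_bits(0x251086A))
```

If unknown_bits returns an empty list, every set bit in the mask is accounted for by the flags that smpquery printed.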
The problem is the saquery commands on the other nodes.
In most cases the execution fails, and the node shows the same behaviour as
the OpenSM node when it tries to send on SL>0. The PathRequest packet does not
arrive at the node running OpenSM (checked with ibdump). At some point
during the execution the saquery binary hangs, the kernel log indicates
errors, and the only option is to reboot.
This is the output I see for the saquery:
======================================================
saquery -P --src-to-dst 4:8
ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
Query SA failed: Connection timed out
======================================================
(In very rare cases I get a reply to the PathRequest and see the dump, but the
saquery binary stalls afterwards, too.)
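For repeating this test over several LID pairs, a small wrapper with a hard timeout might be useful. The following is only a sketch: it uses the same saquery invocation as above, and the helper names build_saquery_cmd and run_with_timeout are invented for this mail. Note that a timeout cannot rescue a process that is stuck in uninterruptible sleep inside the kernel, so it would not avoid the reboot described above.

```python
import subprocess


def build_saquery_cmd(src_lid, dst_lid):
    """Build the PathRecord query used above: saquery -P --src-to-dst SRC:DST."""
    return ["saquery", "-P", "--src-to-dst", f"{src_lid}:{dst_lid}"]


def run_with_timeout(argv, timeout_s=30):
    """Run argv and return (status, output).

    status is one of "ok", "failed", "timeout", or "missing".
    A process blocked inside the kernel may still not be killable.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ("timeout", "")
    except FileNotFoundError:
        return ("missing", "")
    status = "ok" if proc.returncode == 0 else "failed"
    return (status, proc.stdout + proc.stderr)


if __name__ == "__main__":
    status, output = run_with_timeout(build_saquery_cmd(4, 8))
    print(status)
    print(output)
```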
I did some debugging with gdb again and stepped through the saquery code.
When I change the SL to 0 in the address vector of the MAD right before
umad_send is called, everything works.
So saquery on the compute nodes shows the same behaviour as opensm
with respect to the SL value for umad_send.
Finally, I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in
the opensm config file.
Sadly, this configuration results in the same failures of the saquery commands.
For the MinHop runs I also used a different SL2VL mapping, in which every SL
travels on VL 0, just to rule out a problem with VL>0:
======================================================
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
======================================================
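To compare the two SL2VL rows shown in this mail (the first one from the DFSSSP setup, the second one from the MinHop runs), a small parser such as the following could be used. It is a sketch that assumes the row layout printed by smpquery above; parse_sl2vl_row and vls_in_use are names invented here.

```python
def parse_sl2vl_row(line):
    """Parse a 'ports: in X, out Y: | 0| 1| ...|' row into a list of 16 VLs.

    The list index is the SL, the value is the VL that SL maps to.
    """
    fields = [f.strip() for f in line.split("|")[1:]]  # drop the 'ports:' label
    vls = [int(f) for f in fields if f]
    if len(vls) != 16:
        raise ValueError(f"expected 16 SL entries, got {len(vls)}")
    return vls


def vls_in_use(sl2vl):
    """Return the sorted set of VLs that at least one SL maps to."""
    return sorted(set(sl2vl))


DFSSSP_ROW = "ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|"
MINHOP_ROW = "ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|"

if __name__ == "__main__":
    dfsssp = parse_sl2vl_row(DFSSSP_ROW)
    minhop = parse_sl2vl_row(MINHOP_ROW)
    print("SL 6 -> VL", dfsssp[6], "(first table)")
    print("SL 6 -> VL", minhop[6], "(second table)")
    print("VLs in use:", vls_in_use(dfsssp), "vs", vls_in_use(minhop))
```

With the second table only VL 0 is in use, so a failure that still appears for SL>0 cannot be explained by the SL-to-VL mapping alone.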
Regards,
Jens
On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:
>
> On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
>
>> On 12/16/2012 8:39 AM, Jens Domke wrote:
>>> Hi,
>>>
>>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>>>
>>>> Hi,
>>>>
>>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>>> Hello Hal,
>>>>>
>>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>>> Hello Hal,
>>>>>>>
>>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>>> Hello Hal,
>>>>>>>>>
>>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>>>
>>>>>>>>>> Hi again,
>>>>>>>>>>
>>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>>> Hello Hal,
>>>>>>>>>>>
>>>>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>>>>
>>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca
>>>>>>>>>>>>> btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly
>>>>>>>>>>>> but
>>>>>>>>>>>> there should be no need to set this. The proper SL for querying
>>>>>>>>>>>> the SA
>>>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of
>>>>>>>>>>>> DFSSSP
>>>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and
>>>>>>>>>>>> the SM
>>>>>>>>>>>> pushes this into each port. That should be used. It's possible
>>>>>>>>>>>> that SL1
>>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does
>>>>>>>>>>> not specify the SL for querying the PathRecords.
>>>>>>>>>>> It just enables the functionality. And the ompi processes use the
>>>>>>>>>>> PortInfo.SMSL to send the request.
>>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test,
>>>>>>>>>>> and the SA received the requests.
>>>>>>>>>>>>
>>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>>>
>>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests
>>>>>>>>>>>>> (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>>>
>>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL that
>>>>>>>>>>>> the SM
>>>>>>>>>>>> set for that port.
>>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query
>>>>>>>>>>> itself had SL=0 specified.
>>>>>>>>>>> In fact OpenMPI sets everything to 0 except for slid and dlid.
>>>>>>>>>>>>
>>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or
>>>>>>>>>>>>> Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>>>
>>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port
>>>>>>>>>>>> communication)
>>>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the
>>>>>>>>>>> SM is running on a port which is also used for MPI comm.
>>>>>>>>>>
>>>>>>>>>> With DFSSSP are all SLs same from source port to get to any
>>>>>>>>>> destination ?
>>>>>>>>> No, not necessarily. In general DFSSSP does not enforce
>>>>>>>>> SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>>>>
>>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>>>> True. But i don't think that the SA asks the DFSSSP routing about the
>>>>>>> SL for the reversible path.
>>>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP
>>>>>>> would recommend another SL.
>>>>>>>
>>>>>>> I just read the IB Specs and it says, that "SL specified in the
>>>>>>> received packet is used as the SL in the response packet" for MAD
>>>>>>> packets.
>>>>>>> So, it's most likely that there is a mismatch between the way OMPI sets
>>>>>>> up the PathRequest and the way the SA builds the
>>>>>>> response packet.
>>>>>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest
>>>>>>> packet,
>>>>>>
>>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>>>>> SubnAdmGet of PathRecord ?
>>>>>
>>>>> No, the CompMask did not have the SL bit set, and the SL was set to 0.
>>>>
>>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>>> valid one in the response.
>>> Ok.
>>>>
>>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only
>>>>> reference I found was in osm_sa_path_record.c
>>>>> The SA just treats the SL in the PathRequest as a "I would like to use
>>>>> this SL" in case the SL bit is set.
>>>>> But the routing engine can overwrite the requested SL before the reply is
>>>>> sent.
>>>>>
>>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit
>>>>> in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a
>>>>> == SL_b.
>>>>> Sadly, the reply sent by the SA does not leave the node (for SL_b>0).
>>>>> Only if I change the SL to 0 in the MAD right before umad_send is called
>>>>> by the SA is the packet able to leave the node and reach the OMPI
>>>>> process.
>>>>
>>>> Are you sure the response doesn't leave the SA node or it's not received
>>>> at the requester (OMPI node) ?
>>> No, I'm not sure. Is there any possibility to check that? As far as I know,
>>> ibdump does not show MAD packets which leave a port, it only shows the
>>> packets when they are received on the other end.
>>>>
>>>>>
>>>>>>
>>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>>>
>>>>>> Good.
>>>>>>
>>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for
>>>>>>> the response.
>>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>>>
>>>>>> Depends. It may be that both SLs work but maybe not.
>>>>>>
>>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI:
>>>>>>> it does not specify the SL within the PathRequest in an appropriate
>>>>>>> way (which would be an SL suggested by DFSSSP for the reversible path).
>>>>>>> The second bug is that the SA uses the SL on which the PathRequest
>>>>>>> packet was sent, and not the SL specified within the packet.
>>>>>>> What do you think?
>>>>>>
>>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>>> scenario that would fail with the query you are making if there's no SL
>>>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>>> OMPI node so it's not even getting that far.
>>>>>
>>>>> Yes, exactly. So, do you have an idea why the response hangs in the SA
>>>>> node?
>>>>> I have no insight into the underlying layers (kernel driver and firmware).
>>>>> Maybe there are some implementations which prevent the SA from sending
>>>>> MADs back on SL>0?
>>>>
>>>> If you're sure this response doesn't get out of the SA node, please
>>>> contact Mellanox support with the details.
>>> Ok, I can do this, if it turns out to be true.
>>>>
>>>>>>
>>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it
>>>>>>> matches addr_type.gsi.service_level.
>>>>>>> Maybe, with this change the packets of the SA will reach the OMPI
>>>>>>> process on a SL>0.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send
>>>>>>>>>>>>> in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>>>
>>>>>>>>>>>> By the response reversibility rule, I think this is returned on
>>>>>>>>>>>> the SL
>>>>>>>>>>>> of the original query but haven't verified this in the code base
>>>>>>>>>>>> yet.
>>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA
>>>>>>>>>>> should also be able to send via SL>0.
>>>>>>>>>>
>>>>>>>>>> I double checked and indeed the SA response does use the SL that the
>>>>>>>>>> incoming request was received on.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the
>>>>>>>>>>>>> following attributes:
>>>>>>>>>>>>> /* GS classes */
>>>>>>>>>>>>> umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>> p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>> p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>> IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI
>>>>>>>>>>>>> process. The Q_Key matches the Q_key on the OMPI process, and
>>>>>>>>>>>>> remote_qp and dest_lid is correct, too.
>>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the
>>>>>>>>>>>>> PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>>>
>>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received
>>>>>>>>>>>> at the
>>>>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being
>>>>>>>>>>>> used in
>>>>>>>>>>>> the original request (forcing it to SL 1). That could cause it not
>>>>>>>>>>>> to be
>>>>>>>>>>>> received at the SM or the response not to make it back to the
>>>>>>>>>>>> requester
>>>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>>>> By "not working" I mean, that the MPI process does not receive any
>>>>>>>>>>> response from the SA.
>>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
>>>>>>>>>>> No response from SA after 20 retries
>>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query,
>>>>>>>>>>> dumps the query into the log, and sends the reply back.
>>>>>>>>>>> And I think I saw some messages in the log about "…1 outstanding
>>>>>>>>>>> MAD…".
>>>>>>>>>>>>
>>>>>>>>>>>>> If I look into the MAD before it is sent, then it looks like this:
>>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530,
>>>>>>>>>>>>> length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>>>> 791 if (umaddebug > 1)
>>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3,
>>>>>>>>>>>>> length = 0, addr = {qpn = 1325427712, qkey = 384,
>>>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0
>>>>>>>>>>>>> '\000', gid_index = 0 '\000',
>>>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000'
>>>>>>>>>>>>> <repeats 15 times>, flow_label = 0,
>>>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data =
>>>>>>>>>>>>> 0x7fffe8012530 "\002"}
>>>>>>>>>>>>
>>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response
>>>>>>>>>>>> on the
>>>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send
>>>>>>>>>>> function, right before it is written to the device with write(fd,
>>>>>>>>>>> …).
>>>>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL
>>>>>>>>>>> 6.
>>>>>>>>>>
>>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>>> Yes, it was SL 6.
>>>>>>>>> Here is a content of a similar packet which was received by the SA. I
>>>>>>>>> have used ibdump on the port where the OpenSM was running:
>>>>>>>>> ======================================================================================
>>>>>>>>> No. Time Source Destination
>>>>>>>>> Protocol Length Info
>>>>>>>>> 785 14.352168 LID: 384 LID: 4140
>>>>>>>>> InfiniBand 290 UD Send Only SubnAdmGet(PathRecord)
>>>>>>>>>
>>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320
>>>>>>>>> bits)
>>>>>>>>> Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>>> Epoch Time: 1355389784.437633332 seconds
>>>>>>>>> [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>>> [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>>> [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>>> Frame Number: 785
>>>>>>>>> Frame Length: 290 bytes (2320 bits)
>>>>>>>>> Capture Length: 290 bytes (2320 bits)
>>>>>>>>> [Frame is marked: False]
>>>>>>>>> [Frame is ignored: False]
>>>>>>>>> [Protocols in frame: erf:infiniband]
>>>>>>>>> Extensible Record Format
>>>>>>>>> [ERF Header]
>>>>>>>>> Timestamp: 0x50c99b587008bcf2
>>>>>>>>> [Header type]
>>>>>>>>> .001 0101 = type: INFINIBAND (21)
>>>>>>>>> 0... .... = Extension header present: 0
>>>>>>>>> 0000 0100 = flags: 4
>>>>>>>>> .... ..00 = capture interface: 0
>>>>>>>>> .... .1.. = varying record length: 1
>>>>>>>>> .... 0... = truncated: 0
>>>>>>>>> ...0 .... = rx error: 0
>>>>>>>>> ..0. .... = ds error: 0
>>>>>>>>> 00.. .... = reserved: 0
>>>>>>>>> record length: 306
>>>>>>>>> loss counter: 0
>>>>>>>>> wire length: 290
>>>>>>>>> InfiniBand
>>>>>>>>> Local Route Header
>>>>>>>>> 0110 .... = Virtual Lane: 0x06
>>>>>>>>> .... 0000 = Link Version: 0
>>>>>>>>> 0110 .... = Service Level: 6
>>>>>>>>> .... 00.. = Reserved (2 bits): 0
>>>>>>>>> .... ..10 = Link Next Header: 0x02
>>>>>>>>> Destination Local ID: 19
>>>>>>>>> 0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>> .... .000 0100 1000 = Packet Length: 72
>>>>>>>>> Source Local ID: 16
>>>>>>>>> Base Transport Header
>>>>>>>>> Opcode: 100
>>>>>>>>> 1... .... = Solicited Event: True
>>>>>>>>> .1.. .... = MigReq: True
>>>>>>>>> ..00 .... = Pad Count: 0
>>>>>>>>> .... 0000 = Header Version: 0
>>>>>>>>> Partition Key: 65535
>>>>>>>>> Reserved (8 bits): 0
>>>>>>>>> Destination Queue Pair: 0x000001
>>>>>>>>> 0... .... = Acknowledge Request: False
>>>>>>>>> .000 0000 = Reserved (7 bits): 0
>>>>>>>>> Packet Sequence Number: 0
>>>>>>>>> DETH - Datagram Extended Transport Header
>>>>>>>>> Queue Key: 2147549184
>>>>>>>>> Reserved (8 bits): 0
>>>>>>>>> Source Queue Pair: 0x00380050
>>>>>>>>> MAD Header - Common Management Datagram
>>>>>>>>> Base Version: 0x01
>>>>>>>>> Management Class: 0x03
>>>>>>>>> Class Version: 0x02
>>>>>>>>> Method: Get() (0x01)
>>>>>>>>> Status: 0x0000
>>>>>>>>> Class Specific: 0x0000
>>>>>>>>> Transaction ID: 0x0010000f38005000
>>>>>>>>> Attribute ID: 0x0035
>>>>>>>>> Reserved: 0x0000
>>>>>>>>> Attribute Modifier: 0x00000000
>>>>>>>>> MAD Data Payload:
>>>>>>>>> 000000000000000000000000000000000000000000000000...
>>>>>>>>> Illegal RMPP Type (0)!
>>>>>>>>> RMPP Type: 0x00
>>>>>>>>> RMPP Type: 0x00
>>>>>>>>> 0000 .... = R Resp Time: 0x00
>>>>>>>>> .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>> RMPP Status: (Normal) (0x00)
>>>>>>>>> RMPP Data 1: 0x00000000
>>>>>>>>> RMPP Data 2: 0x00000000
>>>>>>>>> SMASubnAdmGet(PathRecord)
>>>>>>>>> SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>> Attribute Offset: 0x0000
>>>>>>>>> Reserved: 0x0000
>>>>>>>>> Component Mask: 0x0000003000000000
>>>>>>>>> Attribute (PathRecord)
>>>>>>>>> PathRecord
>>>>>>>>> DGID: :: (::)
>>>>>>>>> SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>> DLID: 0x0000
>>>>>>>>> SLID: 0x0000
>>>>>>>>> 0... .... = RawTraffic: 0x00
>>>>>>>>> .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>> HopLimit: 0x00
>>>>>>>>> TClass: 0x00
>>>>>>>>> 0... .... = Reversible: 0x00
>>>>>>>>> .000 0000 = NumbPath: 0x00
>>>>>>>>> P_Key: 0x0000
>>>>>>>>> .... .... .... 0000 = SL: 0x0000
>>>>>>>>> 00.. .... = MTUSelector: 0x00
>>>>>>>>> ..00 0000 = MTU: 0x00
>>>>>>>>> 00.. .... = RateSelector: 0x00
>>>>>>>>> ..00 0000 = Rate: 0x00
>>>>>>>>> 00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>> ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>> Preference: 0x00
>>>>>>>>> Variant CRC: 0xad4e
>>>>>>>>> ======================================================================================
>>>>>>>>
>>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>>>>> out that machine and the issue is internal to that machine. It could be
>>>>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>>>>> completions. That's based on your original email earlier this AM.
>>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI
>>>>>>> side and the SA uses a SL>0.
>>>>>>
>>>>>> Can ibdump be used to capture output on the SM port ?
>>>>>
>>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>>> But I have started ibdump before opensm, maybe that makes a difference,
>>>>> not sure.
>>>>>
>>>>> Regards,
>>>>> Jens
>>>>>
>>>>> PS: I have seen a small bug. Not sure if it's a bug in wireshark or
>>>>> ibdump, but the response received by the OMPI node isn't shown correctly.
>>>>> The PathRecord contains an offset which is either missing in the dump or
>>>>> is not treated correctly by wireshark. Either way, it causes wireshark to
>>>>> show the PathRecord data with wrong values.
>>>>> Maybe you could redirect this to the developer of ibdump, so that he can
>>>>> check/fix it.
>>>>
>>>> Are you referring to the fields after the SA AttributeOffset or
>>>> something else ?
>>> Yes, after the SMASubnAdmGet Attribute Offset. Here an example:
>>> I get on the OMPI side:
>>> SMASubnAdmGetResp(PathRecord)
>>> SM_Key (Verification Key): 0x0000000000000000
>>> Attribute Offset: 0x0008
>>> Reserved: 0x0000
>>> Component Mask: 0x0000803000000000
>>> Attribute (PathRecord)
>>> PathRecord
>>> DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>> SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>> DLID: 0x0000
>>> SLID: 0x0000
>>> 0... .... = RawTraffic: 0x00
>>> .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>> HopLimit: 0xff
>>> TClass: 0x00
>>> 0... .... = Reversible: 0x00
>>> .000 0011 = NumbPath: 0x03
>>> P_Key: 0x8486
>>> .... .... .... 0000 = SL: 0x0000
>>> 00.. .... = MTUSelector: 0x00
>>> ..00 0000 = MTU: 0x00
>>> 00.. .... = RateSelector: 0x00
>>> ..00 0000 = Rate: 0x00
>>> 00.. .... = PacketLifeTimeSelector: 0x00
>>> ..00 0000 = PacketLifeTime: 0x00
>>> Preference: 0x00
>>>
>>> But it should show (see the difference in SLID, DLID, SL which are now
>>> correct):
>>> SMASubnAdmGetResp(PathRecord)
>>> SM_Key (Verification Key): 0x0000000000000000
>>> Attribute Offset: 0x0008
>>> Reserved: 0x0000
>>> Component Mask: 0x0000803000000000
>>> Attribute (PathRecord)
>>> PathRecord
>>> DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>> SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>> DLID: 0x0004
>>> SLID: 0x0008
>>> 0... .... = RawTraffic: 0x00
>>> .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>> HopLimit: 0x00
>>> TClass: 0x00
>>> 1... .... = Reversible: 0x01
>>> .000 0000 = NumbPath: 0x00
>>> P_Key: 0xffff
>>> .... .... .... 0011 = SL: 0x0003
>>> 10.. .... = MTUSelector: 0x02
>>> ..00 0100 = MTU: 0x04
>>> 10.. .... = RateSelector: 0x02
>>> ..00 0110 = Rate: 0x06
>>> 10.. .... = PacketLifeTimeSelector: 0x02
>>> ..01 0010 = PacketLifeTime: 0x12
>>> Preference: 0x00
>>
>>
>> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
>> look right to me (no subnet prefix fe80:: in front of GUID).
>
> Yes, I made a small mistake with the hexeditor. I started the shift after the
> subnet prefix.
> Sorry for the confusion.
>
> Thank you for the hint with smpquery and saquery, I will check that tomorrow.
>
> Jens
>
>>
>> -- Hal
>>
>>>
>>> Regards,
>>> Jens
>>>
>>>>
>>>> -- Hal
>>>>
>>>>>>
>>>>>> -- Hal
>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>>> working (not dropping) aside from whether it's really the correct SL
>>>>>>>>>> to use.
>>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>>> SL: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
>>>>>>>>> 11 | 12 | 13 | 14 | 15 |
>>>>>>>>> VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2
>>>>>>>>> |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>>>> But this is also as expected, because I have set the QoS in the
>>>>>>>>> opensm config as follows:
>>>>>>>>> qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>>> This was set for "default", "CA" and "Switch external ports". I have
>>>>>>>>> not touched the config for "Switch Port 0" and "Router ports", they
>>>>>>>>> remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>>>
>>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>>>>
>>>>>>> Regards
>>>>>>> Jens
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> -- Hal
>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Jens
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -- Hal
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful
>>>>>>>>>>>>> information for this problem, even with higher debug levels.
>>>>>>>>>>>>
>>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>>>>> In the OpenSM log, only that it was received, what the request looks
>>>>>>>>>>> like, and that it was sent back.
>>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>>>
>>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in
>>>>>>>>>>>>> the kernel driver, the HCA firmware or something completely
>>>>>>>>>>>>> different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>>> A workaround for the moment is to set the SL in the
>>>>>>>>>>>>> umad_set_addr_net(...) call to 0.
>>>>>>>>>>>>
>>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses.
>>>>>>>>>>>> Wonder if
>>>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked
>>>>>>>>>>> this. In our case (OpenSM running on a compute node), it sets the
>>>>>>>>>>> same SL, which is used for MPI<->MPI traffic, to ensure deadlock
>>>>>>>>>>> freedom.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Jens
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku,
>>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>>> E-Mail: [email protected]
>>>>>>>>>>> --------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>> linux-rdma" in
>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku,
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: [email protected]
--------------------------------