Bert,

[email protected] wrote:
Hi Bert,

Most of these messages indicate that you have unstable links in your system. But there is one message that can indicate that you've hit a newly discovered SM bug:

__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)

This message is probably also related to the unstable links (or nodes).
Some port didn't answer a query from the SM (see below), so SM warns
that there is a port that is physically not down, but the other side
of the link couldn't be probed.

If you do have NEM switches in your system, then you are exposed to this bug.
I hit it quite easily.

Yevgeny Kliteynik posted a patch for this bug just a few minutes after you sent your email. (If you are interested, look for the email thread
"create physp for the newly discovered port of the known node".)

Of course, using the patch wouldn't hurt :)

Line

On 02/17/09 01:23 PM, Wiegers, Bert wrote:
Hi,

We are using OFED 1.4 w/ OpenSM 3.2.5_20081207 with a switch from
SUN.
As we are debugging our system, I'm trying to understand the
opensm.log files.
(Where can I find any documentation on them?)


We see frequent messages as follows:

Feb 17 10:25:34 134964 [41802940] 0x01 ->
__osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
(Link state change) Producer:2 (Switch) from LID:111
TID:0x000000000000006e
Feb 17 10:25:34 169578 [41802940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:1 num:128 (Link state change) from LID:111
GID:fe80::14:4fa4:cff8:50

Generic notice num. 128 (trap 128) is issued by the switch (LID 111) because
it detected a port state change on one of its ports. This could be because of
an unstable link, or it could be something else. The SM logs that it got this
trap from the switch.
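
To see which switches report these traps most often, the trap 128 entries can be tallied per source LID. Below is a minimal Python sketch, not an OpenSM tool. It assumes each entry occupies a single line in the actual opensm.log (the entries quoted in this mail were wrapped by the mail client), and the function name and regular expression are my own illustration.

```python
import re
from collections import Counter

# Matches the SM's "Received Generic Notice" entry for trap 128 and
# captures the LID of the switch that sent it. The companion
# "Reporting Generic Notice" entry is deliberately not matched, so each
# trap is counted once.
TRAP_128 = re.compile(
    r"Received Generic Notice type:1 num:128 \(Link state change\).*?from LID:(\d+)"
)

def count_link_state_traps(lines):
    """Return a Counter mapping source LID -> number of trap 128 entries."""
    counts = Counter()
    for line in lines:
        match = TRAP_128.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open log file to count_link_state_traps() and printing the result of .most_common() should point at the switches worth checking first.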


Feb 17 10:25:39 088014 [43806940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:3 num:65 (GID out of service) from LID:336
GID:fe80::3:ba00:100:3341

SM can't find some port any more, so it informs the fabric that
this GID is "out of service" by sending notice num. 65.

Feb 17 10:25:39 088030 [43806940] 0x02 -> __osm_drop_mgr_remove_port:
Removed port with GUID:0x00144fa4cff8000d LID range [1047, 1047] of
node:MT25408 ConnectX Mellanox Technologies

LID 1047 is no longer reachable and has been removed from the SM's DB.

Feb 17 10:25:39 614565 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
tables configured on all switches
Feb 17 10:25:44 013836 [43806940] 0x02 -> SUBNET UP
Feb 17 10:25:46 662611 [41802940] 0x01 ->
__osm_trap_rcv_process_request: Received Generic Notice type:1 num:128
(Link state change) Producer:2 (Switch) from LID:111
TID:0x000000000000006f
Feb 17 10:25:46 662703 [41802940] 0x02 -> osm_report_notice: Reporting
Generic Notice type:1 num:128 (Link state change) from LID:111
GID:fe80::14:4fa4:cff8:50
Feb 17 10:25:48 097096 [43806940] 0x02 -> osm_ucast_mgr_process: minhop
tables configured on all switches
Feb 17 10:25:52 476653 [44007940] 0x01 ->
__osm_sm_mad_ctrl_rcv_callback: ERR 3111: Error status = 0x1C00
Feb 17 10:25:52 476729 [44007940] 0x01 -> SMP dump:
                                base_ver................0x1
                                mgmt_class..............0x81
                                class_ver...............0x1
                                method..................0x81 (SubnGetResp)
                                D bit...................0x1
                                status..................0x1C00
                                hop_ptr.................0x0
                                hop_count...............0x4
                                trans_id................0x18c08de
                                attr_id.................0x15 (PortInfo)
                                resv....................0x0
                                attr_mod................0x6
                                m_key...................0x0000000000000000
                                dr_slid.................65535
                                dr_dlid.................65535

                                Initial path: 0,1,10,15,23
                                Return path:  0,23,20,12,17
                                Reserved:     [0][0][0][0][0][0][0]

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 11 03 03 02

                                34 52 00 23 40 40 00 08   08 04 F0 4C 00 00 00 00

                                00 00 00 00 00 88 00 00   00 00 00 00 00 00 00 00




Other issues I see with messages similar to the following ones:

__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for
node 0x00144fa4d3860050(MT47396 Infiniscale-III Mellanox Technologies)
po

__osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
(IB_TIMEOUT)

The above two messages are related. The IB_TIMEOUT says that some MAD
was sent, but no response was received. This, in turn, would cause the
"unknown remote side" message.

Bottom line - there might be unstable ports/links in the fabric.
Check all the links that are reported by the SM as having an unknown
remote side.
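
To get the list of nodes to check, the ERR 3315 entries can be collected from the log. Again, this is only a rough Python sketch under the same assumption of one log entry per line in the actual opensm.log. The function name and regular expression are illustrative, and only the node GUID and description are extracted, since that is all the message shown above contains.

```python
import re
from collections import Counter

# Captures the node GUID and the node description that follows it in
# parentheses, as they appear in the ERR 3315 message.
UNKNOWN_REMOTE = re.compile(
    r"ERR 3315: Unknown remote side for node (0x[0-9a-fA-F]+)\(([^)]*)\)"
)

def nodes_with_unknown_remote(lines):
    """Return a Counter mapping (guid, description) -> number of ERR 3315 entries."""
    counts = Counter()
    for line in lines:
        match = UNKNOWN_REMOTE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Nodes that show up repeatedly in the result are the ones whose links deserve a physical check first.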

-- Yevgeny

osm_vendor_send: ERR 5430: Send p_madw = 0x116d320 of size 256 failed -5
(Invalid argument)

I'm still googling, but hopefully someone can give me some answers.



Thanks and best regards
Bert

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
