Hello Hal,

Below...

Hal Rosenstock wrote:


On 8/24/09, *Rafael David Tinoco* <[email protected] <mailto:[email protected]>> wrote:

    Hello,

    I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs
    (2 asics each, 8 qnems).
    They are configured in a MESH topology.
    I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.

    I'm booting PXE from IB, my initrd image is bringing the ib0
    interface, getting the squashfs image and mounting with aufs.

    The problem is: when booting more than 60 nodes, I start to get
    the errors below on the subnet manager.
    The problem seems to be intermittent, because each time it
    gives errors on a different path.

    Any ideas ?

    Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice:
    Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
    GID:fe80::5080:200:8d:9931
    Aug 24 15:36:19 713838 [48D7D940] 0x02 ->
    __osm_state_mgr_report_new_ports: Discovered new port with
    GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
    Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice:
    Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
    GID:fe80::5080:200:8d:9931
    Aug 24 15:36:19 713842 [48D7D940] 0x02 ->
    __osm_state_mgr_report_new_ports: Discovered new port with
    GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
    Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice:
    Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
    GID:fe80::5080:200:8d:9931
    Aug 24 15:36:19 713847 [48D7D940] 0x02 ->
    __osm_state_mgr_report_new_ports: Discovered new port with
    GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
    Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice:
    Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
    GID:fe80::5080:200:8d:9931
    Aug 24 15:36:19 713866 [48D7D940] 0x02 ->
    __osm_state_mgr_report_new_ports: Discovered new port with
    GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
    Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice:
    Reporting Generic Notice type:3 num:64 (GID in service) from LID:1
    GID:fe80::5080:200:8d:9931
    Aug 24 15:36:19 713871 [48D7D940] 0x02 ->
    __osm_state_mgr_report_new_ports: Discovered new port with
    GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
    Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
    Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
    __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
    for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
    NEM I4A) port 19. Adding to light sweep sampling list
    Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4
    hop path:
                    Path = 0,1,15,15,15
    Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
    __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
    for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
    NEM I4A) port 21. Adding to light sweep sampling list
    Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4
    hop path:
                    Path = 0,1,15,15,15
    Aug 24 15:36:19 714831 [48D7D940] 0x01 ->
    __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
    for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
    NEM I4A) port 25. Adding to light sweep sampling list
    Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4
    hop path:
                    Path = 0,1,15,15,15
    Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409:
    send completed with error (method=0x1 attr=0x15
    trans_id=0x4700036595) -- dropping
    Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411:
    DR SMP Hop Ptr: 0x0
    Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop
    path:
                    Initial path = 0,0,0,0,0,0
                    Return path  = 0,0,0,0,0,0
    Aug 24 15:36:20 514333 [4977E940] 0x01 ->
    __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
    (IB_TIMEOUT)
    Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
                    base_ver................0x1
                    mgmt_class..............0x81
                    class_ver...............0x1
                    method..................0x1 (SubnGet)
                    D bit...................0x0
                    status..................0x0
                    hop_ptr.................0x0
                    hop_count...............0x5
                    trans_id................0x36595
                    attr_id.................0x15 (PortInfo)
                    resv....................0x0
                    attr_mod................0x0
                    m_key...................0x0000000000000000
                    dr_slid.................65535
                    dr_dlid.................65535

                    Initial path: 0,1,15,15,15,19
                    Return path:  0,0,0,0,0,0
                    Reserved:     [0][0][0][0][0][0][0]

                    00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                    00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                    00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                    00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

    Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409:
    send completed with error (method=0x1 attr=0x15
    trans_id=0x4700036596) -- dropping
    Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411:
    DR SMP Hop Ptr: 0x0
    Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop
    path:
                    Initial path = 0,0,0,0,0,0
                    Return path  = 0,0,0,0,0,0
    Aug 24 15:36:20 514375 [4977E940] 0x01 ->
    __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error
    (IB_TIMEOUT)
    Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
                    base_ver................0x1
                    mgmt_class..............0x81
                    class_ver...............0x1
                    method..................0x1 (SubnGet)
                    D bit...................0x0
                    status..................0x0
                    hop_ptr.................0x0
                    hop_count...............0x5
                    trans_id................0x36596
                    attr_id.................0x15 (PortInfo)
                    resv....................0x0
    ....

These errors are transient as you indicate. They mean that some node has brought the link physically up but there is no SMA at the remote side of the link. The different paths are paths to the HCAs. This occurs during PXE boot as the node transitions from the boot ROM to the Linux environment.
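
To make the quoted log easier to follow: in the excerpt, the PortInfo query that times out uses the directed path from the preceding ERR 3315 dump with the reported port appended as the last hop (0,1,15,15,15 plus port 19 gives 0,1,15,15,15,19 with hop_count 0x5). A minimal Python sketch of that relationship follows; the function name is hypothetical and is not part of OpenSM.

```python
def extend_directed_path(path, port):
    """Return the directed route that probes the far side of `port`.

    `path` is the list printed by "Directed Path Dump". Its leading 0 is
    not counted as a hop, which is why 0,1,15,15,15 is a 4 hop path.
    """
    new_path = list(path) + [port]
    hop_count = len(new_path) - 1  # entries after the leading 0
    return new_path, hop_count


# Values taken from the log excerpt: path to the NEM, then port 19.
path, hops = extend_directed_path([0, 1, 15, 15, 15], 19)
print(",".join(map(str, path)), hex(hops))  # prints: 0,1,15,15,15,19 0x5
```

If nothing answers at the end of that route (no SMA yet on the booting node), the SubnGet(PortInfo) would be expected to end in IB_TIMEOUT, as in the dump.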
They are transient, but sometimes opensm hangs on the same message and loops these error messages. At first I was using the CentOS 5.3 kernel with updates, and IPoIB stopped working after these messages.
Using the "vanilla" centos 5.3 kernel solved this issue.
But sometimes, when booting the nodes, these messages appear and don't go away.
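
One way to tell transient ERR 3315 entries from ones that do not go away is to count how often each switch port is reported in the OpenSM log. The sketch below is an illustration only: the function name is hypothetical, the regular expression is derived from the log lines quoted above, and it assumes the log text is already available as a string.

```python
import re
from collections import Counter

# Pattern derived from the ERR 3315 lines in the quoted log.
ERR_3315 = re.compile(
    r"ERR 3315: Unknown remote side for node "
    r"(0x[0-9a-fA-F]+)\([^)]*\) port (\d+)"
)


def count_unknown_remote(log_text):
    """Count ERR 3315 reports per (node GUID, port)."""
    # Collapse whitespace so entries wrapped by a mail client still match.
    flat = " ".join(log_text.split())
    return Counter((guid, int(port)) for guid, port in ERR_3315.findall(flat))


sample = """
Aug 24 15:36:19 714805 [48D7D940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
NEM I4A) port 19. Adding to light sweep sampling list
Aug 24 15:36:19 714822 [48D7D940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side
for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched
NEM I4A) port 21. Adding to light sweep sampling list
"""

counts = count_unknown_remote(sample)
for (guid, port), n in sorted(counts.items()):
    print(f"{guid} port {port}: {n} report(s)")
```

A port that is reported once during a boot would fit the boot ROM to Linux transition described above, while a port whose count keeps growing across sweeps would be the one to look at more closely.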
Other than these messages, do things seem to work in terms of the end nodes?
They seem to work with vanilla kernel. Even with the messages, no problems reaching the nodes so far.

Tks

Rafael Tinoco
-- Hal

    _______________________________________________
    general mailing list
    [email protected] <mailto:[email protected]>
    http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

    To unsubscribe, please visit
    http://openib.org/mailman/listinfo/openib-general

