Hi Rafael,

On 8/25/09, Rafael David Tinoco <[email protected]> wrote:
> Hello Hal,
>
> Below...
>
> Hal Rosenstock wrote:
> > On 8/24/09, Rafael David Tinoco <[email protected]> wrote:
> > > Hello,
> > >
> > > I'm installing an HPC cluster using 2 Sun Blades 6048 with QNEMs
> > > (2 ASICs each, 8 QNEMs).
> > > They are configured in a MESH topology.
> > > I'm using Centos 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.
> > >
> > > I'm booting PXE from IB; my initrd image brings up the ib0 interface,
> > > gets the squashfs image and mounts it with aufs.
> > >
> > > The problem is: when booting more than 60 nodes, I start to get the
> > > errors below on the subnet manager.
> > > The problem seems to be intermittent, because each time it gives
> > > errors on a different path.
> > >
> > > Any ideas?
> > >
> > > Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
> > > Aug 24 15:36:19 713838 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
> > > Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
> > > Aug 24 15:36:19 713842 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
> > > Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
> > > Aug 24 15:36:19 713847 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
> > > Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
> > > Aug 24 15:36:19 713866 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
> > > Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
> > > Aug 24 15:36:19 713871 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
> > > Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
> > > Aug 24 15:36:19 714805 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19. Adding to light sweep sampling list
> > > Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
> > > Path = 0,1,15,15,15
> > > Aug 24 15:36:19 714822 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21. Adding to light sweep sampling list
> > > Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
> > > Path = 0,1,15,15,15
> > > Aug 24 15:36:19 714831 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25. Adding to light sweep sampling list
> > > Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
> > > Path = 0,1,15,15,15
> > > Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) -- dropping
> > > Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
> > > Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
> > > Initial path = 0,0,0,0,0,0
> > > Return path = 0,0,0,0,0,0
> > > Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> > > Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
> > > base_ver................0x1
> > > mgmt_class..............0x81
> > > class_ver...............0x1
> > > method..................0x1 (SubnGet)
> > > D bit...................0x0
> > > status..................0x0
> > > hop_ptr.................0x0
> > > hop_count...............0x5
> > > trans_id................0x36595
> > > attr_id.................0x15 (PortInfo)
> > > resv....................0x0
> > > attr_mod................0x0
> > > m_key...................0x0000000000000000
> > > dr_slid.................65535
> > > dr_dlid.................65535
> > >
> > > Initial path: 0,1,15,15,15,19
> > > Return path: 0,0,0,0,0,0
> > > Reserved: [0][0][0][0][0][0][0]
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > >
> > > Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) -- dropping
> > > Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
> > > Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
> > > Initial path = 0,0,0,0,0,0
> > > Return path = 0,0,0,0,0,0
> > > Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> > > Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
> > > base_ver................0x1
> > > mgmt_class..............0x81
> > > class_ver...............0x1
> > > method..................0x1 (SubnGet)
> > > D bit...................0x0
> > > status..................0x0
> > > hop_ptr.................0x0
> > > hop_count...............0x5
> > > trans_id................0x36596
> > > attr_id.................0x15 (PortInfo)
> > > resv....................0x0
> > > ....
> >
> > These errors are transient as you indicate. They mean that some node has
> > brought the link physically up but there is no SMA at the remote side of
> > the link. The different paths are paths to the HCAs. This occurs during
> > PXE boot as the node transitions from the boot ROM to the Linux
> > environment.
>
> They are transient, but sometimes opensm hangs with the same message and
> loops these error messages.

Are you sure OpenSM hangs? If so, any idea where?

> First I was using the centos 5.3 kernel with updates, and IPoIB stopped
> working after these messages.

Any specifics?

> Using the "vanilla" centos 5.3 kernel solved this issue.
> But SOMETIMES, booting the nodes, these messages appear and don't go away.

In those cases, do the nodes successfully boot up?

> > Other than these messages, do things seem to work in terms of the end
> > nodes?
>
> They seem to work with the vanilla kernel. Even with the messages, no
> problems reaching the nodes so far.

Do your ULPs work (like IPoIB, etc.)?

-- Hal

> Tks
>
> Rafael Tinoco
>
> > -- Hal
> >
> > > _______________________________________________
> > > general mailing list
> > > [email protected]
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > >
> > > To unsubscribe, please visit
> > > http://openib.org/mailman/listinfo/openib-general
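
One way to tell whether the ERR 3315 entries are transient or stuck on the same ports is to tally them per node GUID and port across successive sweeps. The following is a minimal Python sketch, assuming only the log format quoted in this thread; the function name and the sample text are hypothetical and not part of OpenSM.

```python
import re
from collections import Counter

# Pattern inferred from the "ERR 3315: Unknown remote side" entries quoted in
# this thread; other OpenSM versions may format the message differently.
ERR_3315 = re.compile(
    r"ERR 3315: Unknown remote side for node\s+(0x[0-9a-fA-F]+)\s*"
    r"\([^)]*\)\s+port\s+(\d+)"
)


def count_unknown_remote_side(log_text):
    """Return a Counter keyed by (node_guid, port) for ERR 3315 entries."""
    # Collapse whitespace so entries wrapped across lines still match.
    cleaned = re.sub(r"\s+", " ", log_text)
    return Counter((guid, int(port)) for guid, port in ERR_3315.findall(cleaned))


# Hypothetical sample: three entries copied from the log in this thread.
SAMPLE = """
Aug 24 15:36:19 714805 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start:
ERR 3315: Unknown remote side for node
0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19.
Aug 24 15:36:19 714822 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start:
ERR 3315: Unknown remote side for node
0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21.
Aug 24 15:36:19 714831 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start:
ERR 3315: Unknown remote side for node
0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25.
"""

counts = count_unknown_remote_side(SAMPLE)
for (guid, port), n in sorted(counts.items()):
    print(f"{guid} port {port}: {n} occurrence(s)")
```

If the same (GUID, port) pairs keep accumulating long after the blades should have finished booting, that would point to a persistent problem rather than the PXE-to-Linux transition described above.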
