Hello,

I'm installing an HPC cluster using 2 Sun Blade 6048 chassis with QNEMs (2 ASICs each, 8 QNEMs).
They are configured in a mesh topology.
I'm using CentOS 5.3, OFED 1.4.1 and kernel 2.6.18-128.el5.

I'm booting via PXE over IB. My initrd image brings up the ib0 interface, fetches the squashfs image and mounts it with aufs.

The problem is: when booting more than 60 nodes, I start to get the errors below on the subnet manager. The problem seems to be intermittent, because each time the errors show up on a different path.

Any ideas?

Aug 24 15:36:19 713836 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713838 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d9381 LID range [78,78] of node:b03n06 HCA-1
Aug 24 15:36:19 713840 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713842 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d4689 LID range [76,76] of node:b03n04 HCA-1
Aug 24 15:36:19 713845 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713847 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008e5191 LID range [82,82] of node:b03n11 HCA-1
Aug 24 15:36:19 713849 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713866 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008d94c9 LID range [80,80] of node:b03n08 HCA-1
Aug 24 15:36:19 713869 [48D7D940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:1 GID:fe80::5080:200:8d:9931
Aug 24 15:36:19 713871 [48D7D940] 0x02 -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x50800200008daedd LID range [83,83] of node:b03n12 HCA-1
Aug 24 15:36:19 714782 [48D7D940] 0x02 -> SUBNET UP
Aug 24 15:36:19 714805 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 19. Adding to light sweep sampling list
Aug 24 15:36:19 714812 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
               Path = 0,1,15,15,15
Aug 24 15:36:19 714822 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 21. Adding to light sweep sampling list
Aug 24 15:36:19 714827 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
               Path = 0,1,15,15,15
Aug 24 15:36:19 714831 [48D7D940] 0x01 -> __osm_state_mgr_light_sweep_start: ERR 3315: Unknown remote side for node 0x0021283a85260040(Sun Blade 6048 InfiniBand QDR Switched NEM I4A) port 25. Adding to light sweep sampling list
Aug 24 15:36:19 714835 [48D7D940] 0x01 -> Directed Path Dump of 4 hop path:
               Path = 0,1,15,15,15
Aug 24 15:36:20 514302 [4977E940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x15 trans_id=0x4700036595) -- dropping
Aug 24 15:36:20 514321 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
Aug 24 15:36:20 514328 [4977E940] 0x01 -> Received SMP on a 5 hop path:
               Initial path = 0,0,0,0,0,0
               Return path  = 0,0,0,0,0,0
Aug 24 15:36:20 514333 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514352 [4977E940] 0x01 -> SMP dump:
               base_ver................0x1
               mgmt_class..............0x81
               class_ver...............0x1
               method..................0x1 (SubnGet)
               D bit...................0x0
               status..................0x0
               hop_ptr.................0x0
               hop_count...............0x5
               trans_id................0x36595
               attr_id.................0x15 (PortInfo)
               resv....................0x0
               attr_mod................0x0
               m_key...................0x0000000000000000
               dr_slid.................65535
               dr_dlid.................65535

               Initial path: 0,1,15,15,15,19
               Return path:  0,0,0,0,0,0
               Reserved:     [0][0][0][0][0][0][0]

               00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

               00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

               00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

               00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Aug 24 15:36:20 514364 [4977E940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x15 trans_id=0x4700036596) -- dropping
Aug 24 15:36:20 514367 [4977E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0
Aug 24 15:36:20 514372 [4977E940] 0x01 -> Received SMP on a 5 hop path:
               Initial path = 0,0,0,0,0,0
               Return path  = 0,0,0,0,0,0
Aug 24 15:36:20 514375 [4977E940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Aug 24 15:36:20 514391 [4977E940] 0x01 -> SMP dump:
               base_ver................0x1
               mgmt_class..............0x81
               class_ver...............0x1
               method..................0x1 (SubnGet)
               D bit...................0x0
               status..................0x0
               hop_ptr.................0x0
               hop_count...............0x5
               trans_id................0x36596
               attr_id.................0x15 (PortInfo)
               resv....................0x0
....
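
To check whether the failures really move between paths from one boot to the next, a small script along these lines could tally the ERR codes and directed paths in the subnet manager log. This is only a rough sketch: the regular expressions are assumptions based on the log lines above, not on any documented OpenSM log format, and the function name tally_errors is made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumed patterns, derived only from the log excerpt above.
ERR_RE = re.compile(r"ERR ([0-9A-Fa-f]{4}):")
# Matches "Path = 0,1,15" and "Initial path: 0,1,15,19" but deliberately
# skips the "Initial path = ..." / "Return path = ..." lines, which are
# all zeros in the excerpt.
PATH_RE = re.compile(r"(?:Initial path:|Path =)\s*([0-9,]+)")


def tally_errors(lines):
    """Count ERR codes and directed paths in an iterable of log lines."""
    errors = Counter()
    paths = Counter()
    for line in lines:
        # A single line may carry more than one ERR code.
        errors.update(ERR_RE.findall(line))
        match = PATH_RE.search(line)
        if match:
            paths[match.group(1)] += 1
    return errors, paths


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally.py <path to the subnet manager log>
    with open(sys.argv[1]) as fh:
        err_counts, path_counts = tally_errors(fh)
    for code, count in err_counts.most_common():
        print("ERR %s: %d" % (code, count))
    for path, count in path_counts.most_common():
        print("path %s: %d" % (path, count))
```

Running it against the logs from several boots and comparing the path counts would show whether the same links keep failing or the errors are spread across the fabric.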


