On Mon, 2007-05-21 at 22:23, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
>
> >So there is no link between the 2 switches, right ?
> >
> That is right.
>
> >Is there anything being done ? Cables pulled and reinserted ? Is
> >anything changing or is this a "stable" configuration in terms of the
> >topology ?
> >
> There were no configuration changes from the cable or switch
> perspective. But nodes were being rebooted.
>
> >Is this the only thing going on on the subnet ?
> >
> That was ipoib but no other ulp modules. There was a proprietary ulp
> module which creates udqp and joins broadcast
> group and discovers nodes and sets up rcqps. There was no traffic being run.
>
> >So it did finally become master ?
> >
> Yes, from the /var/log/opensm1.log it looks like it became master. But
> it was not responding to
> link local broadcast join operations. It was failing with -110,
> Connection timed out.
>
> >I take it LID 6 is local (vortex31-83).
> >
> >This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
> >try OFED 1.2 ?
> >
> It is OFED 1.1 released stack. I have seen this problem with OFED 1.0
> also.
> Trying with OFED 1.2 may take much longer time, since we need to port
> our stuff.

Can you at least use OFED 1.2 management (OpenSM and management
libraries) with the rest being OFED 1.1 ? There are a number of bugs
which have been fixed which might affect this. The one I can think of
off the top of my head is a fix to atomics in OpenSM's complib. I think
that was found and fixed post OFED 1.1. I'll confirm this tomorrow.

There may also be some important kernel differences (in user_mad.c or
mad.c) which might be relevant.

> >What kernel is being used ? What distro ? What processor architecture ?
> >
> 2.6.9-22.EL RHEL 4.2 Dual Core AMD Opteron(tm) Processor 270 HE
>
> >Is this around the time of the error or just an error in the OpenSM log
> >?
> >
> The logs were frozen after these error messages. No new entries were
> being written to the log files.
> After doing "sminfo -s3" I saw some messages indicating that it
> moved to MASTER state and other messages.
>
> May 21 00:40:28 013290 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0007 TID:0x0000000000000003
> May 21 00:40:28 013431 [41401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0007 GID:0xfe80000000000000,0x005045014a2e0001
> May 21 00:40:28 818202 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000135b) -- dropping
> May 21 00:40:28 819089 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 00:40:28 819110 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 21 00:40:28 819145 [45007960] -> SMP dump:
> ...
> May 21 00:40:28 819247 [41E02960] -> Entering STANDBY state
> May 21 14:04:17 204871 [45007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call)
> May 21 14:06:08 022096 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x20 trans_id=0x100000264f) -- dropping
> May 21 14:06:08 022132 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:08 022145 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:08 022182 [45007960] -> SMP dump:
> ...
> May 21 14:06:38 035957 [41401960] -> Entering MASTER state
> May 21 14:06:38 038818 [42803960] -> osm_subn_set_up_down_min_hop_table: BFS through all port guids in the subnet ]
> May 21 14:06:38 038886 [42803960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
> May 21 14:06:38 046438 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x000C TID:0x0000000000000ec4
> May 21 14:06:38 046565 [41401960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x000C GID:0xfe80000000000000,0x000b8cffff0024f9
> May 21 14:06:38 108660 [42803960] -> SUBNET UP
> May 21 14:06:38 402900 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0001 TID:0x0000000000000000
> May 21 14:06:38 403007 [41401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0001 GID:0xfe80000000000000,0x0002c9020020f5c5
> May 21 14:06:38 914806 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x20 trans_id=0x10000026f0) -- dropping
> May 21 14:06:38 914823 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 14:06:38 914864 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 21 14:06:38 914899 [45007960] -> SMP dump:
>
> >Did this change from 0 to 1 around
> >the time of the issue with the SM
> >mastership ?
> >
> Not sure, I just got the snapshot when I saw this problem.
>
> >Also, what are the port counters for the switch ports in use ?
> >
> [EMAIL PROTECTED] ~]# ibnetdiscover

I was referring to using perfquery, not ibnetdiscover.

> ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed, skipping port

Was this node rebooting while you did this or is there some other issue ?

> #
> # Topology file: generated on Mon May 21 02:11:34 2007
> #
> # Max of 2 hops discovered
> # Initiated from node 005045014a3a0000 port 005045014a3a0001
>
> vendid=0x2c9
> devid=0xb924
> sysimgguid=0xb8cffff0024f9
> switchguid=0xb8cffff0024f9
> Switch 24 "S-000b8cffff0024f9" # MT47396 Infiniscale-III Mellanox Technologies base port 0 lid 12 lmc 0
> [18] "H-005045014a2e0000"[1]
> [11] "H-0002c902002048b0"[1]
> [10] "H-0002c9020020f584"[1]
> [19] "H-005045014a3a0000"[1]

So run these (before and after):
perfquery 12 18
perfquery 12 11
perfquery 12 10
perfquery 12 19
and
perfquery 12 9

-- Hal

> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a2e0003
> caguid=0x5045014a2e0000
> Ca 2 "H-005045014a2e0000" # vortex3l-84 HCA-1
> [1] "S-000b8cffff0024f9"[18] # lid 7 lmc 0
>
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c902002048b3
> caguid=0x2c902002048b0
> Ca 2 "H-0002c902002048b0" # MT25218 InfiniHostEx Mellanox Technologies
> [1] "S-000b8cffff0024f9"[11] # lid 5 lmc 0
>
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x2c9020020f587
> caguid=0x2c9020020f584
> Ca 2 "H-0002c9020020f584" # MT25218 InfiniHostEx Mellanox Technologies
> [1] "S-000b8cffff0024f9"[10] # lid 8 lmc 0
>
> vendid=0x2c9
> devid=0x6282
> sysimgguid=0x5045014a3a0003
> caguid=0x5045014a3a0000
> Ca 2 "H-005045014a3a0000" # vortex3l-83 HCA-1
> [1] "S-000b8cffff0024f9"[19] # lid 6 lmc 0
> [EMAIL PROTECTED] ~]#
>
> >Perhaps later; not just yet.
> >
> >Are they all the same ?
> >
> More or less they are same.
> All of them have 9 threads and each thread
> is blocking for some event.
>
> VBabu
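
The before/after perfquery comparison requested above (switch LID 12, ports 18, 11, 10, 19 and 9) could be scripted roughly as follows. This is a minimal sketch and not something from the thread itself: it assumes perfquery prints one counter per line in the form Name:....value, and the helper names parse_counters, changed_counters and snapshot are hypothetical.

```python
import re
import subprocess

# Switch LID and port numbers taken from the ibnetdiscover output in this thread.
SWITCH_LID = 12
PORTS = [18, 11, 10, 19, 9]

# Assumed perfquery line shape: "SymbolErrors:....................0".
# Header lines and hex fields (e.g. CounterSelect) do not match and are skipped.
COUNTER_RE = re.compile(r"^(\w+):\.*(\d+)\s*$")


def parse_counters(text):
    """Return {counter_name: int} for each decimal counter line in text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def changed_counters(before, after):
    """Return {name: delta} for counters whose value differs between snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def snapshot(lid=SWITCH_LID, ports=PORTS):
    """Run `perfquery <lid> <port>` per port; needs perfquery on the PATH."""
    result = {}
    for port in ports:
        proc = subprocess.run(
            ["perfquery", str(lid), str(port)],
            capture_output=True, text=True, check=True,
        )
        result[port] = parse_counters(proc.stdout)
    return result
```

Taking one snapshot() before reproducing the problem and one after, then calling changed_counters() on each port's pair of dictionaries, would show which counters moved. A port that does not answer (as port 9 did for ibnetdiscover) would make subprocess.run raise here, so that case would need handling in real use.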
