Hi Eitan, Hi Sasha,
On Sun, 2006-03-05 at 00:22 +0200, Sasha Khapyorsky wrote:
> > On the other hand if you run osm with -d1 option (mostly
> > single-threaded), then it seems to work indefinitely.
>
> I've tried your script and don't see any difference between modes with
> and without -d1, however my network is small - two hosts and switch,
> probably this is different from your.
>

No, I am testing this on a small setup: two cpus, two switches, one extra machine running osm.

Here's the output from ibnetdiscover:

#
# Topology file: generated on Mon Mar 6 11:36:23 2006
#
# Max of 3 hops discovered
# Initiated from node 0002c90200007afc port 0002c90200007afd

vendid=0x2c9
devid=0xb924
switchguid=0x1393010b186ba0
Switch 24 "S-001393010b186ba0" # MT47396 Infiniscale-III Mellanox Technologies port 0 lid 9
[21] "H-001393000024a510"[1]
[17] "H-001393000024a600"[1]
[4] "S-001393010b186b08"[12]
[3] "S-001393010b186b08"[11]
[2] "S-001393010b186b08"[10]
[1] "S-001393010b186b08"[9]
[8] "H-0002c90200007afc"[1]

vendid=0x2c9
devid=0xb924
switchguid=0x1393010b186b08
Switch 24 "S-001393010b186b08" # MT47396 Infiniscale-III Mellanox Technologies port 0 lid 8
[21] "H-001393000024a510"[2]
[17] "H-001393000024a600"[2]
[8] "H-0002c90200007afc"[2]
[12] "S-001393010b186ba0"[4]
[11] "S-001393010b186ba0"[3]
[10] "S-001393010b186ba0"[2]
[9] "S-001393010b186ba0"[1]

vendid=0x2c9
devid=0x6282
sysimgguid=0x1393000024a516
caguid=0x1393000024a510
Ca 2 "H-001393000024a510" # MT25218 InfiniHostEx Mellanox Technologies
[2] "S-001393010b186b08"[21] # lid 20 lmc 2
[1] "S-001393010b186ba0"[21] # lid 16 lmc 2

vendid=0x2c9
devid=0x6282
sysimgguid=0x1393000024a606
caguid=0x1393000024a600
Ca 2 "H-001393000024a600" # MT25218 InfiniHostEx Mellanox Technologies
[2] "S-001393010b186b08"[17] # lid 28 lmc 2
[1] "S-001393010b186ba0"[17] # lid 24 lmc 2

vendid=0x2c9
devid=0x5a44
sysimgguid=0x2c90200007afc
caguid=0x2c90200007afc
Ca 2 "H-0002c90200007afc" # MT23108 InfiniHost Mellanox Technologies
[2] "S-001393010b186b08"[8] # lid 12 lmc 2
[1] "S-001393010b186ba0"[8] # lid 4 lmc 2

There may be a number of possibly significant differences between my setup and yours, though: both cpus are quad-Opterons, and the machine running osm is a dual Xeon, which is also where osm was compiled. So it's all 64-bit and SMP. The firmware versions are 0.7.0 for the switches and 5.1.0 for the cpus. The osm host has 3.3.3.

One more detail: I am running with LMC=2 because I wanted to check that the LMC>0 problems were fixed (they seem to be; I do not see any LMC-related misbehaviour, and with -d1 everything looks shipshape).

> Also I see that finally port becomes active but after delay. Those
> delays look strange and inconsistent, I will need to test more tomorrow.
> Could you try such modification for your script?
>
> i=1
> while true; do
>     modprobe -r ib_mthca
>     sleep 3
>     modprobe ib_mthca
>     count=0
>     while true ; do
>         ibstat | egrep 'State: Active$' > /dev/null
>         test $? -eq 0 && break
>         count=`expr $count + 1`
>         sleep 1
>     done
>     echo $i: delay $count
>     sleep 3
>     i=`expr $i + 1`
> done
>

Here's the output from your script. After the last line it doesn't make any further progress (I waited something like 10 minutes). To address Eitan's comment, I also tried the same thing with a delay of 7 seconds rather than 3 between modprobe -r and modprobe.
The results are the same:

1: delay 0
2: delay 0
3: delay 0
4: delay 0
5: delay 0
6: delay 0
<nothing happens>

In case it contains useful clues, here's a sample of osm's log at around the point things start falling apart:

Mar 06 11:31:36 036291 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009 TID:0x00000000000000c4
Mar 06 11:31:36 036452 [40A04960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x001393010b186ba0
Mar 06 11:31:36 044333 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008 TID:0x00000000000000c8
Mar 06 11:31:36 044921 [40A04960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0000 GID:0xfe80000000000000,0x001393010b186b08
Mar 06 11:31:36 056540 [40401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 056562 [40401960] -> Discovered new port with GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 056570 [40401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 056578 [40401960] -> Discovered new port with GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 056673 [40401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
Mar 06 11:31:36 082257 [40A04960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
Mar 06 11:31:36 446369 [40602960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0010 TID:0x0000000000000000
Mar 06 11:31:36 446400 [40401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0014 TID:0x0000000000000001
Mar 06 11:31:36 446614 [40602960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0010 GID:0xfe80000000000000,0x001393000024a511
Mar 06 11:31:36 446657 [40401960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0014 GID:0xfe80000000000000,0x001393000024a512
Mar 06 11:31:36 465919 [40401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
Mar 06 11:31:36 473124 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473151 [40A04960] -> Removed port with GUID:0x001393000024a601 LID range [0x18,0x1B] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 473196 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473209 [40A04960] -> Removed port with GUID:0x001393000024a602 LID range [0x1C,0x1F] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 473526 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473568 [40A04960] -> Removed port with GUID:0x001393010b186b08 LID range [0x8,0x8] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:36 473710 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473722 [40A04960] -> Removed port with GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 473758 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 473770 [40A04960] -> Removed port with GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx Mellanox Technologies
Mar 06 11:31:36 474015 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 474050 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x9,0x9] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:36 474133 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 474165 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:36 474238 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:36 474249 [40A04960] -> Removed port with GUID:0x0002c90200007afe LID range [0xC,0xF] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:36 474267 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2756
Mar 06 11:31:36 474283 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2758
Mar 06 11:31:36 474541 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:36 474577 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2757
Mar 06 11:31:36 474807 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x275a
Mar 06 11:31:36 474827 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2759
Mar 06 11:31:36 474814 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x275b
Mar 06 11:31:36 474903 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x275c
Mar 06 11:31:36 474999 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x275f
Mar 06 11:31:36 475003 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x275e
Mar 06 11:31:36 475024 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2760
Mar 06 11:31:36 475038 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x275d
Mar 06 11:31:36 475089 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2761
Mar 06 11:31:36 475140 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2762
Mar 06 11:31:36 475158 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2763
Mar 06 11:31:36 475173 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2764
Mar 06 11:31:36 475231 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2765
Mar 06 11:31:36 475248 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2766
Mar 06 11:31:36 475295 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2767
Mar 06 11:31:36 475332 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2768
Mar 06 11:31:36 475367 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x276a
Mar 06 11:31:36 475350 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x2769
Mar 06 11:31:36 475432 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x276c
Mar 06 11:31:36 475416 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x276b
Mar 06 11:31:36 475492 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x276d
Mar 06 11:31:36 475522 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x276e
Mar 06 11:31:36 475634 [40401960] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS(3) in state OSM_SM_STATE_IDLE
Mar 06 11:31:38 040389 [40602960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:38 040409 [40602960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:38 040419 [40602960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
Mar 06 11:31:38 040463 [40602960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:38 040474 [40602960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:38 040486 [40803960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
Mar 06 11:31:38 040587 [40602960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:44 280928 [40401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009 TID:0x00000000000000c5
Mar 06 11:31:44 280976 [40A04960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008 TID:0x00000000000000c9
Mar 06 11:31:44 282252 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 282266 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:44 282274 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
Mar 06 11:31:44 282304 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 282315 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:44 282327 [40602960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
Mar 06 11:31:44 282441 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:44 283808 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 283821 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:44 283829 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102: Node 0x001393010b186ba0 not in switch table
Mar 06 11:31:44 283859 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:44 283869 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:44 283882 [40401960] -> osm_si_rcv_process: ERR 3606: SwitchInfo received for nonexistent node with GUID = 0x1393010b186ba0
Mar 06 11:31:44 283967 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:48 047137 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:48 047201 [40A04960] -> Removed port with GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III Mellanox Technologies
Mar 06 11:31:48 047290 [40A04960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
Mar 06 11:31:48 047310 [40A04960] -> Removed port with GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost Mellanox Technologies
Mar 06 11:31:48 047451 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
Mar 06 11:31:48 047537 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0 for parent node GUID = 0x1393010b186ba0, TID = 0x278d
Mar 06 11:31:48 047543 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port object for port with GUID = 0x1393010b186ba0

--
Jean-Christophe Hugly <[EMAIL PROTECTED]>
PANTA
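
For anyone reading a log like the one above, a rough tally of the ERR codes may help show which failure dominates (here it would presumably be 0F06). This is only a minimal sketch: the function name count_osm_errors is made up for illustration, and it assumes the log file has one entry per line with the "ERR xxxx:" form seen in the sample.

```shell
#!/bin/sh
# count_osm_errors FILE
# Hypothetical helper (not part of osm): prints each "ERR xxxx" code found
# in FILE together with how many times it occurs, most frequent first.
# Assumes one log entry per line, with codes written as "ERR 0F06:".
count_osm_errors() {
    # Pull out just the "ERR xxxx" token from lines that contain one,
    # then count identical tokens and sort by descending count.
    sed -n 's/.*\(ERR [0-9A-Fa-f]\{4\}\):.*/\1/p' "$1" |
        sort | uniq -c | sort -rn
}
```

It would be called as "count_osm_errors /var/log/osm.log", where the path is an assumption; use whatever file your osm writes its log to.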
_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general