On Mon, 2007-05-21 at 18:00, Venkatesh Babu wrote: > Configuration: > - 4 nodes in the IB network with two nodes running OpenSMs. > - Each node has MT25218 InfiniHostEx Mellanox with two IB ports and > with firmware version 5.2.0 > - All node's IB port 1 is connected to IB Switch1, say subnet1 > - All node's IB port 2 is connected to IB Switch2, say subnet2
So there is no link between the 2 switches, right ? > - vortex3l-83 has two opensm's for each subnet with priority 0 > - vortex3l-84 has two opensm's for each subnet with priority 13 > > Problem: > > The problem is opensm's on both the machines are in Standy state and none of > them are > claiming the mastership, though they have different priorities 0 and 13. Most > of the times > this configuration works fine, but ocassionally it is getting into this > problem. It is hard > to reproduce this problem. Is there anything being done ? Cables pulled and reinserted ? Is anything changing or is this a "stable" configuration in terms of the topology ? Is this the only thing going on on the subnet ? > I tried to set the mastership of the opensm but it didn't worked. > [EMAIL PROTECTED] ~]# sminfo -s3 > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5043 priority 0 > state 2 SMINFO_STANDBY > > After couple of minutes > [EMAIL PROTECTED] ~]# sminfo > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5938 priority 0 > state 3 SMINFO_MASTER So it did finally become master ? I take it LID 6 is local (vortex31-83). > Data: > > [EMAIL PROTECTED] ~]# sminfo > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4937 priority 0 > state > 2 SMINFO_STANDBY > [EMAIL PROTECTED] ~]# ps -aux | grep open > Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ > root 5237 0.0 0.0 92848 1692 ? Sl 00:39 0:00 /usr/bin/opensm > -g 0x005045014a3a0001 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm1.log > root 5250 0.0 0.0 92848 1700 ? Sl 00:39 0:00 /usr/bin/opensm > -g 0x005045014a3a0002 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm2.log > root 8356 0.0 0.0 51064 708 pts/0 S+ 13:40 0:00 grep open > > > [EMAIL PROTECTED] ~]# sminfo > sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4939 priority 0 > state > 2 SMINFO_STANDBY > [EMAIL PROTECTED] ~]# ps -aux | grep open > Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ > root 5871 0.0 0.0 92848 1560 ? Sl 00:40 0:00 /usr/bin/opensm > -g 0x005045014a2e0001 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm1.log > root 5884 0.0 0.0 92848 1568 ? Sl 00:40 0:00 /usr/bin/opensm > -g 0x005045014a2e0002 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm2.log > root 8845 0.0 0.0 51084 668 pts/0 S+ 13:40 0:00 grep open > > > But ibv_devinfo on vortex3l-83 shows that both ports are active and sm_lid and > lid are same, indicating it is master. Looks like it is the stale information. > > [EMAIL PROTECTED] ~]# ibv_devinfo > hca_id: mthca0 > fw_ver: 5.2.0 > node_guid: 0050:4501:4a3a:0000 > sys_image_guid: 0050:4501:4a3a:0003 > vendor_id: 0x02c9 > vendor_part_id: 25218 > hw_ver: 0xA0 > board_id: ARM0020000001 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 6 > port_lid: 6 > port_lmc: 0x00 > > port: 2 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 1 > port_lid: 1 > port_lmc: 0x00 > > [EMAIL PROTECTED] ~]# ibv_devinfo > hca_id: mthca0 > fw_ver: 5.2.0 > node_guid: 0050:4501:4a2e:0000 > sys_image_guid: 0050:4501:4a2e:0003 > vendor_id: 0x02c9 > vendor_part_id: 25218 > hw_ver: 0xA0 > board_id: ARM0020000001 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 6 > port_lid: 7 > port_lmc: 0x00 > > port: 2 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 1 > port_lid: 2 > port_lmc: 0x00 > > > > And also in /var/log/opensm[1/2].log I see the following error messages - > > May 21 00:40:28 250119 [95A9A160] -> OpenSM Rev:openib-2.0.5 OpenIB svn 4954M This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you try OFED 1.2 ? What kernel is being used ? What distro ? What processor architecture ? > May 21 00:40:28 484648 [95A9A160] -> osm_vendor_bind: Binding to port > 0x5045014a2e0001 > May 21 00:40:28 487418 [95A9A160] -> osm_vendor_bind: Binding to port > 0x5045014a2e0001 > May 21 00:40:29 292689 [45007960] -> umad_receiver: ERR 5409: send completed > with error (method=0x1 attr=0x11 trans_id=0x100000125f) -- dropping > May 21 00:40:29 292728 [45007960] -> umad_receiver: ERR 5411: DR SMP > May 21 00:40:29 292741 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: > MAD completed in error (IB_TIMEOUT) > May 21 00:40:29 292818 [45007960] -> SMP dump: Is this around the time of the error or just an error in the OpenSM log ? > I found that for both ports on both vortex boxes I see the port_xmit_discards > counter was 1. Did this change from 0 to 1 around the time of the issue with the SM mastership ? Also, what are the port counters for the switch ports in use ? > Other error counters seems to be zero. Looks like some packets > has been transmitted and received on both machines. > > [EMAIL PROTECTED] ~]# cat > /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards > 1 > [EMAIL PROTECTED] ~]# cat > /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards > 1 > > [EMAIL PROTECTED] ~]# cat > /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards > 1 > [EMAIL PROTECTED] ~]# cat > /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards > 1 > > [EMAIL PROTECTED] ~]# cat > /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards > 1 > > > Link speed seems to be set to 10 Gb/sec (4X) on all machines. > > > I have the opensm logs and gdb output for all the opensms. If you want I > can send it to you. Perhaps later; not just yet. > Just attaching one sample gdb output with stack traces of all threads. Are they all the same ? -- Hal > [EMAIL PROTECTED] ~]# gdb /usr/bin/opensm 5237 > GNU gdb Red Hat Linux (6.3.0.0-1.63rh) > Copyright 2004 Free Software Foundation, Inc. > GDB is free software, covered by the GNU General Public License, and you are > welcome to change it and/or distribute copies of it under certain conditions. > Type "show copying" to see the conditions. > There is absolutely no warranty for GDB. Type "show warranty" for details. > This GDB was configured as "x86_64-redhat-linux-gnu"... > (no debugging symbols found) > Using host libthread_db library "/lib64/tls/libthread_db.so.1". > > Attaching to program: /usr/bin/opensm, process 5237 > Reading symbols from /usr/lib/libopensm.so.1...done. > Loaded symbols for /usr/lib/libopensm.so.1 > Reading symbols from /usr/lib/libosmcomp.so.1...done. > Loaded symbols for /usr/lib/libosmcomp.so.1 > Reading symbols from /lib64/tls/libpthread.so.0...done. > [Thread debugging using libthread_db enabled] > [New Thread 182904123744 (LWP 5237)] > [New Thread 1157658976 (LWP 5267)] > [New Thread 1147169120 (LWP 5266)] > [New Thread 1136679264 (LWP 5265)] > [New Thread 1126189408 (LWP 5264)] > [New Thread 1115699552 (LWP 5263)] > [New Thread 1105209696 (LWP 5262)] > [New Thread 1094719840 (LWP 5261)] > [New Thread 1084229984 (LWP 5253)] > Loaded symbols for /lib64/tls/libpthread.so.0 > Reading symbols from /usr/lib/libosmvendor.so.2...done. > Loaded symbols for /usr/lib/libosmvendor.so.2 > Reading symbols from /usr/lib/libibcommon.so.1...done. > Loaded symbols for /usr/lib/libibcommon.so.1 > Reading symbols from /usr/lib/libibumad.so.1...done. > Loaded symbols for /usr/lib/libibumad.so.1 > Reading symbols from /lib64/tls/libc.so.6...done. > Loaded symbols for /lib64/tls/libc.so.6 > Reading symbols from /lib64/ld-linux-x86-64.so.2...done. > Loaded symbols for /lib64/ld-linux-x86-64.so.2 > 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > (gdb) info threads > 9 Thread 1084229984 (LWP 5253) 0x0000002a9589facf in > pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 8 Thread 1094719840 (LWP 5261) 0x0000002a9589f8da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 7 Thread 1105209696 (LWP 5262) 0x0000002a9589f8da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 6 Thread 1115699552 (LWP 5263) 0x0000002a9589f8da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 5 Thread 1126189408 (LWP 5264) 0x0000002a9589f8da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 4 Thread 1136679264 (LWP 5265) 0x0000002a9589f8da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 3 Thread 1147169120 (LWP 5266) 0x0000002a9589facf in > pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 2 Thread 1157658976 (LWP 5267) 0x0000002a95d7fd22 in poll () > from /lib64/tls/libc.so.6 > 1 Thread 182904123744 (LWP 5237) 0x0000002a95d51d65 in __nanosleep_nocancel > () from /lib64/tls/libc.so.6 > (gdb) thread 1 > [Switching to thread 1 (Thread 182904123744 (LWP 5237))]#0 > 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > #1 0x0000002a95d82368 in usleep () from /lib64/tls/libc.so.6 > #2 0x0000002a9578a05e in cl_thread_suspend (pause_ms=10000) at > cl_thread.c:125 > #3 0x0000000000405ba1 in main () > (gdb) thread 2 > [Switching to thread 2 (Thread 1157658976 (LWP 5267))]#0 0x0000002a95d7fd22 > in poll () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6 > #1 0x0000002a95bbb94d in dev_poll (fd=Variable "fd" is not available. > ) at src/umad.c:775 > #2 0x0000002a95bbba6d in umad_recv (portid=Variable "portid" is not > available. > ) at src/umad.c:805 > #3 0x0000002a959ae68b in umad_receiver (p_ptr=0x5c3000) > at osm_vendor_ibumad.c:266 > #4 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5c3070) at cl_thread.c:61 > #5 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #6 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #7 0x0000000000000000 in ?? () > (gdb) thread 3 > [Switching to thread 3 (Thread 1147169120 (LWP 5266))]#0 0x0000002a9589facf > in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783b4b in cl_event_wait_on (p_event=0x5887b8, > wait_us=10000000, interruptible=1) at cl_event.c:181 > #2 0x000000000043630c in __osm_sm_sweeper () > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x588898) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 4 > [Switching to thread 4 (Thread 1136679264 (LWP 5265))]#0 0x0000002a9589f8da > in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a278, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x000000000044d7a1 in __osm_vl15_poller () > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58a2e8) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 5 > [Switching to thread 5 (Thread 1126189408 (LWP 5264))]#0 0x0000002a9589f8da > in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 6 > [Switching to thread 6 (Thread 1115699552 (LWP 5263))]#0 0x0000002a9589f8da > in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 7 > [Switching to thread 7 (Thread 1105209696 (LWP 5262))]#0 0x0000002a9589f8da > in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 8 > [Switching to thread 8 (Thread 1094719840 (LWP 5261))]#0 0x0000002a9589f8da > in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488) > at cl_threadpool.c:71 > #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61 > #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 9 > [Switching to thread 9 (Thread 1084229984 (LWP 5253))]#0 0x0000002a9589facf > in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9578a9dd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168 > #2 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0 > #3 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6 > #4 0x0000000000000000 in ?? () > (gdb) thread 10 > Thread ID 10 not known. > (gdb) > Thread ID 10 not known. > (gdb) q > The program is running. Quit anyway (and detach it)? (y or n) y > Detaching from program: /usr/bin/opensm, process 5237 > [EMAIL PROTECTED] ~]# > > > > _______________________________________________ > general mailing list > [email protected] > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
