Hello,
I'm trying to find a bug in our configuration, which causes the IB fabric
or at least the port where the OpenSM is running to crash. I hope someone on
this list has more experience and can help, or give me a hint.
The configuration:
a) HCAs: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB
DDR / 10GigE]; or Voltaire (ibv_devinfo shows board_id: VLT0130010001, fw_ver:
2.3.000)
b) OFED 3.5 rc2
c) OpenSM with DFSSSP routing algorithm running on a compute node (additional
OpenSM on a switch with lower priority)
d) OpenMPI runs are executed with "--mca
btl_openib_ib_path_record_service_level 1"
e) kernel 2.6.32-220.13.1.el6.x86_64
As far as I understand the whole system:
1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the
OpenSM
2. the SA receives the request on QP1
3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a
special service level for the slid/dlid path
4. SA sends the PathRecord back to the OMPI process via umad_send in
libvendor/osm_vendor_ibumad.c
The osm_vendor_send() function builds the MAD packet with the following
attributes:
    /* GS classes */
    umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
                      p_mad_addr->addr_type.gsi.remote_qp,
                      p_mad_addr->addr_type.gsi.service_level,
                      IB_QP1_WELL_KNOWN_Q_KEY);
So the SL is the same as the one used by the OMPI process. The Q_Key matches
the Q_Key on the OMPI process, and remote_qp and dest_lid are correct, too.
Afterwards umad_send(…) is used to send the reply with the PathRecord, and this
send does not work (except for SL=0).
If I look into the MAD before it is sent, it looks like this:
Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120,
timeout_ms=0, retries=3)
at src/umad.c:791
791 if (umaddebug > 1)
(gdb) p *mad
$1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr =
{qpn = 1325427712, qkey = 384,
lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000',
gid_index = 0 '\000',
hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15
times>, flow_label = 0,
pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530
"\002"}
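A note on reading the dump above: the qpn, qkey and lid fields of the umad
address are kept in network byte order, so on the little-endian x86_64 node gdb
prints them byte-swapped. The following is only a minimal sketch for decoding
them by hand. The helper names (swap16, swap32, decode_gdb_addr) and the struct
are hypothetical and not part of libibumad or OpenSM. It assumes the raw values
were read on a little-endian host, where ntohs()/ntohl() amount to a byte swap.

```c
#include <stdint.h>

/* Well-known Q_Key of QP1 (GSI), written in host byte order. */
#define QP1_WELL_KNOWN_QKEY_HOST 0x80010000u

/* Explicit byte swaps; on a little-endian host this is what
 * ntohs()/ntohl() do to a network-order value. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Hypothetical holder for the decoded address fields. */
struct decoded_addr {
    uint32_t qpn;   /* destination QP number */
    uint32_t qkey;  /* Q_Key */
    uint16_t lid;   /* destination LID */
};

/* Convert the raw integers printed by gdb into host-order values. */
static struct decoded_addr decode_gdb_addr(uint32_t qpn_raw,
                                           uint32_t qkey_raw,
                                           uint16_t lid_raw)
{
    struct decoded_addr d;
    d.qpn = swap32(qpn_raw);
    d.qkey = swap32(qkey_raw);
    d.lid = swap16(lid_raw);
    return d;
}
```

Calling decode_gdb_addr(1325427712u, 384u, 4096) with the values from the dump
gives qpn = 0x6C004F, qkey = 0x80010000 and lid = 16, which is consistent with
the Q_Key being the well-known QP1 Q_Key.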
After a short time, the kernel writes the following messages to the log:
Dec 14 01:23:46 rc001 kernel: INFO: task opensm:2499 blocked for more than 120
seconds.
Dec 14 01:23:46 rc001 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 01:23:46 rc001 kernel: opensm D 0000000000000000 0 2499
2498 0x00000000
Dec 14 01:23:46 rc001 kernel: ffff880424bebc38 0000000000000082
0000000000000000 0000000000000000
Dec 14 01:23:46 rc001 kernel: 0000000000000000 ffff8804ffffffff
ffff88042287eec0 0000000031bc502d
Dec 14 01:23:46 rc001 kernel: ffff880427fca678 ffff880424bebfd8
000000000000f4e8 ffff880427fca678
Dec 14 01:23:46 rc001 kernel: Call Trace:
Dec 14 01:23:46 rc001 kernel: [<ffffffff814eddc5>] schedule_timeout+0x215/0x2e0
Dec 14 01:23:46 rc001 kernel: [<ffffffff8109698f>] ? up+0x2f/0x50
Dec 14 01:23:46 rc001 kernel: [<ffffffffa00fb8d2>] ? __mlx4_cmd+0x202/0x300
[mlx4_core]
Dec 14 01:23:46 rc001 kernel: [<ffffffff814eda43>] wait_for_common+0x123/0x180
Dec 14 01:23:46 rc001 kernel: [<ffffffff8105e940>] ?
default_wake_function+0x0/0x20
Dec 14 01:23:46 rc001 kernel: [<ffffffff814edb5d>] wait_for_completion+0x1d/0x20
Dec 14 01:23:46 rc001 kernel: [<ffffffffa0e1913a>]
ib_unregister_mad_agent+0x33a/0x500 [ib_mad]
Dec 14 01:23:46 rc001 kernel: [<ffffffffa0d9f923>]
ib_umad_unreg_agent+0xb3/0xe0 [ib_umad]
Dec 14 01:23:46 rc001 kernel: [<ffffffffa0d9fa37>] ib_umad_ioctl+0x67/0x70
[ib_umad]
Dec 14 01:23:46 rc001 kernel: [<ffffffff81189582>] vfs_ioctl+0x22/0xa0
Dec 14 01:23:46 rc001 kernel: [<ffffffff81141190>] ? unmap_region+0x110/0x130
Dec 14 01:23:46 rc001 kernel: [<ffffffff81189724>] do_vfs_ioctl+0x84/0x580
Dec 14 01:23:46 rc001 kernel: [<ffffffff8113f33e>] ? remove_vma+0x6e/0x90
Dec 14 01:23:46 rc001 kernel: [<ffffffff81141828>] ? do_munmap+0x308/0x3a0
Dec 14 01:23:46 rc001 kernel: [<ffffffff81189ca1>] sys_ioctl+0x81/0xa0
Dec 14 01:23:46 rc001 kernel: [<ffffffff8100b0f2>]
system_call_fastpath+0x16/0x1b
(Even "modprobe mlx4_core enable_qos=Y debug_level=1" does not make any
difference; I get the same output as the one above.)
Neither the output of OpenMPI nor OpenSM's log file shows any useful
information for this problem, even with higher debug levels.
OpenSM does not really respond to Ctrl+C and becomes a zombie process
afterwards, so the only option is to reboot the node.
So right now I'm stuck and have no idea whether the error is in the kernel
driver, the HCA firmware, or something completely different, or whether
umad_send simply does not support SL>0.
A workaround for the moment is to set the SL in the umad_set_addr_net(...) call
to 0.
Please let me know if you need more information, or if I can test something to
give you more insight.
Thank you in advance,
Jens
--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku,
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: [email protected]
--------------------------------