[ewg] /dev/infiniband/rdma_cm not created
I'm running on rhel4u6 with the 1.4.1 nightly from last night and sometimes /dev/infiniband/rdma_cm is not created. I can see its entry in /etc/udev/rules.d/90-ib.rules:

    KERNEL=umad*, NAME=infiniband/%k
    KERNEL=issm*, NAME=infiniband/%k
    KERNEL=ucm*, NAME=infiniband/%k, MODE=0666
    KERNEL=uverbs*, NAME=infiniband/%k, MODE=0666
    KERNEL=ucma, NAME=infiniband/%k, MODE=0666
    KERNEL=rdma_cm, NAME=infiniband/%k, MODE=0666

But only some of these are created:

    [11:29] svbu-mpi005:/etc/udev/rules.d % l /dev/infiniband/
    total 0
    drwxr-xr-x 2 root root 120 May 13 02:39 ./
    drwxr-xr-x 10 root root 5740 May 13 09:39 ../
    crw--- 1 root root 231, 64 May 13 02:39 issm0
    crw--- 1 root root 231, 0 May 13 02:39 umad0
    crw-rw-rw- 1 root root 231, 192 May 13 02:39 uverbs0
    crw-rw-rw- 1 root root 231, 193 May 13 02:39 uverbs1
    [11:29] svbu-mpi005:/etc/udev/rules.d %

I have both an IB HCA and an iWARP RNIC in this server:

    hca_id: mthca0
        fw_ver: 1.2.917
        node_guid: 0005:ad00:0008:bd60
        sys_image_guid: 0005:ad00:0100:d050
        vendor_id: 0x05ad
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_03B0120002
        phys_port_cnt: 1
            port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 2048 (4)
                active_mtu: 2048 (4)
                sm_lid: 2
                port_lid: 34
                port_lmc: 0x00

    hca_id: nes0
        node_guid: 0012:5502:b58c:
        sys_image_guid: 0012:5502:b58c:
        vendor_id: 0x1255
        vendor_part_id: 256
        hw_ver: 0x5
        board_id: NES020 Board ID
        phys_port_cnt: 1
            port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 2048 (4)
                active_mtu: 2048 (4)
                sm_lid: 0
                port_lid: 1
                port_lmc: 0x00

I don't see any obvious errors occurring in syslog or dmesg. What could cause this failure?

-- 
Jeff Squyres
Cisco Systems
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
RE: [ewg] /dev/infiniband/rdma_cm not created
Is the driver loaded? That is, run /sbin/lsmod to see. Also, are there any messages in dmesg that would indicate a problem?

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Jeff Squyres
Sent: Wednesday, May 13, 2009 11:34 AM
To: OpenFabrics General; OpenFabrics EWG
Subject: [ewg] /dev/infiniband/rdma_cm not created

[snip]
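The lsmod check suggested above can be scripted so that it names exactly which modules are absent. This is only a minimal sketch: the helper name missing_modules is hypothetical, and the module names passed to it (rdma_cm, rdma_ucm, ib_uverbs) are simply the ones that appear in this thread, not an authoritative list of what OFED loads.

```shell
# missing_modules MOD...: read lsmod-style output on stdin and print every
# named module that is NOT present in it (one per line).
# Hypothetical helper for illustration only.
missing_modules() {
    # Column 1 of lsmod output is the module name; line 1 is the header.
    loaded=$(awk 'NR > 1 { print $1 }')
    for mod in "$@"; do
        # -x: whole-line match, -F: fixed string, -q: quiet
        printf '%s\n' "$loaded" | grep -qxF "$mod" || printf '%s\n' "$mod"
    done
}

# Usage on a live system (prints nothing when everything is loaded):
#   /sbin/lsmod | missing_modules rdma_cm rdma_ucm ib_uverbs
```

Matching whole lines avoids the false positives a plain "grep rdma" can give when one module name is a substring of another.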
Re: [ewg] /dev/infiniband/rdma_cm not created
On May 13, 2009, at 2:39 PM, Woodruff, Robert J wrote:

> Is the driver loaded ? ie., do an /sbin/lsmod to see.

Ah ha -- no, it is not:

    [11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
    [11:51] svbu-mpi005:/etc/udev/rules.d %

What would cause it to not be loaded? I *assumed* (but didn't check) that it is loaded as part of OFED's /etc/init.d/openibd. Is that correct?

> Also are there any messages that would indicate a problem when you do a dmesg.

As I indicated in my first mail :-), no.

-- 
Jeff Squyres
Cisco Systems
Re: [ewg] /dev/infiniband/rdma_cm not created
On May 13, 2009, at 2:54 PM, Jeff Squyres wrote:

>     [11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
>     [11:51] svbu-mpi005:/etc/udev/rules.d %
>
> What would cause it to not be loaded? I *assumed* (but didn't check) that it is loaded as part of OFED's /etc/init.d/openibd. Is that correct?

FWIW, I see the following in /etc/infiniband/openibd.conf:

    # Start HCA driver upon boot
    ONBOOT=yes
    #...
    # Load RDMA_CM module
    RDMA_CM_LOAD=yes

-- 
Jeff Squyres
Cisco Systems
RE: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
> FWIW, I see the following in /etc/infiniband/openibd.conf:
>
>     # Load RDMA_CM module
>     RDMA_CM_LOAD=yes

Is RDMA_UCM_LOAD=yes?

What do you see with modinfo rdma_cm rdma_ucm?
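Both settings asked about here can be read out of the config file in one step. The following is a sketch under the assumption that the file is a plain shell fragment of NAME=value lines, as the excerpt from /etc/infiniband/openibd.conf suggests; the function name check_load_flags is hypothetical.

```shell
# check_load_flags FILE: print the values of RDMA_CM_LOAD and RDMA_UCM_LOAD
# as set in FILE, or "unset" when a variable is missing.
# Hypothetical helper; assumes FILE contains shell-style NAME=value lines.
check_load_flags() {
    conf=$1
    (
        # Work in a subshell so nothing the file does leaks into the caller.
        RDMA_CM_LOAD= RDMA_UCM_LOAD=
        . "$conf" || exit 1
        for name in RDMA_CM_LOAD RDMA_UCM_LOAD; do
            eval "value=\$$name"
            printf '%s=%s\n' "$name" "${value:-unset}"
        done
    )
}

# Usage on a live system:
#   check_load_flags /etc/infiniband/openibd.conf
```

Sourcing inside a subshell is a deliberate choice: anything the file executes, including an exit, stays contained there.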
Re: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
On May 13, 2009, at 3:03 PM, Davis, Arlin R wrote:

>> FWIW, I see the following in /etc/infiniband/openibd.conf:
>>
>>     # Load RDMA_CM module
>>     RDMA_CM_LOAD=yes
>
> is RDMA_UCM_LOAD=yes ?

Yes, sorry I didn't see that one first time around:

    # Load RDMA_UCM module
    RDMA_UCM_LOAD=yes

> What do you see with modinfo rdma_cm rdma_ucm ?

    [r...@svbu-mpi055 ~]# modinfo rdma_cm rdma_ucm
    filename: /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/infiniband/core/rdma_cm.ko
    parm: cma_response_timeout:CMA_CM_RESPONSE_TIMEOUT default=20
    parm: unify_tcp_port_space:Unify the host TCP and RDMA port space allocation (default=0)
    parm: tavor_quirk:Tavor performance quirk: limit MTU to 1K if 0
    license: Dual BSD/GPL
    description: Generic RDMA CM Agent
    author: Sean Hefty
    depends: ib_addr,ib_cm,iw_cm,ib_core,ib_sa
    vermagic: 2.6.9-67.ELsmp SMP gcc-3.4
    filename: /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/infiniband/core/rdma_ucm.ko
    license: Dual BSD/GPL
    description: RDMA Userspace Connection Manager Access
    author: Sean Hefty
    depends: rdma_cm,ib_uverbs,ib_core,rdma_cm
    vermagic: 2.6.9-67.ELsmp SMP gcc-3.4
    [r...@svbu-mpi055 ~]#

-- 
Jeff Squyres
Cisco Systems
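The depends: lines in the modinfo output show which modules must be present before rdma_cm and rdma_ucm can load. As a rough sketch (module_deps is a hypothetical name, and the parsing assumes the "depends: a,b,c" layout shown in this thread), they can be turned into a de-duplicated list for further checks:

```shell
# module_deps: read modinfo-style output on stdin and print each module named
# on a "depends:" line once, one per line, in first-seen order.
# Hypothetical helper for illustration only.
module_deps() {
    awk -F: '
        $1 == "depends" {
            gsub(/[ \t]/, "", $2)          # drop padding after the colon
            n = split($2, dep, ",")
            for (i = 1; i <= n; i++)
                if (dep[i] != "" && !seen[dep[i]]++)
                    print dep[i]
        }'
}

# Usage on a live system:
#   modinfo rdma_ucm | module_deps
```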
Re: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created
Ok, I figured it out.

I have some creative /etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do nothing if no device is present (or some other esoteric, specific-to-jeffs-cluster criterion is met) -- they call exit 0 in this case. This apparently causes the top-level /etc/init.d/openibd to exit (!). I've fixed this (they now never call exit); now everything works as expected.

Upon reflection, I can see that this was totally my fault -- ifcfg-* scripts are always sourced and should therefore never call exit.

But given that /etc/init.d/openibd is sooo complex and has sooo many moving parts, it would be nice if there were a way to track down problems a little more easily; perhaps a verbose setting in /etc/infiniband/openibd.conf, or somesuch. Indeed, since OFED is targeted at the datacenter, monitors attached to the servers in question and/or serial consoles may not be readily available. Hence, having the ability to drop some verbose output into syslog during boot, for example, might be quite useful to sysadmins/network admins when troubleshooting.

Just my $0.02.

Thanks for the tips on where to look, Woody!

On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote:

> On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:
>
>> Check to see if some other driver failed to load. I think I have seen before that if another driver fails to load, the start script bails out and does not load the other drivers. Perhaps try doing a /etc/init.d/openibd restart manually to see if something is failing to load.
>
> Weird -- doing it manually shows no problem:
>
>     [r...@svbu-mpi055 ~]# /etc/init.d/openibd restart
>     Unloading HCA driver: [ OK ]
>     Loading HCA driver and Access Layer: [ OK ]
>     Setting up InfiniBand network interfaces:
>     Bringing up interface ib0: [ OK ]
>     Bringing up interface ib1: [ OK ]
>     Setting up service network . . . [ done ]
>     [r...@svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
>     crw-rw-rw- 1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
>     [r...@svbu-mpi055 ~]#
>
> Something must be going wrong during the bootup. I'm unfortunately several thousand miles from the server and don't have a serial console. I guess I'll insert some initlog's in /etc/init.d/openibd...
>
> -- 
> Jeff Squyres
> Cisco Systems
> ___
> general mailing list
> gene...@lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-- 
Jeff Squyres
Cisco Systems
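The root cause described in this message is easy to reproduce in isolation: a file pulled in with "." runs inside the caller's shell, so exit ends the caller, whereas return only ends the sourced file. The sketch below uses throwaway temporary files as stand-ins for an ifcfg-* fragment and for the init script; the function name and file names are hypothetical, and nothing here is taken from the actual openibd script.

```shell
# demo_sourced_exit KEYWORD: write a one-line fragment containing
# "KEYWORD 0", source it from a small parent script, and run that parent.
# The parent prints "parent continued" only if it survives the sourcing.
# Hypothetical demo; the temp files stand in for ifcfg-* and the init script.
demo_sourced_exit() {
    keyword=$1                          # "exit" or "return"
    dir=$(mktemp -d) || return 1
    printf '%s 0\n' "$keyword" > "$dir/fragment"
    # Unquoted heredoc: $dir is expanded now, so the parent gets a fixed path.
    cat > "$dir/parent" <<EOF
. "$dir/fragment"
echo "parent continued"
EOF
    sh "$dir/parent"
    rm -rf "$dir"
}

demo_sourced_exit exit      # prints nothing: exit terminated the parent too
demo_sourced_exit return    # prints "parent continued"
```

If a sourced fragment cannot be trusted to avoid exit, sourcing it inside a subshell, as in ( . file ), contains the damage, at the cost of losing any variables the fragment sets.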