[ewg] /dev/infiniband/rdma_cm not created

2009-05-13 Thread Jeff Squyres
I'm running on rhel4u6 with the 1.4.1 nightly from last night and  
sometimes /dev/infiniband/rdma_cm is not created.  I can see its entry  
in /etc/udev/rules.d/90-ib.rules:


KERNEL=umad*, NAME=infiniband/%k
KERNEL=issm*, NAME=infiniband/%k
KERNEL=ucm*, NAME=infiniband/%k, MODE=0666
KERNEL=uverbs*, NAME=infiniband/%k, MODE=0666
KERNEL=ucma, NAME=infiniband/%k, MODE=0666
KERNEL=rdma_cm, NAME=infiniband/%k, MODE=0666

But only some of these are created:

[11:29] svbu-mpi005:/etc/udev/rules.d % l /dev/infiniband/
total 0
drwxr-xr-x   2 root root  120 May 13 02:39 ./
drwxr-xr-x  10 root root 5740 May 13 09:39 ../
crw-------   1 root root 231,  64 May 13 02:39 issm0
crw-------   1 root root 231,   0 May 13 02:39 umad0
crw-rw-rw-   1 root root 231, 192 May 13 02:39 uverbs0
crw-rw-rw-   1 root root 231, 193 May 13 02:39 uverbs1
[11:29] svbu-mpi005:/etc/udev/rules.d %
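
A minimal sketch of how the comparison above could be scripted: given a
directory and a list of expected node names, print the ones that are absent.
The function name missing_nodes is made up for this illustration, and the
list of names is whatever you expect your udev rules to produce.

```shell
#!/bin/sh
# missing_nodes DIR NAME...
# Print each NAME that has no entry under DIR (hypothetical helper).
missing_nodes() {
    dir=$1
    shift
    for name in "$@"; do
        # -e is true for any existing entry, including character devices.
        [ -e "$dir/$name" ] || echo "$name"
    done
}

# Example call (not executed here):
#   missing_nodes /dev/infiniband issm0 umad0 uverbs0 uverbs1 rdma_cm
```

Against the directory listing shown above, such a call would be expected to
report only rdma_cm.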

I have both an IB HCA and an iWARP RNIC in this server:

hca_id: mthca0
fw_ver: 1.2.917
node_guid:  0005:ad00:0008:bd60
sys_image_guid: 0005:ad00:0100:d050
vendor_id:  0x05ad
vendor_part_id: 25204
hw_ver: 0xA0
board_id:   MT_03B0120002
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:    2048 (4)
active_mtu: 2048 (4)
sm_lid: 2
port_lid:   34
port_lmc:   0x00

hca_id: nes0
node_guid:  0012:5502:b58c:
sys_image_guid: 0012:5502:b58c:
vendor_id:  0x1255
vendor_part_id: 256
hw_ver: 0x5
board_id:   NES020 Board ID
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:    2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid:   1
port_lmc:   0x00

I don't see any obvious errors occurring in syslog or dmesg.

What could cause this failure?

--
Jeff Squyres
Cisco Systems

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


RE: [ewg] /dev/infiniband/rdma_cm not created

2009-05-13 Thread Woodruff, Robert J
Is the driver loaded? I.e., run /sbin/lsmod to see.

Also, are there any messages in the dmesg output that would indicate a
problem?





Re: [ewg] /dev/infiniband/rdma_cm not created

2009-05-13 Thread Jeff Squyres

On May 13, 2009, at 2:39 PM, Woodruff, Robert J wrote:


Is the driver loaded ? ie., do an /sbin/lsmod to see.



Ah ha -- no, it is not:

[11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
[11:51] svbu-mpi005:/etc/udev/rules.d %

What would cause it to not be loaded?  I *assumed* (but didn't check)  
that it is loaded as part of OFED's /etc/init.d/openibd.  Is that  
correct?
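
A small sketch of a stricter version of the check above, assuming a Linux
host where /proc/modules is available (it holds the same data that lsmod
formats). It matches module names exactly, so rdma_cm and rdma_ucm are
reported separately. The function names and the optional file argument are
inventions for this example.

```shell
#!/bin/sh
# module_loaded NAME [FILE]
# Succeed if NAME is the first field of some line in FILE.
# FILE defaults to /proc/modules; the override exists only for testing.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# report_modules [FILE]
# Print the load state of the two RDMA CM modules discussed in this thread.
report_modules() {
    for mod in rdma_cm rdma_ucm; do
        if module_loaded "$mod" "$1"; then
            echo "$mod: loaded"
        else
            echo "$mod: NOT loaded"
        fi
    done
}
```

Running report_modules with no argument checks the live system.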



Also are there any messages that would indicate a
problem when you do a dmesg.



As I indicated in my first mail :-), no.








--
Jeff Squyres
Cisco Systems



Re: [ewg] /dev/infiniband/rdma_cm not created

2009-05-13 Thread Jeff Squyres

On May 13, 2009, at 2:54 PM, Jeff Squyres wrote:


[11:51] svbu-mpi005:/etc/udev/rules.d % /sbin/lsmod | grep rdma
[11:51] svbu-mpi005:/etc/udev/rules.d %

What would cause it to not be loaded?  I *assumed* (but didn't  
check) that it is loaded as part of OFED's /etc/init.d/openibd.  Is  
that correct?



FWIW, I see the following in /etc/infiniband/openibd.conf:

# Start HCA driver upon boot
ONBOOT=yes

#...

# Load RDMA_CM module
RDMA_CM_LOAD=yes

--
Jeff Squyres
Cisco Systems



RE: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created

2009-05-13 Thread Davis, Arlin R
 
FWIW, I see the following in /etc/infiniband/openibd.conf:


# Load RDMA_CM module
RDMA_CM_LOAD=yes


Is RDMA_UCM_LOAD=yes?

What do you see with modinfo rdma_cm rdma_ucm?


Re: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created

2009-05-13 Thread Jeff Squyres

On May 13, 2009, at 3:03 PM, Davis, Arlin R wrote:


FWIW, I see the following in /etc/infiniband/openibd.conf:


# Load RDMA_CM module
RDMA_CM_LOAD=yes

is RDMA_UCM_LOAD=yes ?



Yes, sorry, I didn't see that one the first time around:

# Load RDMA_UCM module
RDMA_UCM_LOAD=yes
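
For reference, a rough sketch of pulling both settings out of the config
file in one go rather than by eye. It assumes the file consists of plain
KEY=value lines and # comments, as in the excerpts quoted in this thread;
conf_value is a hypothetical helper, not part of OFED.

```shell
#!/bin/sh
# conf_value KEY [FILE]
# Print the value assigned to KEY in a KEY=value file. The last assignment
# wins; comment lines never match because their first field is not KEY.
conf_value() {
    awk -F= -v k="$1" '$1 == k { v = $2 } END { if (v != "") print v }' \
        "${2:-/etc/infiniband/openibd.conf}"
}

# Example call (not executed here):
#   for key in ONBOOT RDMA_CM_LOAD RDMA_UCM_LOAD; do
#       echo "$key=$(conf_value "$key")"
#   done
```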


What do you see with modinfo rdma_cm rdma_ucm ?


[r...@svbu-mpi055 ~]# modinfo rdma_cm rdma_ucm
filename:       /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/infiniband/core/rdma_cm.ko
parm:           cma_response_timeout:CMA_CM_RESPONSE_TIMEOUT default=20
parm:           unify_tcp_port_space:Unify the host TCP and RDMA port space allocation (default=0)
parm:           tavor_quirk:Tavor performance quirk: limit MTU to 1K if > 0
license:        Dual BSD/GPL
description:    Generic RDMA CM Agent
author:         Sean Hefty
depends:        ib_addr,ib_cm,iw_cm,ib_core,ib_sa
vermagic:       2.6.9-67.ELsmp SMP gcc-3.4
filename:       /lib/modules/2.6.9-67.ELsmp/updates/kernel/drivers/infiniband/core/rdma_ucm.ko
license:        Dual BSD/GPL
description:    RDMA Userspace Connection Manager Access
author:         Sean Hefty
depends:        rdma_cm,ib_uverbs,ib_core,rdma_cm
vermagic:       2.6.9-67.ELsmp SMP gcc-3.4
[r...@svbu-mpi055 ~]#


--
Jeff Squyres
Cisco Systems



Re: [ofa-general] Re: [ewg] /dev/infiniband/rdma_cm not created

2009-05-13 Thread Jeff Squyres
Ok, I figured it out.  I have some creative
/etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do
nothing if no device is present (or some other esoteric,
specific-to-jeffs-cluster criterion is met) -- they call exit 0 in this
case.  This apparently causes the top-level /etc/init.d/openibd to exit (!).
I've fixed this (they now never call exit); now everything works as
expected.


Upon reflection, I can see that this was totally my fault -- ifcfg-*  
scripts are always sourced and should therefore never call exit.
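
The effect is easy to reproduce outside of OFED. The sketch below is a
stand-alone illustration, not the openibd code: it writes two throwaway
fragments to a temporary directory (the file names are invented), sources
each from a child shell, and records whether the child got past the source
line.

```shell
#!/bin/sh
# Demonstrate that `exit` in a sourced file ends the calling shell,
# while `return` only leaves the sourced file.
tmpdir=$(mktemp -d)

# Fragment that bails out with exit, as the problematic ifcfg-ib* files did.
printf 'exit 0\n' > "$tmpdir/ifcfg-exit"
# Same idea, but using return instead.
printf 'return 0\n' > "$tmpdir/ifcfg-return"

# Source a fragment in a child shell, then try to carry on.
run_parent() {
    sh -c '. "$1"; echo "parent continued"' parent "$1"
}

exit_output=$(run_parent "$tmpdir/ifcfg-exit")
return_output=$(run_parent "$tmpdir/ifcfg-return")

echo "with exit:   '${exit_output}'"
echo "with return: '${return_output}'"
rm -rf "$tmpdir"
```

The first line of output shows an empty string, because the child shell
stopped at the sourced exit; the second shows 'parent continued'.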


But given that /etc/init.d/openibd is sooo complex and has sooo many
moving parts, it would be nice if there were a way to track down
problems a little more easily; perhaps a verbose setting in
/etc/infiniband/openibd.conf, or somesuch.  Indeed, since OFED is targeted
at the datacenter, monitors attached to the servers in question and/or
serial consoles may not be readily available.  Hence, having the
ability to drop some verbose output into syslog during boot, for
example, might be quite useful to sysadmins/network admins when
troubleshooting.
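
To make the suggestion concrete, here is one possible shape for such a
helper. It is only a sketch: OPENIBD_VERBOSE is a hypothetical setting that
openibd.conf does not define, log_step is an invented name, and the tag and
priority passed to logger are arbitrary choices.

```shell
#!/bin/sh
# log_step MESSAGE...
# When the (hypothetical) OPENIBD_VERBOSE setting is "yes", send MESSAGE to
# syslog via logger(1) and echo it to stdout; otherwise do nothing.
log_step() {
    [ "${OPENIBD_VERBOSE:-no}" = "yes" ] || return 0
    if command -v logger >/dev/null 2>&1; then
        # Ignore failures, e.g. when no syslog socket is available.
        logger -t openibd -p daemon.info "$*" 2>/dev/null || true
    fi
    echo "openibd: $*"
}

# Example calls (not executed here):
#   OPENIBD_VERBOSE=yes
#   log_step "sourcing ifcfg-ib0"
#   log_step "loading rdma_cm"
```

A call placed before and after each sourced file would have shown in syslog
that the script never got past the ifcfg-ib* step.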


Just my $0.02.

Thanks for the tips on where to look, Woody!



On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote:


On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:

 Check to see if some other driver failed to load.
 I think I have seen before that if another driver
 fails to load, the start script bails out and
 does not load the other drivers.

 Perhaps try doing a /etc/init.d/openibd restart
 manually to see if something is failing to load.


Weird -- doing it manually shows no problem:

[r...@svbu-mpi055 ~]# /etc/init.d/openibd restart
Unloading HCA driver:  [  OK  ]
Loading HCA driver and Access Layer:   [  OK  ]
Setting up InfiniBand network interfaces:
Bringing up interface ib0: [  OK  ]
Bringing up interface ib1: [  OK  ]
Setting up service network . . .   [  done  ]
[r...@svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
crw-rw-rw-  1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
[r...@svbu-mpi055 ~]#

Something must be going wrong during the bootup.  I'm unfortunately
several thousand miles from the server and don't have a serial
console.  I guess I'll insert some initlog calls into /etc/init.d/openibd...

--
Jeff Squyres
Cisco Systems

___
general mailing list
gene...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



--
Jeff Squyres
Cisco Systems
