Ok, I figured it out. I have some creative
/etc/sysconfig/network-scripts/ifcfg-ib* scripts that may choose to do
nothing if no device is present (or some other esoteric,
specific-to-jeffs-cluster criterion is
met) -- they call exit 0 in this case. This apparently causes the
top-level /etc/init.d/openibd to exit (!). I've fixed this (they now
never call exit); now everything works as expected.
Upon reflection, I can see that this was totally my fault -- ifcfg-*
scripts are always sourced and should therefore never call exit.
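For anyone hitting the same thing, here is a minimal sketch of the
failure mode, assuming a POSIX shell. The file names are throwaway temp
files invented for the demonstration, and parent.sh is only a stand-in
for an init script that sources its config fragments; none of this is
taken from openibd itself.

```shell
#!/bin/sh
# Demonstrates why a sourced file must not call exit.
# All file names below are hypothetical temp files, not real ifcfg scripts.
tmpdir=$(mktemp -d)

# A fragment that bails out with exit when a (nonexistent) device is missing.
cat > "$tmpdir/ifcfg-bad" <<'EOF'
[ -e /nonexistent/device ] || exit 0
EOF

# The same check using return, which only leaves the sourced file.
cat > "$tmpdir/ifcfg-good" <<'EOF'
[ -e /nonexistent/device ] || return 0
EOF

# Stand-in for the init script: source the fragment, then carry on.
cat > "$tmpdir/parent.sh" <<'EOF'
. "$1"
echo "parent reached the end"
EOF

# Run the stand-in once with each fragment and capture what it prints.
bad_output=$(sh "$tmpdir/parent.sh" "$tmpdir/ifcfg-bad")
good_output=$(sh "$tmpdir/parent.sh" "$tmpdir/ifcfg-good")

echo "with exit:   '$bad_output'"
echo "with return: '$good_output'"

rm -rf "$tmpdir"
```

With exit, the stand-in script terminates inside the sourced file and
never prints its final message; with return, control comes back to the
caller and the rest of the script runs.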
But given that /etc/init.d/openibd is sooo complex and has sooo many
moving parts, it would be nice if there were a way to track down
problems a little more easily; perhaps a verbose setting in
/etc/infiniband/openib.conf, or somesuch. Indeed, since OFED is targeted
at the datacenter, monitors attached to the servers in question and/or
serial consoles may not be readily available. Hence, having the
ability to drop some verbose output into syslog during boot, for
example, might be quite useful to sysadmins/network admins when
troubleshooting.
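To make the suggestion concrete, here is a rough sketch of what such a
verbose hook could look like. It is purely illustrative: the VERBOSE
variable and the log_verbose function are hypothetical names, not
settings or helpers that openibd provides today. The only real tool it
relies on is the standard logger(1) utility.

```shell
#!/bin/sh
# Hypothetical sketch: VERBOSE and log_verbose are illustrative names,
# not existing openibd settings or functions.
VERBOSE=yes

log_verbose() {
    # Do nothing unless verbose mode is switched on.
    [ "$VERBOSE" = "yes" ] || return 0
    # logger(1) sends the message to syslog: -t sets the tag,
    # -p sets the facility.priority. Failures are ignored so that
    # logging can never break the boot sequence.
    if command -v logger >/dev/null 2>&1; then
        logger -t openibd -p daemon.info "$*" 2>/dev/null || true
    fi
    # Also echo the message, for anyone who does have a console.
    echo "openibd: $*"
}

# Example call, as it might appear just before sourcing a config file.
msg=$(log_verbose "sourcing ifcfg-ib0")
echo "$msg"
```

Sprinkling calls like this around each driver load and each sourced
file would leave a trail in syslog showing how far the script got
before it stopped.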
Just my $0.02.
Thanks for the tips on where to look, Woody!
On May 13, 2009, at 3:18 PM, Jeff Squyres (jsquyres) wrote:
On May 13, 2009, at 3:12 PM, Woodruff, Robert J wrote:
Check to see if some other driver failed to load.
I think I have seen before that if another driver
fails to load, the start script bails out and
does not load the other drivers.
Perhaps try doing a /etc/init.d/openibd restart
manually to see if something is failing to load.
Weird -- doing it manually shows no problem:
[r...@svbu-mpi055 ~]# /etc/init.d/openibd restart
Unloading HCA driver: [ OK ]
Loading HCA driver and Access Layer: [ OK ]
Setting up InfiniBand network interfaces:
Bringing up interface ib0: [ OK ]
Bringing up interface ib1: [ OK ]
Setting up service network . . . [ done ]
[r...@svbu-mpi055 ~]# ls -l /dev/infiniband/rdma_cm
crw-rw-rw- 1 root root 10, 62 May 13 12:17 /dev/infiniband/rdma_cm
[r...@svbu-mpi055 ~]#
Something must be going wrong during the bootup. I'm unfortunately
several thousand miles from the server and don't have a serial
console. I guess I'll insert some initlog's in /etc/init.d/openibd...
--
Jeff Squyres
Cisco Systems
___
general mailing list
gene...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
--
Jeff Squyres
Cisco Systems