-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 15/11/12 22:02, Bogdan Costescu wrote:
> This is not really a crash... it actually tells you politely that
> it couldn't reach other ranks and terminates. The following lines:
>
>   Process 1 ([[5187,1],1]) is on host: node24
>   Process 2 ([[5187,1],0]) is on host: node32
>   BTLs attempted: self sm
>
> mean that the only BTLs that qualified to continue were self and sm,
> neither of which allows inter-node communication. Very likely tcp
> (which you disabled) was the only inter-node BTL available. So now
> it's up to you to find out why the openib BTL could not be selected...
As Bogdan says, you really need to investigate the IB on those two
nodes to see whether it is working or not.
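
If you want to spot this condition automatically in job logs, here is a
rough Python sketch that pulls the "BTLs attempted:" line out of the
error text and flags the case where nothing in the list can cross
nodes. The function names and the INTRA_NODE_BTLS set are made up for
illustration; the set only contains the two BTLs named in this thread,
so anything else is simply assumed to be inter-node capable.

```python
import re

# Assumption: only the two BTLs discussed in this thread are listed as
# node-local; every other name is treated as potentially inter-node.
INTRA_NODE_BTLS = {"self", "sm"}


def btls_attempted(error_text):
    """Return the BTL names listed after 'BTLs attempted:' (empty if absent)."""
    match = re.search(r"BTLs attempted:\s*(.*)", error_text)
    return match.group(1).split() if match else []


def has_inter_node_btl(btls):
    """True if at least one listed BTL is not in the node-local set."""
    return any(name not in INTRA_NODE_BTLS for name in btls)


# Error text as quoted above.
sample = """\
Process 1 ([[5187,1],1]) is on host: node24
Process 2 ([[5187,1],0]) is on host: node32
BTLs attempted: self sm
"""

attempted = btls_attempted(sample)
print("attempted:", attempted)
print("inter-node BTL present:", has_inter_node_btl(attempted))
```

For the message quoted above this reports no inter-node BTL, which is
exactly the situation Bogdan describes.
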
Running ibstatus is probably a good start, to check that the card is
happily talking to the fabric, e.g.:
[root@merri001 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0007:3d51
        base lid:        0x5c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
There's also ibstat, which gives you somewhat more verbose info.
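
If you have more than a couple of nodes to check, eyeballing this gets
tedious. Below is a minimal Python sketch that parses ibstatus-style
output and reports whether each port looks healthy. It assumes the
output has the same shape as the example above (a header line per
device/port followed by "key: value" lines). parse_ibstatus(),
port_is_up() and read_ibstatus() are names I made up, and treating
"ACTIVE" plus "LinkUp" as healthy is my reading of the example rather
than an official definition.

```python
import subprocess

SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0007:3d51
        base lid:        0x5c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
"""


def parse_ibstatus(text):
    """Parse ibstatus-style text into {(device, port): {field: value}}."""
    ports = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Infiniband device"):
            # Header looks like: Infiniband device 'mlx4_0' port 1 status:
            device = line.split("'")[1]
            port = int(line.split(" port ")[1].split()[0])
            current = ports.setdefault((device, port), {})
        elif current is not None and ":" in line:
            # Split on the first colon only, since values contain colons too.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def port_is_up(fields):
    """Assumed health check: logical state ACTIVE and physical state LinkUp."""
    return (fields.get("state", "").endswith("ACTIVE")
            and fields.get("phys state", "").endswith("LinkUp"))


def read_ibstatus():
    """Run ibstatus on the local node and return its output (not called here)."""
    result = subprocess.run(["ibstatus"], capture_output=True,
                            text=True, check=True)
    return result.stdout


for (device, port), fields in parse_ibstatus(SAMPLE).items():
    status = "OK" if port_is_up(fields) else "CHECK"
    print(device, port, fields.get("rate", "?"), status)
```

On a real node you would feed it read_ibstatus() instead of SAMPLE,
e.g. run over ssh or pdsh on node24 and node32, and look at any port
that comes back as CHECK.
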
cheers,
Chris
- --
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/
iEYEARECAAYFAlCpoK8ACgkQO2KABBYQAh8UawCfeemGfxREQTjInM0KyVz0oUhv
l/sAnjbgSMUfIc3q0cjJ47UZkF2DWoui
=CPT2
-----END PGP SIGNATURE-----
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing