Hi folks,

I'm trying the devel list to see whether folks here have hit this issue
in testing, as I suspect it's not something many users will have access
to yet.

We have an issue where codes compiled with Open-MPI kill nodes that have
ConnectX-4 and ConnectX-5 cards connected to Mellanox Ethernet switches
and use the mlx5 driver from the latest Mellanox OFED. The kernel hangs
with no oops (or any other error), and we have to power cycle the node to
get it back.

This happens even with a singleton (no srun or mpirun), and from what I
can see in strace before the node hangs, Open-MPI is starting to probe
for what fabrics are available.

The folks I'm helping have engaged Mellanox support but I was wondering
if anyone else had run across this?

Distro: RHEL 7.4 (x86-64)
Kernel: 4.12.9 (needed for the CephFS filesystem they use)
OFED: 4.1-1.0.2.0
Open-MPI: 1.10.x, 2.0.2, 3.0.0

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
