All,

I installed ParaView/5.8.0-Python-3.8.2-mpi and OpenFOAM/8 with the
foss/2020a toolchain.

Running with MPI on a single node works fine; however, on multiple nodes it
crashes as follows:

[node801:25722:0:25722] ib_mlx5_log.c:132  Transport retry count exceeded on 
mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[node801:25722:0:25722] ib_mlx5_log.c:132  DCI QP 0x3685 wqe[20]: SEND s-e 
[rqpn 0x12c88 rlid 1] [va 0x37fd600 len 6711 lkey 0x2c188]
==== backtrace (tid:  25722) ====
 0 0x00000000000214ae ucs_debug_print_backtrace()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x000000000001ff00 uct_ib_mlx5_completion_with_err()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5_log.c:132
 2 0x000000000005982e uct_ib_mlx5_poll_cq()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5.inl:81
 3 0x000000000005982e uct_dc_mlx5_iface_progress()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/dc/dc_mlx5.c:238
 4 0x0000000000027bba ucs_callbackq_dispatch()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
 5 0x0000000000027bba uct_worker_progress()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/api/uct.h:2221
 6 0x0000000000027bba ucp_worker_progress()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
 7 0x0000000000003c27 mca_pml_ucx_progress()  ???:0
 8 0x000000000002da7b opal_progress()  ???:0
 9 0x00000000000550b5 ompi_request_default_wait()  ???:0

The IB cards on these nodes are:

Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

Please note the following:

1) This error does not occur with the ParaView binary package, which uses the
MPICH MPI bundled with the package.
2) On nodes with a ConnectX-5 IB card, it works fine on multiple nodes.
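
Since frame 3 of the backtrace is in UCX's DC transport (uct_dc_mlx5_iface_progress), one way to narrow this down further might be to rerun with UCX restricted to other transports. The following is only a sketch: it assumes Open MPI from foss/2020a is using its UCX PML (as the backtrace suggests), and the rank count, hostfile name, and solver in the launch line are placeholders, not the actual job.

```shell
# Assumption: Open MPI is using UCX, as the mca_pml_ucx_progress frame suggests.
# Restrict UCX to RC, shared memory, and loopback so that DC is not selected.
export UCX_TLS=rc,sm,self

# List the transports UCX detects on this node, if ucx_info is on the PATH.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | grep -i 'transport' || true
fi

# Hypothetical launch line: rank count, hostfile, and solver are placeholders.
# "-x UCX_TLS" asks Open MPI's mpirun to export the variable to all ranks.
LAUNCH="mpirun -np 64 --hostfile hosts -x UCX_TLS simpleFoam -parallel"
echo "$LAUNCH"
```

If the multi-node run completes with this setting on the ConnectX-6 nodes, that would point at the DC transport rather than at ParaView or OpenFOAM.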

Has anybody seen this before?



Glenn (Gedaliah) Wolosh, Ph.D.
Ass't Director Research Software and Cloud Computing
Acad & Research Computing Systems
[email protected] • (973) 596-5437