Hi Kevin,

I have never personally seen this specific error.

However, I found this issue, which looks similar to your problem:
Transport retry count exceeded on mlx5_0:1/IB -- 
uct_ib_mlx5_completion_with_err() (openucx/ucx issue #6669)
https://github.com/openucx/ucx/issues/6669

Although it is not solved yet, it might be good to track it and, if it turns 
out to be the same problem, chime in on that issue.

Keep us posted on the progress.

Davide


From: [email protected] <[email protected]> On 
Behalf Of Glenn (Gedaliah) Wolosh
Sent: Friday, July 2, 2021 2:33 PM
To: [email protected]
Cc: Kevin Walsh <[email protected]>
Subject: [EXTERNAL] [easybuild] Infiniband issues with openfoam and paraview.

All,

I installed ParaView/5.8.0-Python-3.8.2-mpi and OpenFOAM/8, both with the 
foss/2020a toolchain.

Running MPI on a single node works fine; however, with multiple nodes it 
crashes as follows:

[node801:25722:0:25722] ib_mlx5_log.c:132  Transport retry count exceeded on 
mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[node801:25722:0:25722] ib_mlx5_log.c:132  DCI QP 0x3685 wqe[20]: SEND s-e 
[rqpn 0x12c88 rlid 1] [va 0x37fd600 len 6711 lkey 0x2c188]
==== backtrace (tid:  25722) ====
 0 0x00000000000214ae ucs_debug_print_backtrace()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x000000000001ff00 uct_ib_mlx5_completion_with_err()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5_log.c:132
 2 0x000000000005982e uct_ib_mlx5_poll_cq()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5.inl:81
 3 0x000000000005982e uct_dc_mlx5_iface_progress()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/dc/dc_mlx5.c:238
 4 0x0000000000027bba ucs_callbackq_dispatch()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
 5 0x0000000000027bba uct_worker_progress()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/api/uct.h:2221
 6 0x0000000000027bba ucp_worker_progress()  
/home/e/easybuild/.local/easybuild/build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
 7 0x0000000000003c27 mca_pml_ucx_progress()  ???:0
 8 0x000000000002da7b opal_progress()  ???:0
 9 0x00000000000550b5 ompi_request_default_wait()  ???:0

The IB cards on these nodes are:

Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

Please note the following:

1) This error does not occur with the ParaView binary package, which uses the 
MPICH MPI bundled with the package
2) On nodes with a ConnectX-5 IB card, it works fine on multiple nodes

Has anybody seen this before?



Glenn (Gedaliah) Wolosh, Ph.D.
Ass't Director Research Software and Cloud Computing
Acad & Research Computing Systems
[email protected] * (973) 596-5437

A Top 100 National University
U.S. News & World Report



