We have installed EasyBuild on another cluster running OpenHPC 1.3 (ohpc-release-1.3-2.el7.x86_64) with CentOS Linux release 7.6.1810. The hardware is AMD EPYC Naples and Rome, and the interconnect is Mellanox ConnectX-5 (Naples) and ConnectX-6 (Rome). The Mellanox drivers on the system seem to be /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/

We have built the EB module GPAW/21.1.0-foss-2020b-ASE-3.21.1 for our production code. The foss/2020b toolchain includes OpenMPI/4.0.5-GCC-10.2.0.

We have a user reporting that GPAW crashes on the AMD Rome nodes, but only when running on multiple nodes, i.e., when using the Mellanox interconnect. Interestingly, the crashes only occur on Rome nodes, not on Naples nodes (I built the EB modules on a Naples login node).

The Mellanox drivers issue lots of UCX warnings, shown below. Eventually the code crashes in MPI collective operations with UCS error messages from many tasks, also shown below.

Questions:

* Are there known issues with OpenMPI/4.0.5-GCC-10.2.0 and the old MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64 driver or an old CentOS 7.6 OS?

* Could the ConnectX-6 adapter on the Rome nodes cause problems because we built foss/2020b on a Naples node with ConnectX-5?

* Would an older foss toolchain be recommended on such an old OpenHPC 1.3 software stack?

Thanks a lot,
Ole


Details from the job output file
--------------------------------

[1624476455.369726] [sn537:2031 :0] uct_iface.c:66 UCX WARN got active message id 0, but no handler installed
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN payload:
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[sn537:2026 :0:2026] ib_mlx5_log.c:139 Local protection on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/4)
[sn537:2026 :0:2026] ib_mlx5_log.c:139 DCI QP 0x77bc wqe[1]: SEND s-e [rqpn 0x6b3a rlid 597] [va 0x2aaadbbfddc0 len 8192 lkey 0xffbf1]

Eventually the code crashes in MPI collective operations with UCS error messages such as:

==== backtrace (tid:   2026) ====
0 0x000000000004ec80 ucs_fatal_error_message() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
1 0x00000000000532b5 ucs_log_default_handler() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
2 0x00000000000533e4 ucs_log_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
3 0x000000000001c793 uct_ib_mlx5_completion_with_err() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
4 0x0000000000039d86 uct_ib_mlx5_poll_cq() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
5 0x0000000000039d86 uct_dc_mlx5_iface_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/dc/dc_mlx5.c:238
6 0x000000000001fcf2 ucs_callbackq_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
7 0x000000000001fcf2 uct_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
8 0x000000000001fcf2 ucp_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
9 0x00000000000036c7 mca_pml_ucx_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/pml/ucx/pml_ucx.c:515
10 0x000000000003682c opal_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/opal/runtime/opal_progress.c:231
11 0x00000000000bf179 wait_completion()  hcoll_collectives.c:0
12 0x000000000001e52e comm_allgather_hcolrte()  ???:0
13 0x00000000000138b7 hmca_bcol_ucx_p2p_init_query.part.4() bcol_ucx_p2p_component.c:0
14 0x00000000000cb86c hmca_bcol_base_init()  ???:0
15 0x000000000004a328 hmca_coll_ml_init_query()  ???:0
16 0x00000000000bff37 hcoll_init_with_opts()  ???:0
17 0x0000000000004ee3 mca_coll_hcoll_comm_query() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
18 0x000000000007881d query_2_0_0() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:449
19 0x000000000007881d query() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:432
20 0x000000000007881d check_one_component() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:394
21 0x000000000007881d check_components() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:344
22 0x000000000007881d mca_coll_base_comm_select() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:126
23 0x00000000000ada2d ompi_mpi_init() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/runtime/ompi_mpi_init.c:957
24 0x000000000006aa29 PMPI_Init() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mpi/c/profile/pinit.c:69
25 0x000000000003bb69 mpi_ensure_initialized() /tmp/eb-hI2Q2G/pip-req-build-dhwtzdzz/c/mpi.c:233
26 0x000000000003bc2a NewMPIObject() /tmp/eb-hI2Q2G/pip-req-build-dhwtzdzz/c/mpi.c:1141
27 0x000000000013b3e6 type_call() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/typeobject.c:974
28 0x000000000013b201 _PyObject_MakeTpCall() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:159
29 0x00000000001387fc _PyEval_EvalFrameDefault() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:125
30 0x00000000001387fc ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:115
31 0x00000000001387fc ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
32 0x00000000001387fc ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3469
33 0x0000000000132821 _PyEval_EvalCodeWithName() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4298
34 0x0000000000132539 PyEval_EvalCodeEx() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4327
35 0x00000000001a553b PyEval_EvalCode() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:718
36 0x00000000001aa8e5 builtin_exec() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/bltinmodule.c:1033
37 0x00000000001aa8e5 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/clinic/bltinmodule.c.h:396
38 0x0000000000140549 cfunction_vectorcall_FASTCALL() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/methodobject.c:422
39 0x00000000001491bd PyVectorcall_Call() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:199
40 0x00000000001391b2 _PyEval_EvalFrameDefault() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4983
41 0x00000000001391b2 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3559
42 0x0000000000132821 _PyEval_EvalCodeWithName() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4298
43 0x0000000000140003 _PyFunction_Vectorcall() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:435
44 0x0000000000134832 _PyEval_EvalFrameDefault() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
45 0x0000000000134832 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
46 0x0000000000134832 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3469
47 0x00000000001402ea function_code_fastcall() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
48 0x0000000000133f99 _PyEval_EvalFrameDefault() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
49 0x0000000000133f99 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
50 0x0000000000133f99 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3486
51 0x00000000001402ea function_code_fastcall() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
52 0x0000000000133da1 _PyEval_EvalFrameDefault() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
53 0x0000000000133da1 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
54 0x0000000000133da1 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3500
55 0x00000000001402ea function_code_fastcall() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
56 0x0000000000133da1 _PyEval_EvalFrameDefault() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
57 0x0000000000133da1 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
58 0x0000000000133da1 ???() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3500
59 0x00000000001402ea function_code_fastcall() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
60 0x000000000013f896 object_vacall() /groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
=================================
[sn537:02026] *** Process received signal ***
[sn537:02026] Signal: Aborted (6)
[sn537:02026] Signal code:  (-6)
[sn537:02026] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2aaaab7c45d0]
[sn537:02026] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaabc0a207]
[sn537:02026] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaaabc0b8f8]
[sn537:02026] [ 3] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x2aaac6d88c85]
[sn537:02026] [ 4] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucs.so.0(+0x532b5)[0x2aaac6d8d2b5]
[sn537:02026] [ 5] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x2aaac6d8d3e4]
[sn537:02026] [ 6] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x2aaac7933793]
[sn537:02026] [ 7] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/ucx/libuct_ib.so.0(+0x39d86)[0x2aaac7950d86]
[sn537:02026] [ 8] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x2aaac68dfcf2]
[sn537:02026] [ 9] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2aaac64b46c7]
[sn537:02026] [10] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aaab4f3582c]
[sn537:02026] [11] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(+0xbf179)[0x2aaad131f179]
[sn537:02026] [12] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(comm_allgather_hcolrte+0xcae)[0x2aaad127e52e]
[sn537:02026] [13] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x138b7)[0x2aaad6ee38b7]
[sn537:02026] [14] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x2aaad132b86c]
[sn537:02026] [15] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x2aaad12aa328]
[sn537:02026] [16] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x2aaad131ff37]
[sn537:02026] [17] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x103)[0x2aaad1054ee3]
[sn537:02026] [18] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x2dd)[0x2aaab463281d]
[sn537:02026] [19] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libmpi.so.40(ompi_mpi_init+0xc6d)[0x2aaab4667a2d]
[sn537:02026] [20] /apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libmpi.so.40(MPI_Init+0x59)[0x2aaab4624a29]
[sn537:02026] [21] /groups/physics/modules/software/GPAW/21.1.0-foss-2020b-ASE-3.21.1/lib/python3.8/site-packages/_gpaw.cpython-38-x86_64-linux-gnu.so(+0x3bb69)[0x2aaab2809b69]
[sn537:02026] [22] /groups/physics/modules/software/GPAW/21.1.0-foss-2020b-ASE-3.21.1/lib/python3.8/site-packages/_gpaw.cpython-38-x86_64-linux-gnu.so(+0x3bc2a)[0x2aaab2809c2a]
[sn537:02026] [23] /groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(+0x13b3e6)[0x2aaaaae0a3e6]
[sn537:02026] [24] /groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x81)[0x2aaaaae0a201]
[sn537:02026] [25] /groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x4dec)[0x2aaaaae077fc]
[sn537:02026] [26] /groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x2e1)[0x2aaaaae01821]
[sn537:02026] [27] /groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x39)[0x2aaaaae01539]
[sn537:02026] [28] /groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(PyEval_EvalCode+0x1b)[0x2aaaaae7453b]
[sn537:02026] [29] /groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(+0x1aa8e5)[0x2aaaaae798e5]
[sn537:02026] *** End of error message ***
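
To cross-check which MPI, UCX and hcoll installations the crashing process actually loaded, the shared-library paths in the signal backtrace above can be listed with a small script. This is only a minimal sketch: the function name loaded_libraries and the regular expression are assumptions based on the frame format seen in this job output, not part of any Open MPI or UCX tool.

```python
import re

# Assumed frame format, taken from the job output above:
#   [host:pid] [ N] /path/to/lib.so(symbol+0xOFF)[0xADDR]
# The pattern keys on the "[ N]" frame index followed by an absolute path.
FRAME = re.compile(r"\[\s*\d+\]\s+(/[^\s(\[]+)")


def loaded_libraries(log_text):
    """Return the distinct library paths named in the backtrace, in first-seen order."""
    seen = []
    for path in FRAME.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python list_libs.py < job_output.txt
    for lib in loaded_libraries(sys.stdin.read()):
        print(lib)
```

Because the pattern keys on the frame index rather than on line breaks, it also works on output where several frames have been wrapped onto one line.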



--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
