We have installed EasyBuild on another cluster running OpenHPC 1.3
(ohpc-release-1.3-2.el7.x86_64) with CentOS Linux release 7.6.1810. The
hardware is AMD EPYC Naples and Rome, and the interconnect is Mellanox
ConnectX-5 (Naples) and ConnectX-6 (Rome). The Mellanox drivers on the
system seem to be
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/
We have built the EB module GPAW/21.1.0-foss-2020b-ASE-3.21.1 for our
production code. The foss/2020b toolchain includes
OpenMPI/4.0.5-GCC-10.2.0.
We have a user reporting that GPAW crashes on the AMD Rome nodes, but
only when running on multiple nodes, i.e., when using the Mellanox
interconnect. Interestingly, the crashes only occur on Rome nodes, not
on Naples nodes (I built the EB modules on a Naples login node).
The Mellanox drivers issue many UCX warnings, shown below. Eventually
the code crashes in MPI collective operations, with UCS error messages
from many tasks, also shown below.
Questions:
* Are there known issues with OpenMPI/4.0.5-GCC-10.2.0 and the old
MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64 driver or the old CentOS
7.6 OS?
* Could the ConnectX-6 adapter on the Rome nodes cause problems because
we built foss/2020b on a Naples node with ConnectX-5?
* Would an older foss toolchain be recommended on such an old OpenHPC
1.3 software stack?
Thanks a lot,
Ole

Details from the job output file
--------------------------------
[1624476455.369726] [sn537:2031 :0] uct_iface.c:66 UCX WARN got active message id 0, but no handler installed
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN payload:
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1624476455.369729] [sn537:2031 :0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[sn537:2026 :0:2026] ib_mlx5_log.c:139 Local protection on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/4)
[sn537:2026 :0:2026] ib_mlx5_log.c:139 DCI QP 0x77bc wqe[1]: SEND s-e [rqpn 0x6b3a rlid 597] [va 0x2aaadbbfddc0 len 8192 lkey 0xffbf1]
Eventually the code crashes in MPI collective operations with UCS error
messages such as:
==== backtrace (tid: 2026) ====
0 0x000000000004ec80 ucs_fatal_error_message()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
1 0x00000000000532b5 ucs_log_default_handler()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
2 0x00000000000533e4 ucs_log_dispatch()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
3 0x000000000001c793 uct_ib_mlx5_completion_with_err()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
4 0x0000000000039d86 uct_ib_mlx5_poll_cq()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
5 0x0000000000039d86 uct_dc_mlx5_iface_progress()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/dc/dc_mlx5.c:238
6 0x000000000001fcf2 ucs_callbackq_dispatch()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
7 0x000000000001fcf2 uct_worker_progress()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
8 0x000000000001fcf2 ucp_worker_progress()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
9 0x00000000000036c7 mca_pml_ucx_progress()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/pml/ucx/pml_ucx.c:515
10 0x000000000003682c opal_progress()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/opal/runtime/opal_progress.c:231
11 0x00000000000bf179 wait_completion() hcoll_collectives.c:0
12 0x000000000001e52e comm_allgather_hcolrte() ???:0
13 0x00000000000138b7 hmca_bcol_ucx_p2p_init_query.part.4()
bcol_ucx_p2p_component.c:0
14 0x00000000000cb86c hmca_bcol_base_init() ???:0
15 0x000000000004a328 hmca_coll_ml_init_query() ???:0
16 0x00000000000bff37 hcoll_init_with_opts() ???:0
17 0x0000000000004ee3 mca_coll_hcoll_comm_query()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
18 0x000000000007881d query_2_0_0()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:449
19 0x000000000007881d query()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:432
20 0x000000000007881d check_one_component()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:394
21 0x000000000007881d check_components()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:344
22 0x000000000007881d mca_coll_base_comm_select()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mca/coll/base/coll_base_comm_select.c:126
23 0x00000000000ada2d ompi_mpi_init()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/runtime/ompi_mpi_init.c:957
24 0x000000000006aa29 PMPI_Init()
/build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/ompi/mpi/c/profile/pinit.c:69
25 0x000000000003bb69 mpi_ensure_initialized()
/tmp/eb-hI2Q2G/pip-req-build-dhwtzdzz/c/mpi.c:233
26 0x000000000003bc2a NewMPIObject()
/tmp/eb-hI2Q2G/pip-req-build-dhwtzdzz/c/mpi.c:1141
27 0x000000000013b3e6 type_call()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/typeobject.c:974
28 0x000000000013b201 _PyObject_MakeTpCall()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:159
29 0x00000000001387fc _PyEval_EvalFrameDefault()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:125
30 0x00000000001387fc ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:115
31 0x00000000001387fc ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
32 0x00000000001387fc ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3469
33 0x0000000000132821 _PyEval_EvalCodeWithName()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4298
34 0x0000000000132539 PyEval_EvalCodeEx()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4327
35 0x00000000001a553b PyEval_EvalCode()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:718
36 0x00000000001aa8e5 builtin_exec()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/bltinmodule.c:1033
37 0x00000000001aa8e5 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/clinic/bltinmodule.c.h:396
38 0x0000000000140549 cfunction_vectorcall_FASTCALL()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/methodobject.c:422
39 0x00000000001491bd PyVectorcall_Call()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:199
40 0x00000000001391b2 _PyEval_EvalFrameDefault()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4983
41 0x00000000001391b2 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3559
42 0x0000000000132821 _PyEval_EvalCodeWithName()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4298
43 0x0000000000140003 _PyFunction_Vectorcall()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:435
44 0x0000000000134832 _PyEval_EvalFrameDefault()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
45 0x0000000000134832 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
46 0x0000000000134832 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3469
47 0x00000000001402ea function_code_fastcall()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
48 0x0000000000133f99 _PyEval_EvalFrameDefault()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
49 0x0000000000133f99 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
50 0x0000000000133f99 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3486
51 0x00000000001402ea function_code_fastcall()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
52 0x0000000000133da1 _PyEval_EvalFrameDefault()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
53 0x0000000000133da1 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
54 0x0000000000133da1 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3500
55 0x00000000001402ea function_code_fastcall()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
56 0x0000000000133da1 _PyEval_EvalFrameDefault()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
57 0x0000000000133da1 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:4963
58 0x0000000000133da1 ???()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Python/ceval.c:3500
59 0x00000000001402ea function_code_fastcall()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/Objects/call.c:283
60 0x000000000013f896 object_vacall()
/groups/physics/modules/build/Python/3.8.6/GCCcore-10.2.0/Python-3.8.6/./Include/cpython/abstract.h:127
=================================
[sn537:02026] *** Process received signal ***
[sn537:02026] Signal: Aborted (6)
[sn537:02026] Signal code: (-6)
[sn537:02026] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2aaaab7c45d0]
[sn537:02026] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaabc0a207]
[sn537:02026] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaaabc0b8f8]
[sn537:02026] [ 3]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x2aaac6d88c85]
[sn537:02026] [ 4]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucs.so.0(+0x532b5)[0x2aaac6d8d2b5]
[sn537:02026] [ 5]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x2aaac6d8d3e4]
[sn537:02026] [ 6]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x2aaac7933793]
[sn537:02026] [ 7]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/ucx/libuct_ib.so.0(+0x39d86)[0x2aaac7950d86]
[sn537:02026] [ 8]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x2aaac68dfcf2]
[sn537:02026] [ 9]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2aaac64b46c7]
[sn537:02026] [10]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libopen-pal.so.40(opal_progress+0x2c)[0x2aaab4f3582c]
[sn537:02026] [11]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(+0xbf179)[0x2aaad131f179]
[sn537:02026] [12]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(comm_allgather_hcolrte+0xcae)[0x2aaad127e52e]
[sn537:02026] [13]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x138b7)[0x2aaad6ee38b7]
[sn537:02026] [14]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x2aaad132b86c]
[sn537:02026] [15]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x2aaad12aa328]
[sn537:02026] [16]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x2aaad131ff37]
[sn537:02026] [17]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x103)[0x2aaad1054ee3]
[sn537:02026] [18]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x2dd)[0x2aaab463281d]
[sn537:02026] [19]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libmpi.so.40(ompi_mpi_init+0xc6d)[0x2aaab4667a2d]
[sn537:02026] [20]
/apps/external/hpcx/2.5.0/MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ompi/lib/libmpi.so.40(MPI_Init+0x59)[0x2aaab4624a29]
[sn537:02026] [21]
/groups/physics/modules/software/GPAW/21.1.0-foss-2020b-ASE-3.21.1/lib/python3.8/site-packages/_gpaw.cpython-38-x86_64-linux-gnu.so(+0x3bb69)[0x2aaab2809b69]
[sn537:02026] [22]
/groups/physics/modules/software/GPAW/21.1.0-foss-2020b-ASE-3.21.1/lib/python3.8/site-packages/_gpaw.cpython-38-x86_64-linux-gnu.so(+0x3bc2a)[0x2aaab2809c2a]
[sn537:02026] [23]
/groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(+0x13b3e6)[0x2aaaaae0a3e6]
[sn537:02026] [24]
/groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x81)[0x2aaaaae0a201]
[sn537:02026] [25]
/groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x4dec)[0x2aaaaae077fc]
[sn537:02026] [26]
/groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x2e1)[0x2aaaaae01821]
[sn537:02026] [27]
/groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x39)[0x2aaaaae01539]
[sn537:02026] [28]
/groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(PyEval_EvalCode+0x1b)[0x2aaaaae7453b]
[sn537:02026] [29]
/groups/physics/modules/software/Python/3.8.6-GCCcore-10.2.0/lib/libpython3.8.so.1.0(+0x1aa8e5)[0x2aaaaae798e5]
[sn537:02026] *** End of error message ***
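
A signal trace like the one above is easier to read once it is reduced to
the list of shared libraries it passes through. The following is only a
minimal sketch in Python, assuming the glibc-style frame format seen in
this output; the function name libraries_in_trace is hypothetical and not
part of any tool mentioned here.

```python
import re
from collections import Counter

# Shared-object path at the start of a frame such as
# "/path/to/libfoo.so.1(symbol+0x10)[0xdeadbeef]".
FRAME_RE = re.compile(r"^(/\S+?\.so[\w.]*)\(")

# Optional "[host:pid] [ n]" prefix printed in front of each frame.
PREFIX_RE = re.compile(r"^\[[^\]]+\]\s*\[\s*\d+\]\s*")


def libraries_in_trace(lines):
    """Count how often each shared library appears in a signal trace.

    `lines` is any iterable of strings, e.g. an open job output file.
    Frames whose path was wrapped onto the following line are handled
    because the prefix is optional.
    """
    counts = Counter()
    for line in lines:
        text = PREFIX_RE.sub("", line.strip())
        match = FRAME_RE.match(text)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open job output file to libraries_in_trace() and printing the
result of .most_common() lists each library path with the number of
frames in it, which shows from which installation directories the MPI,
UCX and hcoll libraries were loaded at the time of the crash.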
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark