Hi EasyBuilders,
We have an AMD EPYC 7313 node (running AlmaLinux 8.10) with two AMD
Instinct MI210 GPUs:
# rocm-smi --showhw
===================================== ROCm System Management Interface =====================================
========================================== Concise Hardware Info ===========================================
GPU  NODE  DID     GUID   GFX VER  GFX RAS  SDMA RAS  UMC RAS  VBIOS            BUS           PARTITION ID
0    8     0x740f  63484  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:23:00.0  0
1    9     0x740f  36740  gfx9010  ENABLED  ENABLED   ENABLED  113-D67301-064D  0000:83:00.0  0
============================================================================================================
=========================================== End of ROCm SMI Log ============================================
We have installed the ROCm 6.2.2 libraries, and now we need to build our
application, which requires OpenMPI.
We have found out that ROCm 6.2 requires UCC >= 1.3.0 and UCX >= 1.15.0,
see
https://rocm.docs.amd.com/en/docs-6.2.0/compatibility/compatibility-matrix.html
Therefore we need to build OpenMPI-5.0.3-GCC-13.3.0.eb in order to get
supported UCX and UCC versions. Unfortunately the prerequisite
UCC-1.3.0-GCCcore-13.3.0.eb fails to build because this command fails:
$ /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch
Failed to get device count
Question: Does anyone know how to fix the amdgpu-arch command so that it
recognizes the AMD MI210 GPU (shown as GFX VER gfx9010 by rocm-smi, i.e.
LLVM target gfx90a)?
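One possible line of investigation, offered as a hedged sketch rather than a confirmed diagnosis: the rocm-smi output above was taken at a root prompt (#), while amdgpu-arch was run at a user prompt ($). ROCm compute access normally goes through /dev/kfd and the /dev/dri/renderD* nodes, which are usually restricted to members of the render and video groups, so it may be worth checking whether the build user can open them. The check_node helper name is made up for this illustration.

```shell
#!/bin/sh
# Report whether the current user can open a device node read/write.
# check_node is a hypothetical helper; the node paths are the usual ROCm ones
# and are an assumption about this system.
check_node() {
    node="$1"
    if [ ! -e "$node" ]; then
        echo "$node: missing"
    elif [ -r "$node" ] && [ -w "$node" ]; then
        echo "$node: accessible"
    else
        echo "$node: no read/write access for $(id -un)"
    fi
}

check_node /dev/kfd
for n in /dev/dri/renderD*; do
    check_node "$n"
done

# Group memberships of the user running the build (look for render and video).
id -Gn 2>/dev/null || id -G
```

If a node is reported as inaccessible, running amdgpu-arch as root or from an account in those groups would show whether permissions are the cause.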
FYI the UCC build log says:
Making all in kernel
make[4]: Entering directory
'/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm/kernel'
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool"
ec_rocm_executor_kernel.lo /opt/rocm/bin/amdclang -c ec_rocm_executor_kernel.cu
-D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include
-I/opt/rocm/llvm/include -I/opt/rocm/include/hsa -I/opt/rocm/include
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/bin/bash ../../../../../cuda_lt.sh "/bin/sh ../../../../../libtool"
ec_rocm_reduce.lo /opt/rocm/bin/amdclang -c ec_rocm_reduce.cu -D__HIP_PLATFORM_AMD__
-I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include
-I/opt/rocm/include/hsa -I/opt/rocm/include -I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu
--offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940
--offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030
--offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102
--offload-arch=native ec_rocm_reduce.cu -D__HIP_PLATFORM_AMD__
-I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include
-I/opt/rocm/include/hsa -I/opt/rocm/include
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3
-o ./.libs/ec_rocm_reduce.o
/opt/rocm/bin/amdclang -c -x hip -target x86_64-unknown-linux-gnu
--offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx940
--offload-arch=gfx941 --offload-arch=gfx942 --offload-arch=gfx1030
--offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102
--offload-arch=native ec_rocm_executor_kernel.cu -D__HIP_PLATFORM_AMD__
-I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/llvm/include
-I/opt/rocm/include/hsa -I/opt/rocm/include
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src
-I/dev/shm/UCC/1.3.0/GCCcore-13.3.0/ucc-1.3.0/src/components/ec/rocm -fPIC -O3
-o ./.libs/ec_rocm_executor_kernel.o
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
clang: error: cannot determine amdgcn architecture: /opt/rocm-6.2.2/lib/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
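Reading the log, the only flag that needs amdgpu-arch appears to be --offload-arch=native. The explicit list already contains gfx90a, which is the LLVM target name for the MI210. The following is a minimal sketch for reproducing the compile step outside the UCC build with the native entry removed. The drop_native_arch helper and the probe.hip file name are made up for this illustration, and whether UCC's configure script offers a supported way to omit the native entry is not verified here.

```shell
#!/bin/sh
# Print the given compiler flags with every --offload-arch=native entry removed,
# so that clang does not need to call amdgpu-arch.
# drop_native_arch is a hypothetical helper, not part of ROCm or UCC.
drop_native_arch() {
    kept=""
    for flag in "$@"; do
        case "$flag" in
            --offload-arch=native) ;;            # skip: needs a visible GPU
            *) kept="${kept:+$kept }$flag" ;;
        esac
    done
    printf '%s\n' "$kept"
}

flags=$(drop_native_arch --offload-arch=gfx90a --offload-arch=native)
echo "$flags"

# Optional manual check; probe.hip is a hypothetical file containing any trivial HIP kernel.
if command -v /opt/rocm/bin/amdclang >/dev/null 2>&1 && [ -f probe.hip ]; then
    /opt/rocm/bin/amdclang -c -x hip $flags probe.hip -o probe.o
fi
```

If the explicit-arch compile succeeds while the native one fails, the problem would be confined to device detection at build time rather than to the compiler itself.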
Thanks a lot,
Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark