Already tried those, but they didn't help. I have been experimenting with 48x1, 24x2 (MPI ranks x OpenMP threads), etc., and performance degraded for the climate workload.
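For reference, the decompositions I tried look roughly like this (a sketch; ./app stands in for the actual climate executable, and on Fugaku the per-node rank count is set via pjsub's --mpi "max-proc-per-node=..." as in Satish's recipe below):

# 48x1: one MPI rank per core, no threading
export OMP_NUM_THREADS=1
mpiexec -n 48 ./app

# 24x2: two OpenMP threads per rank
export OMP_NUM_THREADS=2 OMP_PROC_BIND=spread OMP_PLACES=threads
mpiexec -n 24 ./app

# 4x12: one rank per CMG (NUMA domain), 12 threads each
export OMP_NUM_THREADS=12 OMP_PROC_BIND=spread OMP_PLACES=threads
mpiexec -n 4 ./app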
________________________________
From: Mark Adams <mfad...@lbl.gov>
Sent: Tuesday, April 20, 2021 8:14:12 PM
To: Sreepathi, Sarat <sa...@ornl.gov>
Cc: petsc-dev <petsc-dev@mcs.anl.gov>; Satish Balay <ba...@mcs.anl.gov>
Subject: Re: [petsc-dev] [EXTERNAL] Re: building on Fugaku

I settled on this:

. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load gcc@10.2.0%gcc@8.3.1 arch=linux-rhel8-a64fx
spack load fujitsu-mpi@4.5.0%gcc@8.3.1 arch=linux-rhel8-a64fx

with these configure options:

'CC=mpicc',
'CXX=mpiCC',
'FC=mpif90',
'COPTFLAGS=-Ofast -march=armv8.2-a+sve -msve-vector-bits=512',
'CXXOPTFLAGS=-Ofast -march=armv8.2-a+sve -msve-vector-bits=512',

The Kokkos folks suggested this and I noticed that it helped:

export OMP_PROC_BIND=spread
export OMP_PLACES=threads

I am getting great thread scaling, but I seem to get no vectorization. I discussed this with Kokkos (Trott) today and he is not surprised: auto-vectorization is fragile.
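A cheap way to check what GCC actually vectorized is its opt report (a sketch with stock GCC flags, untested on Fugaku), e.g. appended to COPTFLAGS:

COPTFLAGS='-Ofast -march=armv8.2-a+sve -msve-vector-bits=512 -fopt-info-vec-optimized -fopt-info-vec-missed'

or grepping the disassembly for SVE's z-register operands (any hits suggest SVE code was emitted; ./app is a placeholder binary):

objdump -d ./app | grep -cE '\bz[0-9]+\.[bhsd]'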
On Mon, Apr 19, 2021 at 8:26 PM Sreepathi, Sarat <sa...@ornl.gov> wrote:

My turn: did you folks figure out tips for performant hybrid MPI+OMP core binding? I tried some from the documentation, but they didn't seem to help.

-Sarat.

From: Sreepathi, Sarat
Sent: Friday, April 16, 2021 3:02 PM
To: Mark Adams <mfad...@lbl.gov>; petsc-dev <petsc-dev@mcs.anl.gov>
Cc: Satish Balay <ba...@mcs.anl.gov>
Subject: RE: [petsc-dev] [EXTERNAL] Re: building on Fugaku

It's 48 cores, but there are 4 NUMA domains (CMGs). So you may want to experiment in hybrid mode (4x12, etc.) if possible.

-Sarat.

From: Mark Adams <mfad...@lbl.gov>
Sent: Friday, April 16, 2021 2:10 PM
To: petsc-dev <petsc-dev@mcs.anl.gov>
Cc: Satish Balay <ba...@mcs.anl.gov>; Sreepathi, Sarat <sa...@ornl.gov>
Subject: Re: [petsc-dev] [EXTERNAL] Re: building on Fugaku

Sarat, is there anything special that you do for Kokkos-OpenMP? Just set OMP_NUM_THREADS=48? Also, I am confused about the number of cores here. Is it 48 or 64 per node/socket?

On Fri, Apr 16, 2021 at 2:03 PM Mark Adams <mfad...@lbl.gov> wrote:

Cool, I have it running too. Need to add Sarat's flags and test ex2.

On Fri, Apr 16, 2021 at 1:57 PM Satish Balay via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

Mark,

The following build works for me:

Satish

----

pjsub --interact -L "node=1" -L "rscunit=rscunit_ft01" -L "elapse=1:00:00" --sparam "wait-time=1200"

. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load fujitsu-mpi%gcc
spack load gcc@10.2.0 arch=linux-rhel8-a64fx

./configure COPTFLAGS='-Ofast -march=armv8.2-a+sve -msve-vector-bits=512' \
  CXXOPTFLAGS='-Ofast -march=armv8.2-a+sve -msve-vector-bits=512' \
  FOPTFLAGS='-Ofast -march=armv8.2-a+sve -msve-vector-bits=512' \
  --with-openmp=1 --download-p4est --download-zlib \
  --download-kokkos --download-kokkos-kernels \
  --download-kokkos-commit=origin/develop \
  --download-kokkos-kernels-commit=origin/develop \
  '--download-kokkos-cmake-arguments=-DBUILD_TESTING=OFF -DKokkos_ENABLE_LIBDL=OFF -DKokkos_ENABLE_AGGRESSIVE_VECTORIZATION=ON' \
  --download-cmake=https://github.com/Kitware/CMake/releases/download/v3.20.1/cmake-3.20.1.tar.gz \
  --download-fblaslapack=1

make PETSC_DIR=/vol0004/ra010009/a04201/petsc.z PETSC_ARCH=arch-linux-c-debug all

To test, redo the job allocation using max-proc-per-node:

login6$ pjsub --interact -L "node=1" -L "rscunit=rscunit_ft01" -L "elapse=1:00:00" --sparam "wait-time=1200" --mpi "max-proc-per-node=16"
[a04201@c31-3201c petsc.z]$ . /vol0004/apps/oss/spack/share/spack/setup-env.sh
[a04201@c31-3201c petsc.z]$ spack load fujitsu-mpi%gcc
[a04201@c31-3201c petsc.z]$ spack load gcc@10.2.0 arch=linux-rhel8-a64fx
[a04201@c31-3201c petsc.z]$ make check
Running check examples to verify correct installation
Using PETSC_DIR=/vol0004/ra010009/a04201/petsc.z and PETSC_ARCH=arch-linux-c-debug
C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
C/C++ example src/snes/tutorials/ex3k run successfully with kokkos-kernels
Fortran example src/snes/tutorials/ex5f run successfully with 1 MPI process
Completed test examples
[a04201@c31-3201c petsc.z]$
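From there, a quick hybrid smoke test of the Kokkos OpenMP backend might look like this (untested sketch; -dm_vec_type kokkos and -dm_mat_type aijkokkos select PETSc's Kokkos Vec/Mat types, and 4x12 puts one rank on each CMG):

login6$ pjsub --interact -L "node=1" -L "rscunit=rscunit_ft01" -L "elapse=1:00:00" --sparam "wait-time=1200" --mpi "max-proc-per-node=4"
$ export OMP_NUM_THREADS=12 OMP_PROC_BIND=spread OMP_PLACES=threads
$ mpiexec -n 4 ./ex19 -dm_vec_type kokkos -dm_mat_type aijkokkos -snes_monitor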