I settled on this:

. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load gcc@10.2.0%gcc@8.3.1 arch=linux-rhel8-a64fx
spack load fujitsu-mpi@4.5.0%gcc@8.3.1 arch=linux-rhel8-a64fx
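A quick way to confirm the spack wrappers are the ones picked up (generic commands, nothing Fugaku-specific):

which mpicc gcc
mpicc --version    # should report the underlying gcc (10.2.0)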
and these configure options:

'CC=mpicc',
'CXX=mpiCC',
'FC=mpif90',
'COPTFLAGS=-Ofast -march=armv8.2-a+sve -msve-vector-bits=512',
'CXXOPTFLAGS=-Ofast -march=armv8.2-a+sve -msve-vector-bits=512',

The Kokkos team suggested this and I noticed that it helped:

export OMP_PROC_BIND=spread
export OMP_PLACES=threads

I am getting great thread scaling, but I seem to get no vectorization. I discussed this with the Kokkos team (Trott) today and he is not surprised: auto vectorization is fragile. (One way to check the generated code is sketched at the end of this mail.)

On Mon, Apr 19, 2021 at 8:26 PM Sreepathi, Sarat <sa...@ornl.gov> wrote:

> My turn: did you folks figure out tips for performant hybrid MPI+OMP core
> binding? I tried some from the documentation but that didn’t seem to help.
>
> -Sarat.
>
> *From:* Sreepathi, Sarat
> *Sent:* Friday, April 16, 2021 3:02 PM
> *To:* Mark Adams <mfad...@lbl.gov>; petsc-dev <petsc-dev@mcs.anl.gov>
> *Cc:* Satish Balay <ba...@mcs.anl.gov>
> *Subject:* RE: [petsc-dev] [EXTERNAL] Re: building on Fugaku
>
> It’s 48 cores, but there are 4 NUMA domains (CMGs). So you may want to
> experiment in hybrid mode (4x12 etc.) if possible.
>
> -Sarat.
>
> *From:* Mark Adams <mfad...@lbl.gov>
> *Sent:* Friday, April 16, 2021 2:10 PM
> *To:* petsc-dev <petsc-dev@mcs.anl.gov>
> *Cc:* Satish Balay <ba...@mcs.anl.gov>; Sreepathi, Sarat <sa...@ornl.gov>
> *Subject:* Re: [petsc-dev] [EXTERNAL] Re: building on Fugaku
>
> Sarat, is there anything special that you do for Kokkos - OpenMP?
>
> Just set OMP_NUM_THREADS=48?
>
> Also, I am confused about the number of cores here. Is it 48 or 64 per
> node/socket?
>
> On Fri, Apr 16, 2021 at 2:03 PM Mark Adams <mfad...@lbl.gov> wrote:
>
> Cool, I have it running too. Need to add Sarat's flags and test ex2.
>
> On Fri, Apr 16, 2021 at 1:57 PM Satish Balay via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>
> Mark,
>
> The following build works for me:
>
> Satish
>
> ----
>
> pjsub --interact -L "node=1" -L "rscunit=rscunit_ft01" -L "elapse=1:00:00" --sparam "wait-time=1200"
>
> . /vol0004/apps/oss/spack/share/spack/setup-env.sh
> spack load fujitsu-mpi%gcc
> spack load gcc@10.2.0 arch=linux-rhel8-a64fx
> ./configure COPTFLAGS='-Ofast -march=armv8.2-a+sve -msve-vector-bits=512' \
>   CXXOPTFLAGS='-Ofast -march=armv8.2-a+sve -msve-vector-bits=512' \
>   FOPTFLAGS='-Ofast -march=armv8.2-a+sve -msve-vector-bits=512' \
>   --with-openmp=1 --download-p4est --download-zlib --download-kokkos \
>   --download-kokkos-kernels --download-kokkos-commit=origin/develop \
>   --download-kokkos-kernels-commit=origin/develop \
>   '--download-kokkos-cmake-arguments=-DBUILD_TESTING=OFF -DKokkos_ENABLE_LIBDL=OFF -DKokkos_ENABLE_AGGRESSIVE_VECTORIZATION=ON' \
>   --download-cmake=https://github.com/Kitware/CMake/releases/download/v3.20.1/cmake-3.20.1.tar.gz \
>   --download-fblaslapack=1
> make PETSC_DIR=/vol0004/ra010009/a04201/petsc.z PETSC_ARCH=arch-linux-c-debug all
>
> To test - redo job allocation using max-proc-per-node:
>
> login6$ pjsub --interact -L "node=1" -L "rscunit=rscunit_ft01" -L "elapse=1:00:00" --sparam "wait-time=1200" --mpi "max-proc-per-node=16"
>
> [a04201@c31-3201c petsc.z]$ . /vol0004/apps/oss/spack/share/spack/setup-env.sh
> [a04201@c31-3201c petsc.z]$ spack load fujitsu-mpi%gcc
> [a04201@c31-3201c petsc.z]$ spack load gcc@10.2.0 arch=linux-rhel8-a64fx
> [a04201@c31-3201c petsc.z]$ make check
> Running check examples to verify correct installation
> Using PETSC_DIR=/vol0004/ra010009/a04201/petsc.z and PETSC_ARCH=arch-linux-c-debug
> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
> C/C++ example src/snes/tutorials/ex3k run successfully with kokkos-kernels
> Fortran example src/snes/tutorials/ex5f run successfully with 1 MPI process
> Completed test examples
> [a04201@c31-3201c petsc.z]$
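On the vectorization question above: one rough way to see whether the SVE flags are doing anything is to compile a hot loop on its own with GCC's vectorizer report and grep the disassembly for SVE instructions. kernel.c here is just a stand-in for whatever source file you care about, not something from the PETSc build:

gcc -Ofast -march=armv8.2-a+sve -msve-vector-bits=512 -fopt-info-vec-optimized -c kernel.c
# SVE predication / gather / load-store mnemonics in the object code mean the vectorizer fired
objdump -d kernel.o | grep -E 'ptrue|whilelo|ld1w|st1w'

Adding -fopt-info-vec-missed to the compile line will also report why particular loops were skipped.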
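On Sarat's hybrid MPI+OMP binding question: a sketch of a 4x12 run (one rank per CMG, 12 threads each), untested on my end. The pjsub options follow Satish's lines above; the mpiexec invocation and the assumption that the default rank placement puts one rank on each CMG are mine:

pjsub --interact -L "node=1" -L "rscunit=rscunit_ft01" -L "elapse=1:00:00" --mpi "max-proc-per-node=4"
export OMP_NUM_THREADS=12     # one thread per core within a CMG
export OMP_PROC_BIND=close    # keep each rank's threads on its own CMG
export OMP_PLACES=cores
mpiexec -n 4 ./ex19 ...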