Hi Anna,

Do you have other CUDA machines to try? If you can share your test, I will run it on Polaris@Argonne to see whether it is a petsc/hypre issue. If not, then it must be a GPU-MPI binding problem on TACC.
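For instance, a quick per-rank check (a sketch, assuming mvapich2-gdr exports MV2_COMM_WORLD_LOCAL_RANK and that numactl and nvidia-smi are on the PATH; the script name check-binding.sh is made up) would show whether each rank gets a distinct GPU and the intended CPU set:

#!/bin/bash
# check-binding.sh (hypothetical): report what each MPI rank is bound to.
# Run as: mpirun -n 3 MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./check-binding.sh
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
# Note: nvidia-smi -L ignores CUDA_VISIBLE_DEVICES and lists all physical GPUs;
# the exported variable is what a CUDA program launched from this shell honors.
echo "rank $MV2_COMM_WORLD_LOCAL_RANK: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
numactl --show | grep physcpubind

If every rank reports the same device number, the binding is not taking effect and the failures are more likely the MPI/GPU mapping than hypre itself.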
Thanks
--Junchao Zhang


On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:

> Hi Victor, Junchao,
>
> Thank you for providing the script, it is very useful!
> There are still issues with hypre not binding correctly, and I'm getting
> the error message occasionally (but much less often).
> I added some additional environment variables to the script that seem to
> make the behavior more consistent.
>
> export CUDA_DEVICE_ORDER=PCI_BUS_ID
> export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK  ## as Victor suggested
> export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
>
> The last environment variable is from hypre's documentation on GPUs.
> In 30 runs for a small problem size, 4 fail with a hypre-related error.
> Do you have any other thoughts or suggestions?
>
> Best,
> Anna
>
> ------------------------------
> *From:* Victor Eijkhout <eijkh...@tacc.utexas.edu>
> *Sent:* Thursday, February 1, 2024 11:26 AM
> *To:* Junchao Zhang <junchao.zh...@gmail.com>; Yesypenko, Anna <a...@oden.utexas.edu>
> *Cc:* petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Only for mvapich2-gdr:
>
> #!/bin/bash
> # Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
>
> export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
>
> case $MV2_COMM_WORLD_LOCAL_RANK in
> [0]) cpus=0-3 ;;
> [1]) cpus=64-67 ;;
> [2]) cpus=72-75 ;;
> esac
>
> numactl --physcpubind=$cpus $@
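A small robustness note on the launcher above: the unquoted $@ splits any argument containing spaces, and a local rank outside 0-2 falls through the case statement and runs numactl with an empty --physcpubind. A hardened variant (a sketch; the CPU ranges are copied from the script above and are specific to that node layout):

#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK

case $MV2_COMM_WORLD_LOCAL_RANK in
  0) cpus=0-3 ;;
  1) cpus=64-67 ;;
  2) cpus=72-75 ;;
  *) echo "launch: unexpected local rank '$MV2_COMM_WORLD_LOCAL_RANK'" >&2; exit 1 ;;
esac

# exec replaces the wrapper so signals reach the application directly;
# quoting "$@" preserves arguments that contain spaces.
exec numactl --physcpubind=$cpus "$@"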