Glad you figured it out! --Junchao Zhang
On Sun, Feb 4, 2024 at 7:56 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:
> Hi Junchao, Victor,
>
> I fixed the issue! The issue was with the CPU bindings. Python has a
> limitation that it only runs on one core.
> I had to modify the MPI thread launch script to make sure that each python
> instance is bound to only one physical core.
>
> Thank you both very much for your patience and help!
>
> Best,
> Anna
>
> ------------------------------
> *From:* Yesypenko, Anna <a...@oden.utexas.edu>
> *Sent:* Friday, February 2, 2024 2:12 PM
> *To:* Junchao Zhang <junchao.zh...@gmail.com>
> *Cc:* Victor Eijkhout <eijkh...@tacc.utexas.edu>; petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Hi Junchao,
>
> Unfortunately I don't have access to other cuda machines with multiple GPUs.
> I'm pretty stuck, and I think running on a different machine would help isolate the issue.
>
> I'm sharing the python script and the launch script that Victor wrote.
> There is a comment in the launch script with the mpi command I was using to run the python script.
> I configured hypre without unified memory. In case it's useful, I also attached the configure.log.
>
> If the issue is with petsc/hypre, it may be in the environment variables
> described here (e.g. HYPRE_MEMORY_DEVICE):
> https://github.com/hypre-space/hypre/wiki/GPUs
>
> Thank you for helping me troubleshoot this issue!
> Best,
> Anna
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zh...@gmail.com>
> *Sent:* Thursday, February 1, 2024 9:07 PM
> *To:* Yesypenko, Anna <a...@oden.utexas.edu>
> *Cc:* Victor Eijkhout <eijkh...@tacc.utexas.edu>; petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Hi, Anna,
>   Do you have other CUDA machines to try? If you can share your test,
> then I will run on Polaris@Argonne to see if it is a petsc/hypre issue.
> If not, then it must be a GPU-MPI binding problem on TACC.
>
> Thanks
> --Junchao Zhang
>
> On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:
>
> Hi Victor, Junchao,
>
> Thank you for providing the script, it is very useful!
> There are still issues with hypre not binding correctly, and I'm getting
> the error message occasionally (but much less often).
> I added some additional environment variables to the script that seem to
> make the behavior more consistent.
>
> export CUDA_DEVICE_ORDER=PCI_BUS_ID
> export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK  ## as Victor suggested
> export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
>
> The last environment variable is from hypre's documentation on GPUs.
> In 30 runs for a small problem size, 4 fail with a hypre-related error.
> Do you have any other thoughts or suggestions?
>
> Best,
> Anna
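A minimal sketch of how the three exports above can be set per rank, assuming an mvapich2 launcher that provides MV2_COMM_WORLD_LOCAL_RANK; the wrapper form, the exec of the wrapped command, and the driver name are illustrative, not part of the thread:

#!/bin/bash
# Per-rank environment setup (sketch): one GPU per local rank.
# HYPRE_MEMORY_DEVICE follows the hypre GPU wiki page linked above.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
exec "$@"   # e.g. "python solve.py"; the script name is a placeholder

This does not by itself pin each process to a core; the CPU binding that resolved the failures is sketched after Victor's script below.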
> ------------------------------
> *From:* Victor Eijkhout <eijkh...@tacc.utexas.edu>
> *Sent:* Thursday, February 1, 2024 11:26 AM
> *To:* Junchao Zhang <junchao.zh...@gmail.com>; Yesypenko, Anna <a...@oden.utexas.edu>
> *Cc:* petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Only for mvapich2-gdr:
>
> #!/bin/bash
> # Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
>
> export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
> case $MV2_COMM_WORLD_LOCAL_RANK in
> [0]) cpus=0-3 ;;
> [1]) cpus=64-67 ;;
> [2]) cpus=72-75 ;;
> esac
>
> numactl --physcpubind=$cpus $@
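Combining Victor's wrapper with the fix Anna describes in her final message (each python instance bound to exactly one physical core), a possible launch script is sketched below. The core numbers, the three-rank layout, and the driver name are assumptions carried over from the thread for illustration, not a tested recipe:

#!/bin/bash
# Sketch: per-rank GPU selection plus single-physical-core pinning for mvapich2-gdr.
# Illustrative usage: mpirun -n 3 MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch python driver.py

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

# One physical core per local rank (numbers taken from Victor's core ranges above).
case $MV2_COMM_WORLD_LOCAL_RANK in
  0) core=0  ;;
  1) core=64 ;;
  2) core=72 ;;
esac

# Bind this rank's python instance to that single core.
exec numactl --physcpubind=$core "$@"

Relative to Victor's version, this sketch adds Anna's exports and narrows each rank's binding from a four-core range to a single core; the single-core binding is what Anna reports resolved the hypre failures.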