Glad you figured it out! --Junchao Zhang
On Sun, Feb 4, 2024 at 7:56 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:
> Hi Junchao, Victor,
>
> I fixed the issue! The issue was with the CPU bindings. Python has a
> limitation that it only runs on one core.
> I had to modify the MPI thread launch script to make sure that each python
> instance is bound to only one physical core.
>
> Thank you both very much for your patience and help!
>
> Best,
> Anna
>
> ------------------------------
> *From:* Yesypenko, Anna <a...@oden.utexas.edu>
> *Sent:* Friday, February 2, 2024 2:12 PM
> *To:* Junchao Zhang <junchao.zh...@gmail.com>
> *Cc:* Victor Eijkhout <eijkh...@tacc.utexas.edu>; petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Hi Junchao,
>
> Unfortunately I don't have access to other cuda machines with multiple GPUs.
> I'm pretty stuck, and I think running on a different machine would help isolate the issue.
>
> I'm sharing the python script and the launch script that Victor wrote.
> There is a comment in the launch script with the mpi command I was using to run the python script.
> I configured hypre without unified memory. In case it's useful, I also attached the configure.log.
>
> If the issue is with petsc/hypre, it may be in the environment variables
> described here (e.g. HYPRE_MEMORY_DEVICE):
> https://github.com/hypre-space/hypre/wiki/GPUs
>
> Thank you for helping me troubleshoot this issue!
> Best,
> Anna
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zh...@gmail.com>
> *Sent:* Thursday, February 1, 2024 9:07 PM
> *To:* Yesypenko, Anna <a...@oden.utexas.edu>
> *Cc:* Victor Eijkhout <eijkh...@tacc.utexas.edu>; petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Hi, Anna,
>   Do you have other CUDA machines to try? If you can share your test,
> then I will run on Polaris@Argonne to see if it is a petsc/hypre issue.
> If not, then it must be a GPU-MPI binding problem on TACC.
>
> Thanks
> --Junchao Zhang
>
> On Thu, Feb 1, 2024 at 5:31 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:
>
> Hi Victor, Junchao,
>
> Thank you for providing the script, it is very useful!
> There are still issues with hypre not binding correctly, and I'm getting
> the error message occasionally (but much less often).
> I added some additional environment variables to the script that seem to
> make the behavior more consistent.
>
> export CUDA_DEVICE_ORDER=PCI_BUS_ID
> export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK  ## as Victor suggested
> export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
>
> The last environment variable is from hypre's documentation on GPUs.
> In 30 runs for a small problem size, 4 fail with a hypre-related error.
> Do you have any other thoughts or suggestions?
>
> Best,
> Anna
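A minimal sketch of how the three exports above can be set per rank, assuming an mvapich2 launcher that provides MV2_COMM_WORLD_LOCAL_RANK; the wrapper form, the exec of the wrapped command, and the driver name are illustrative, not part of the thread:

#!/bin/bash
# Per-rank environment setup (sketch): one GPU per local rank.
# HYPRE_MEMORY_DEVICE follows the hypre GPU wiki page linked above.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK
exec "$@"   # e.g. "python solve.py"; the script name is a placeholder

This does not by itself pin each process to a core; the CPU binding that resolved the failures is sketched after Victor's script below.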
> ------------------------------
> *From:* Victor Eijkhout <eijkh...@tacc.utexas.edu>
> *Sent:* Thursday, February 1, 2024 11:26 AM
> *To:* Junchao Zhang <junchao.zh...@gmail.com>; Yesypenko, Anna <a...@oden.utexas.edu>
> *Cc:* petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Only for mvapich2-gdr:
>
> #!/bin/bash
> # Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin
>
> export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
> case $MV2_COMM_WORLD_LOCAL_RANK in
> [0]) cpus=0-3 ;;
> [1]) cpus=64-67 ;;
> [2]) cpus=72-75 ;;
> esac
>
> numactl --physcpubind=$cpus $@
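Combining Victor's wrapper with the fix Anna describes in her final message (each python instance bound to exactly one physical core), a possible launch script is sketched below. The core numbers, the three-rank layout, and the driver name are assumptions carried over from the thread for illustration, not a tested recipe:

#!/bin/bash
# Sketch: per-rank GPU selection plus single-physical-core pinning for mvapich2-gdr.
# Illustrative usage: mpirun -n 3 MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch python driver.py

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

# One physical core per local rank (numbers taken from Victor's core ranges above).
case $MV2_COMM_WORLD_LOCAL_RANK in
  0) core=0  ;;
  1) core=64 ;;
  2) core=72 ;;
esac

# Bind this rank's python instance to that single core.
exec numactl --physcpubind=$core "$@"

Relative to Victor's version, this sketch adds Anna's exports and narrows each rank's binding from a four-core range to a single core; the single-core binding is what Anna reports resolved the hypre failures.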