Hi Victor, Junchao,

Thank you for providing the script; it is very useful!
There are still issues with hypre not binding correctly, and I still see the 
error message occasionally (though much less often than before).
I added some additional environment variables to the script that seem to make 
the behavior more consistent.

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

The last environment variable, HYPRE_MEMORY_DEVICE, comes from hypre's 
documentation on GPUs.
Out of 30 runs of a small problem, 4 still fail with a hypre-related error. Do 
you have any other thoughts or suggestions?
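
For reference, here is roughly what the modified launcher looks like (a sketch 
that just merges the exports above into your script quoted below; the CPU 
ranges are the ones from that script and assume three ranks per node):

#!/bin/bash
# Sketch: the mvapich2-gdr launcher below plus the three exports added for device binding.
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

# CPU ranges copied from the quoted script; adjust to the node's core/GPU topology.
case $MV2_COMM_WORLD_LOCAL_RANK in
        [0]) cpus=0-3 ;;
        [1]) cpus=64-67 ;;
        [2]) cpus=72-75 ;;
esac

numactl --physcpubind=$cpus "$@"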

Best,
Anna

________________________________
From: Victor Eijkhout <eijkh...@tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zh...@gmail.com>; Yesypenko, Anna 
<a...@oden.utexas.edu>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a 
node


Only for mvapich2-gdr:

#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK

case $MV2_COMM_WORLD_LOCAL_RANK in
        [0]) cpus=0-3 ;;
        [1]) cpus=64-67 ;;
        [2]) cpus=72-75 ;;
esac

numactl --physcpubind=$cpus "$@"

