Hi Victor, Junchao,

Thank you for providing the script; it is very useful!
There are still issues with hypre not binding correctly, and I still see the 
error message occasionally (though much less often than before).
I added some additional environment variables to the script that seem to make 
the behavior more consistent.

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK    ## as Victor suggested
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

The last environment variable, HYPRE_MEMORY_DEVICE, comes from hypre's 
documentation on GPUs.
Out of 30 runs of a small problem, 4 still fail with a hypre-related error. Do 
you have any other thoughts or suggestions?
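
For reference, here is roughly what the modified launcher looks like (a sketch 
that just merges the exports above into your script quoted below; the CPU 
ranges are the ones from that script and assume three ranks per node):

#!/bin/bash
# Sketch: the mvapich2-gdr launcher below plus the three exports added for device binding.
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK
export HYPRE_MEMORY_DEVICE=$MV2_COMM_WORLD_LOCAL_RANK

# CPU ranges copied from the quoted script; adjust to the node's core/GPU topology.
case $MV2_COMM_WORLD_LOCAL_RANK in
        [0]) cpus=0-3 ;;
        [1]) cpus=64-67 ;;
        [2]) cpus=72-75 ;;
esac

numactl --physcpubind=$cpus "$@"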

Best,
Anna

________________________________
From: Victor Eijkhout <eijkh...@tacc.utexas.edu>
Sent: Thursday, February 1, 2024 11:26 AM
To: Junchao Zhang <junchao.zh...@gmail.com>; Yesypenko, Anna 
<a...@oden.utexas.edu>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a 
node


Only for mvapich2-gdr:

#!/bin/bash
# Usage: mpirun -n <num_proc> MV2_USE_AFFINITY=0 MV2_ENABLE_AFFINITY=0 ./launch ./bin

export CUDA_VISIBLE_DEVICES=$MV2_COMM_WORLD_LOCAL_RANK

case $MV2_COMM_WORLD_LOCAL_RANK in
        [0]) cpus=0-3 ;;
        [1]) cpus=64-67 ;;
        [2]) cpus=72-75 ;;
esac

numactl --physcpubind=$cpus "$@"

