Hi Kevin,

On 9/3/21 6:26 PM, Kevin Huck wrote:
> Most versions of OTF2 (2.2 and lower, I believe) had an uninitialized 
> variable that sometimes led to this error message and prematurely exited 
> the initialization process, leading to other problems.  Which version of 
> OTF2 are you using?

I was using 2.2 but have now switched to 2.3, and I applied this patch:

--- src/otf2_archive_int.c-org	2021-09-06 11:27:07.439272261 +0200
+++ src/otf2_archive_int.c	2021-09-06 11:28:15.735032626 +0200
@@ -1083,7 +1083,7 @@
       archive->global_comm_context  = globalCommContext;
       archive->local_comm_context   = localCommContext;

-    OTF2_ErrorCode status;
+    OTF2_ErrorCode status = OTF2_SUCCESS;

       /* It is time to create the directories by the root rank. */
       if ( archive->file_mode == OTF2_FILEMODE_WRITE )
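
For my own understanding: I think the unpatched code hits the classic 
pattern where an automatic variable is only assigned on some paths and 
then read unconditionally. The example below is purely an illustration 
of that pattern (not the actual OTF2 control flow); initializing the 
variable, as the patch does, makes the later read well-defined.

/* Purely illustrative, standalone example of the pattern -- not the
 * actual OTF2 code. */
#include <stdio.h>

typedef enum { OTF2_SUCCESS = 0 } OTF2_ErrorCode;

static OTF2_ErrorCode
do_work( void )
{
    return OTF2_SUCCESS;
}

static OTF2_ErrorCode
init( int write_mode )
{
    OTF2_ErrorCode status;        /* indeterminate value here */

    if ( write_mode )
    {
        status = do_work();       /* only assigned on this path */
    }

    /* When write_mode == 0, this reads an uninitialized value, so the
     * caller may see a spurious error and bail out early. */
    return status;
}

int
main( void )
{
    printf( "status = %d\n", init( 0 ) );
    return 0;
}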

Applying the patch got rid of the error message, and a trace is now 
being generated. Great! But I wonder whether the trace is correct. 
Vampir reports:

Event matching irregular - 79624 in total
Pending messages - 120 in total
Range violation - 1 in total

I posted a screenshot of the trace here:

https://surfdrive.surf.nl/files/index.php/s/MWbhZFPv733tgMX

I see 8 nested groups of 6 CPU threads, which is good. The numbering / 
labeling looks odd, though. Each group of 6 CPUs is one process running 
on a NUMA node.

>   It’s possible I have the wrong SLURM environment variables. Could you 
> please do something like the following on your system (with a small test 
> case) and see what you get?
> 
> `srun <srun arguments> env | grep SLURM`

My goal is to trace a job with 8 HPX processes on a single node. This 
node contains 8 NUMA nodes, each with 6 physical cores.

salloc --partition=allq --nodes=1 --ntasks=8 --cpus-per-task=12 \
    --cores-per-socket=6 env | grep SLURM

SLURM_SUBMIT_DIR=/quanta1/home/jong0137/development/project/lue
SLURM_SUBMIT_HOST=login01.cluster
SLURM_JOB_ID=3429945
SLURM_JOB_NAME=env
SLURM_JOB_NUM_NODES=1
SLURM_JOB_NODELIST=node008
SLURM_NODE_ALIASES=(null)
SLURM_JOB_PARTITION=allq
SLURM_JOB_CPUS_PER_NODE=96
SLURM_JOBID=3429945
SLURM_NNODES=1
SLURM_NODELIST=node008
SLURM_TASKS_PER_NODE=8
SLURM_JOB_ACCOUNT=depfg
SLURM_JOB_QOS=depfg
SLURM_NTASKS=8
SLURM_NPROCS=8
SLURM_CPUS_PER_TASK=12
SLURM_CLUSTER_NAME=cluster


I use mpirun to start my HPX program. Not sure if this is useful, but 
these are the Open MPI environment variables that are set:

Each of the 8 processes prints these same values:

OMPI_APP_CTX_NUM_PROCS=8
OMPI_COMM_WORLD_LOCAL_SIZE=8
OMPI_COMM_WORLD_SIZE=8
OMPI_FIRST_RANKS=0
OMPI_UNIVERSE_SIZE=8

These are different for each of the 8 processes:

OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_COMM_WORLD_RANK=0

OMPI_COMM_WORLD_LOCAL_RANK=1
OMPI_COMM_WORLD_NODE_RANK=1
OMPI_COMM_WORLD_RANK=1

OMPI_COMM_WORLD_LOCAL_RANK=2
OMPI_COMM_WORLD_NODE_RANK=2
OMPI_COMM_WORLD_RANK=2

OMPI_COMM_WORLD_LOCAL_RANK=3
OMPI_COMM_WORLD_NODE_RANK=3
OMPI_COMM_WORLD_RANK=3

OMPI_COMM_WORLD_LOCAL_RANK=4
OMPI_COMM_WORLD_NODE_RANK=4
OMPI_COMM_WORLD_RANK=4

OMPI_COMM_WORLD_LOCAL_RANK=5
OMPI_COMM_WORLD_NODE_RANK=5
OMPI_COMM_WORLD_RANK=5

OMPI_COMM_WORLD_LOCAL_RANK=6
OMPI_COMM_WORLD_NODE_RANK=6
OMPI_COMM_WORLD_RANK=6

OMPI_COMM_WORLD_LOCAL_RANK=7
OMPI_COMM_WORLD_NODE_RANK=7
OMPI_COMM_WORLD_RANK=7
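
In case it is useful for checking the environment handling: below is a 
rough sketch (my guess only, not APEX's actual code) of how a 
rank-detection fallback could look, trying SLURM's per-task variable 
first and then Open MPI's. I notice that SLURM_PROCID does not show up 
in the output above, while the OMPI_COMM_WORLD_RANK values look sane, 
so if the detection relies on the SLURM variables alone, that might 
matter here, since I start the processes with mpirun.

/* Rough sketch of a rank-detection fallback -- my guess only, not
 * what APEX actually does. */
#include <stdio.h>
#include <stdlib.h>

static int
detect_rank( void )
{
    /* Per-task rank set by srun. */
    const char* value = getenv( "SLURM_PROCID" );
    if ( value != NULL )
    {
        return atoi( value );
    }

    /* Per-process rank set by Open MPI's mpirun. */
    value = getenv( "OMPI_COMM_WORLD_RANK" );
    if ( value != NULL )
    {
        return atoi( value );
    }

    /* Neither launcher variable present: assume a single process. */
    return 0;
}

int
main( void )
{
    printf( "detected rank: %d\n", detect_rank() );
    return 0;
}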


Thanks for looking into this!

Kor
