Hi Kevin,
On 9/3/21 6:26 PM, Kevin Huck wrote:
> Most versions of OTF2 (2.2 and lower, I believe) had an uninitialized
> variable that sometimes led to this error message and prematurely exited
> the initialization process, leading to other problems. Which version of
> OTF2 are you using?
I was using 2.2, but I have now switched to 2.3 and applied this patch:
--- src/otf2_archive_int.c-org	2021-09-06 11:27:07.439272261 +0200
+++ src/otf2_archive_int.c	2021-09-06 11:28:15.735032626 +0200
@@ -1083,7 +1083,7 @@
     archive->global_comm_context = globalCommContext;
     archive->local_comm_context  = localCommContext;
-    OTF2_ErrorCode status;
+    OTF2_ErrorCode status = OTF2_SUCCESS;
     /* It is time to create the directories by the root rank. */
     if ( archive->file_mode == OTF2_FILEMODE_WRITE )
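For anyone else running into this: my understanding of the failure mode (an assumption on my part, simplified and not the actual OTF2 source) is the usual pattern of a status variable that is only assigned on some branches but checked unconditionally afterwards, something like:

/* Minimal, self-contained sketch of the suspected failure mode
   (my assumption; names and logic are illustrative, not OTF2's). */
#include <stdio.h>

typedef enum { STATUS_SUCCESS = 0, STATUS_ERROR = 1 } status_t;

static status_t create_directories( void )
{
    return STATUS_SUCCESS;   /* stand-in for the real directory creation */
}

int main( void )
{
    int file_mode_is_write = 0;      /* pretend the archive was not opened for writing */

    status_t status;                 /* indeterminate value here */
    if ( file_mode_is_write )
    {
        status = create_directories();   /* only assigned on this branch */
    }

    if ( status != STATUS_SUCCESS )  /* may read garbage and report a spurious error */
    {
        fprintf( stderr, "initialisation bails out prematurely\n" );
        return 1;
    }
    return 0;
}

Compilers typically flag this with a "may be used uninitialized" warning, which matches the intermittent behaviour you described; initializing to OTF2_SUCCESS, as in the patch above, makes the check benign.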
This got rid of the error message, and a trace is now being generated.
Great! But I wonder whether the trace is correct. Vampir reports:
Event matching irregular - 79624 in total
Pending messages - 120 in total
Range violation - 1 in total
I posted a screenshot of the trace here:
https://surfdrive.surf.nl/files/index.php/s/MWbhZFPv733tgMX
I see 8 nested groups of 6 CPU threads, which is good. The numbering/labeling looks odd, though: each group of 6 CPUs is one process running on a NUMA node.
> It’s possible I have the wrong SLURM environment variables. Could you
> please do something like the following on your system (with a small test
> case) and see what you get?
>
> `srun <srun arguments> env | grep SLURM`
My goal is to trace a job with 8 HPX processes on a single node. This
node contains 8 NUMA nodes, each with 6 physical cores.
salloc --partition=allq --nodes=1 --ntasks=8 --cpus-per-task=12
--cores-per-socket=6 env | grep SLURM
SLURM_SUBMIT_DIR=/quanta1/home/jong0137/development/project/lue
SLURM_SUBMIT_HOST=login01.cluster
SLURM_JOB_ID=3429945
SLURM_JOB_NAME=env
SLURM_JOB_NUM_NODES=1
SLURM_JOB_NODELIST=node008
SLURM_NODE_ALIASES=(null)
SLURM_JOB_PARTITION=allq
SLURM_JOB_CPUS_PER_NODE=96
SLURM_JOBID=3429945
SLURM_NNODES=1
SLURM_NODELIST=node008
SLURM_TASKS_PER_NODE=8
SLURM_JOB_ACCOUNT=depfg
SLURM_JOB_QOS=depfg
SLURM_NTASKS=8
SLURM_NPROCS=8
SLURM_CPUS_PER_TASK=12
SLURM_CLUSTER_NAME=cluster
I use mpirun to start my HPX program. Not sure if this is useful, but
these are the MPI variables that are set (there is a small sketch after
the listing of how I imagine a tool might pick them up).
Each of the 8 processes prints these same values:
OMPI_APP_CTX_NUM_PROCS=8
OMPI_COMM_WORLD_LOCAL_SIZE=8
OMPI_COMM_WORLD_SIZE=8
OMPI_FIRST_RANKS=0
OMPI_UNIVERSE_SIZE=8
These differ for each of the 8 processes:
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=1
OMPI_COMM_WORLD_NODE_RANK=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=2
OMPI_COMM_WORLD_NODE_RANK=2
OMPI_COMM_WORLD_RANK=2
OMPI_COMM_WORLD_LOCAL_RANK=3
OMPI_COMM_WORLD_NODE_RANK=3
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=4
OMPI_COMM_WORLD_NODE_RANK=4
OMPI_COMM_WORLD_RANK=4
OMPI_COMM_WORLD_LOCAL_RANK=5
OMPI_COMM_WORLD_NODE_RANK=5
OMPI_COMM_WORLD_RANK=5
OMPI_COMM_WORLD_LOCAL_RANK=6
OMPI_COMM_WORLD_NODE_RANK=6
OMPI_COMM_WORLD_RANK=6
OMPI_COMM_WORLD_LOCAL_RANK=7
OMPI_COMM_WORLD_NODE_RANK=7
OMPI_COMM_WORLD_RANK=7
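In case it helps to compare: this is a tiny sketch of how I imagine a tool could derive the rank from these variables. The set of variable names checked, and their order, is purely my assumption; I don't know which ones APEX/OTF2 actually reads.

/* Sketch only: guess the MPI rank from common launcher environment
   variables. The candidate names below are an assumption on my part. */
#include <stdio.h>
#include <stdlib.h>

static int rank_from_env( void )
{
    const char* candidates[] =
    {
        "SLURM_PROCID",           /* set per task by srun */
        "OMPI_COMM_WORLD_RANK",   /* set by Open MPI's mpirun */
        "PMI_RANK"                /* set by some other launchers */
    };
    for ( size_t i = 0; i < sizeof( candidates ) / sizeof( candidates[ 0 ] ); i++ )
    {
        const char* value = getenv( candidates[ i ] );
        if ( value != NULL )
        {
            return atoi( value );
        }
    }
    return 0;   /* fall back to rank 0 if nothing is set */
}

int main( void )
{
    printf( "detected rank: %d\n", rank_from_env() );
    return 0;
}

Since I launch with mpirun rather than srun, only the OMPI_* variables above are set per process; if the tool looks at SLURM variables first, that might explain the odd numbering.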
Thanks for looking into this!
Kor