Hi Kevin, Thank you for your explanation. I now better understand how I should read the Vampir trace.
You say "I assume 1 process per physical node". My trace involved 8 processes on a single node. Maybe that explains the messages Vampir throws at me: >> Event matching irregular - 79624 in total >> Pending messages - 120 in total >> Range violation - 1 in total I will try again with using a single process per node -- using multiple nodes. BTW, being able to trace tasks this way is very useful! I don't know who is / are responsible for the APEX + HPX integration, but I think it is great. Best regards, Kor On 9/17/21 12:09 AM, Kevin Huck wrote: > Kor - > > Sorry I didn’t reply sooner… I’m glad things are working for you now! > > The thread naming is a bit odd, because Vampir changed how they display > the names (from version 8 to version 9, I think), and I haven’t really > tried that hard to make sure that the names accurately reflect the > physical hardware. But the process/thread hierarchy is correct, even if > the naming looks odd. For example: > > CPU thread 1:1 - this is the main thread of the program, although HPX > doesn’t use it to execute tasks. APEX attributes all communication to > this thread. > CPU thread 2:1 - this is the first worker thread spawned by HPX. > CPU thread 4:1 - this is the second worker thread spawned by HPX. APEX > has numbered it oddly, because thread 3 is internal to APEX. > CPU thread 5:1 - etc. > CPU thread 6:1 > CPU thread 7:1 > CPU thread 8:1 > > I hope that explains things. I don’t use “hwloc” or any library like > that to construct a perfectly accurate system hardware hierarchy, > because it hasn’t been worth the effort. For the tracing, I assume 1 > process per physical node, and as long as the OS processes and threads > are annotated, it works. Just wait until you see how I annotate the GPU > threads… 🙃 (it’s actually not that bad) > > Thanks - > Kevin > >> On Sep 6, 2021, at 10:55 AM, Kor de Jong <k.dejo...@uu.nl >> <mailto:k.dejo...@uu.nl>> wrote: >> >> [I sent the message below to the HPX mailinglist and forgot to cc you.] >> >> >> Hi Kevin, >> >> On 9/3/21 6:26 PM, Kevin Huck wrote: >>> Most versions of OTF2 (2.2 and lower, I believe) had an uninitialized >>> variable that sometimes led to this error message and prematurely >>> exited the initialization process, leading to other problems. Which >>> version of OTF2 are you using? >> >> I used 2.2 but now switched to 2.3 and I applied this patch: >> >> --- src/otf2_archive_int.c-org»·2021-09-06 11:27:07.439272261 +0200 >> +++ src/otf2_archive_int.c»·2021-09-06 11:28:15.735032626 +0200 >> @@ -1083,7 +1083,7 @@ >> archive->global_comm_context = globalCommContext; >> archive->local_comm_context = localCommContext; >> >> - OTF2_ErrorCode status; >> + OTF2_ErrorCode status = OTF2_SUCCESS; >> >> /* It is time to create the directories by the root rank. */ >> if ( archive->file_mode == OTF2_FILEMODE_WRITE ) >> >> This got rid of the error message, and a trace is now being generated. >> Great! But I wonder whether the trace is correct. Vampir reports: >> >> Event matching irregular - 79624 in total >> Pending messages - 120 in total >> Range violation - 1 in total >> >> I posted a screenshot of the trace here: >> >> https://surfdrive.surf.nl/files/index.php/s/MWbhZFPv733tgMX >> <https://surfdrive.surf.nl/files/index.php/s/MWbhZFPv733tgMX> >> >> I see 8 nested groups of 6 CPU threads, which is good. The numbering / >> labeling is weird though. Each group of 6 CPUs is a process running on >> a NUMA node. >> >>> It’s possible I have the wrong SLURM environment variables. Could >>> you please do something like the following on your system (with a >>> small test case) and see what you get? >>> `srun <srun arguments> env | grep SLURM` >> >> My goal is to trace a job with HPX 8 processes on a single node. This >> node contains 8 NUMA nodes, each containing 6 real cores. >> >> salloc --partition=allq --nodes=1 --ntasks=8 --cpus-per-task=12 >> --cores-per-socket=6 env | grep SLURM >> >> SLURM_SUBMIT_DIR=/quanta1/home/jong0137/development/project/lue >> SLURM_SUBMIT_HOST=login01.cluster >> SLURM_JOB_ID=3429945 >> SLURM_JOB_NAME=env >> SLURM_JOB_NUM_NODES=1 >> SLURM_JOB_NODELIST=node008 >> SLURM_NODE_ALIASES=(null) >> SLURM_JOB_PARTITION=allq >> SLURM_JOB_CPUS_PER_NODE=96 >> SLURM_JOBID=3429945 >> SLURM_NNODES=1 >> SLURM_NODELIST=node008 >> SLURM_TASKS_PER_NODE=8 >> SLURM_JOB_ACCOUNT=depfg >> SLURM_JOB_QOS=depfg >> SLURM_NTASKS=8 >> SLURM_NPROCS=8 >> SLURM_CPUS_PER_TASK=12 >> SLURM_CLUSTER_NAME=cluster >> >> >> I use mpirun to start my HPX program. Not use if this is useful, but >> these are the MPI variables set: >> >> Each of the 8 processes prints these same values: >> >> OMPI_APP_CTX_NUM_PROCS=8 >> OMPI_COMM_WORLD_LOCAL_SIZE=8 >> OMPI_COMM_WORLD_SIZE=8 >> OMPI_FIRST_RANKS=0 >> OMPI_UNIVERSE_SIZE=8 >> >> These are different per each of the 8 processes: >> >> OMPI_COMM_WORLD_LOCAL_RANK=0 >> OMPI_COMM_WORLD_NODE_RANK=0 >> OMPI_COMM_WORLD_RANK=0 >> >> OMPI_COMM_WORLD_LOCAL_RANK=1 >> OMPI_COMM_WORLD_NODE_RANK=1 >> OMPI_COMM_WORLD_RANK=1 >> >> OMPI_COMM_WORLD_LOCAL_RANK=2 >> OMPI_COMM_WORLD_NODE_RANK=2 >> OMPI_COMM_WORLD_RANK=2 >> >> OMPI_COMM_WORLD_LOCAL_RANK=3 >> OMPI_COMM_WORLD_NODE_RANK=3 >> OMPI_COMM_WORLD_RANK=3 >> >> OMPI_COMM_WORLD_LOCAL_RANK=4 >> OMPI_COMM_WORLD_NODE_RANK=4 >> OMPI_COMM_WORLD_RANK=4 >> >> OMPI_COMM_WORLD_LOCAL_RANK=5 >> OMPI_COMM_WORLD_NODE_RANK=5 >> OMPI_COMM_WORLD_RANK=5 >> >> OMPI_COMM_WORLD_LOCAL_RANK=6 >> OMPI_COMM_WORLD_NODE_RANK=6 >> OMPI_COMM_WORLD_RANK=6 >> >> OMPI_COMM_WORLD_LOCAL_RANK=7 >> OMPI_COMM_WORLD_NODE_RANK=7 >> OMPI_COMM_WORLD_RANK=7 >> >> >> Thanks for looking into this! >> >> Kor >> > > -- > Kevin Huck, PhD > Research Associate / Computer Scientist > OACISS - Oregon Advanced Computing Institute for Science and Society > University of Oregon > kh...@cs.uoregon.edu <mailto:kh...@cs.uoregon.edu> > http://tau.uoregon.edu > http://oaciss.uoregon.edu <http://oaciss.uoregon.edu> > > _______________________________________________ hpx-users mailing list hpx-users@stellar-group.org https://mail.cct.lsu.edu/mailman/listinfo/hpx-users