Hi Kevin,

On 9/10/20 5:30 PM, Kevin Huck wrote:
OTF2 support assumes a shared parallel filesystem, and that the file doesn’t already exist. All processes also need to know their MPI ranks.
One thing that might be relevant is that I launch multiple (8) processes per cluster node (2 in this case).
a) are you on a shared filesystem? You can change the path to the archive with the APEX_OTF2_ARCHIVE_PATH environment variable. (See http://khuck.github.io/xpress-apex/environment/)
They are on a filesystem that can be seen and written to from all cluster nodes. I changed the path to the archive to a shared *parallel* filesystem, but that didn't help.
b) does the path already exist? Please delete it before the run
Yes, deleting the path before the run was one of the things I tried, but it didn't help. I also tried both relative and absolute paths for APEX_OTF2_ARCHIVE_PATH.
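For reference, the sequence I use before each run looks roughly like this (the archive path is a placeholder for my shared-filesystem location):

```shell
# Point APEX at an archive location on the shared filesystem
# (placeholder path; substitute your own).
export APEX_OTF2_ARCHIVE_PATH=/shared/scratch/$USER/otf2_archive

# OTF2 refuses to write into an existing archive, so clear it first.
rm -rf "$APEX_OTF2_ARCHIVE_PATH"
```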
c) How are you configuring/launching HPX? Is this application launched with mpirun/mpiexec/jsrun/srun?
I use SLURM. Launching a single process works fine; launching more results in the problems I reported. My launch script is attached.
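In essence, the attached script does something like the following (node counts match my setup of 8 processes per node on 2 nodes; the binary name and archive path are placeholders, and I am enabling the OTF2 listener via APEX's environment variables as I understand them from the docs):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Enable APEX's OTF2 output and point it at the shared filesystem
# (placeholder path; substitute your own).
export APEX_OTF2=1
export APEX_OTF2_ARCHIVE_PATH=/shared/scratch/$USER/otf2_archive
rm -rf "$APEX_OTF2_ARCHIVE_PATH"

# Placeholder binary name; srun supplies the ranks.
srun ./my_hpx_app
```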
d) which version of HPX and/or APEX are you using?
HPX 1.5.0, which uses APEX 2.2.0.

Kor

PS: When APEX/OTF2 runs into the errors, the resulting process hangs; without errors, the process finishes. I have had hangs recently as well; no idea whether this is related.
repro-apex.sh
Description: application/shellscript
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
