Hi Kevin,

On 9/10/20 5:30 PM, Kevin Huck wrote:
> OTF2 support assumes a shared parallel filesystem, and that the file doesn’t already exist.  All processes also need to know their MPI ranks.

One thing that might be relevant is that I launch multiple processes per cluster node: 8 per node, on 2 nodes in this case.

> a) are you on a shared filesystem?  You can change the path to the archive with the APEX_OTF2_ARCHIVE_PATH environment variable.  (See http://khuck.github.io/xpress-apex/environment/)

Yes: all cluster nodes can see and write to the same filesystem. I also changed the archive path to a shared *parallel* filesystem, but that didn't help.
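For concreteness, this is roughly how I set the variable in the launch script before starting the application (the scratch path below is a placeholder, not our actual directory):

    # Write the OTF2 archive to a directory on the shared parallel filesystem.
    # /scratch/$USER/apex-otf2 is a placeholder path; adjust to your site.
    export APEX_OTF2_ARCHIVE_PATH=/scratch/$USER/apex-otf2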

> b) does the path already exist?  Please delete it before the run

Yes, deleting the archive directory before the run was one of the things I tried, but it didn't help either. I also tried using relative vs. absolute paths for APEX_OTF2_ARCHIVE_PATH.
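To rule out leftovers, the script now removes the archive unconditionally before every run, along these lines (same placeholder path as above):

    # APEX's OTF2 support expects the archive not to exist yet,
    # so clear out any archive left over from a previous run.
    rm -rf "$APEX_OTF2_ARCHIVE_PATH"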

> c) How are you configuring/launching HPX?  Is this application launched with mpirun/mpiexec/jsrun/srun?

I use SLURM. Launching a single process works fine; launching more than one produces the problems I reported. My launch script is attached; a simplified sketch of what it does follows.
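Roughly, the launch boils down to something like this (a minimal sketch only: my_hpx_app stands in for the real binary, and the counts match the 2-node, 8-processes-per-node case above):

    #!/bin/bash
    #SBATCH --nodes=2              # 2 cluster nodes
    #SBATCH --ntasks-per-node=8    # 8 processes per node

    # my_hpx_app is a placeholder for the actual HPX application.
    srun ./my_hpx_app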

> d) which version of HPX and/or APEX are you using?

HPX 1.5.0, which uses APEX 2.2.0.

Kor

PS: When APEX/OTF2 produces errors, the process also hangs instead of exiting; without errors, the process finishes normally. I have seen hangs recently in other runs as well; no idea whether this is related.

Attachment: repro-apex.sh
Description: application/shellscript

