We were having trouble running with multiple localities up until recently. If you update to today's top of master, at least the problems coming from HPX itself should go away.
Hi, thanks for the feedback! Unfortunately, the deadlock remains. Any tips on how to debug it? (Admittedly, I haven't yet figured out how to interactively attach gdb to a running multi-node Slurm batch job.)
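My rough plan would be something along these lines, assuming I have ssh access to the compute nodes (job ID, node name, and binary name are placeholders):

    # find out which nodes the job is running on
    squeue -j <jobid> -o "%N"

    # log into one of those nodes and attach to the HPX process
    ssh <nodename>
    gdb -p $(pgrep -f <my_hpx_binary>)

    # inside gdb: dump the stacks of all OS threads
    (gdb) thread apply all bt

No idea yet whether that will reveal where things hang, though.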
Interestingly, the deadlock has never occurred when simulating multiple localities on a single machine by starting multiple processes with mpirun -np. So my guess is a race condition triggered by the additional network latency on the cluster.
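For reference, that single-machine setup looks roughly like this (binary name and parameters are placeholders):

    # four localities on one machine, two worker threads each
    mpirun -np 4 ./<my_hpx_binary> --hpx:threads=2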
However, I was wondering: are there any known issues that might cause a (remote) action invocation to stall or deadlock? I don't know whether this is unusual, but say we have a few hundred (or thousand) components per locality, all of which may communicate with each other asynchronously and in arbitrary patterns. Are there any HPX implementation or concurrency limits we might be hitting?
On a side note: when running on a single node, I sometimes get the following error in certain extreme cases of my benchmarks:
{what}: mmap() failed to allocate thread stack due to insufficient resources, increase /proc/sys/vm/max_map_count or add -Ihpx.stacks.use_guard_pages=0 to the command line: HPX(unhandled_exception)
I assume this means we're running out of memory? I'm just wondering because Slurm usually kills my job outright when I exceed the reserved amount of memory.
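For completeness, the limit the error message refers to can be inspected and raised like this (the new value is just an example):

    # check the current number of allowed memory mappings per process
    sysctl vm.max_map_count

    # raise the limit (requires root)
    sudo sysctl -w vm.max_map_count=1048576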
Best, Tim
