We were having trouble running with multiple localities up until recently. If you update to today's top of master, at least the problems coming from HPX itself should go away.
Hi, thanks for the feedback! Unfortunately, the deadlock remains. Any tips on how to debug it? (Admittedly, I haven't yet figured out how to interactively attach gdb to a running multi-node Slurm batch job.)
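My rough plan would be something along these lines, assuming I have ssh access to the compute nodes (job ID, node name, and binary name are placeholders):

    # find out which nodes the job is running on
    squeue -j <jobid> -o "%N"

    # log into one of those nodes and attach to the HPX process
    ssh <nodename>
    gdb -p $(pgrep -f <my_hpx_binary>)

    # inside gdb: dump the stacks of all OS threads
    (gdb) thread apply all bt

No idea yet whether that will reveal where things hang, though.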
Interestingly, the deadlock has never occurred when simulating multiple localities on a single machine by starting multiple processes with mpirun -np. So my guess is a race condition triggered by the additional network latency on the cluster.
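For reference, that single-machine setup looks roughly like this (binary name and parameters are placeholders):

    # four localities on one machine, two worker threads each
    mpirun -np 4 ./<my_hpx_binary> --hpx:threads=2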
However, I was wondering: are there any known issues that might cause a (remote) action invocation to stall or deadlock? I don't know whether this is unusual, but say we have a few hundred (or thousand) components per locality, all of which may communicate with each other asynchronously and in arbitrary patterns. Are there any HPX implementation or concurrency limits we might be hitting?
On a side note: when running on a single node, I sometimes get the following error in certain extreme cases of my benchmarks:
{what}: mmap() failed to allocate thread stack due to insufficient resources, increase /proc/sys/vm/max_map_count or add -Ihpx.stacks.use_guard_pages=0 to the command line: HPX(unhandled_exception)
I assume this means we're running out of memory? I'm just wondering because Slurm usually kills my job outright when I exceed the reserved amount of memory.
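For completeness, the limit the error message refers to can be inspected and raised like this (the new value is just an example):

    # check the current number of allowed memory mappings per process
    sysctl vm.max_map_count

    # raise the limit (requires root)
    sudo sysctl -w vm.max_map_count=1048576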
Best, Tim
