On Wednesday, 22 February 2017 22:02:57 CET Tim Biedert wrote:
> > We were having troubles when running with multiple localities up until
> > recently. If you update to today's top of master, at least the problems
> > coming from HPX directly should go away.
> 
> Hi, thanks for the feedback!  Unfortunately, the deadlock remains.
> Any tips on how to debug that?  (Admittedly, I haven't figured out yet how
> to interactively attach gdb to a running Slurm multi-node batch job.)

If you are able to log on to a node via ssh, then this should be trivial when 
using --hpx:attach-debugger=exception. Other options are gdbserver or a proper 
parallel debugger like Allinea DDT. Nowadays I use gdbserver + Qt Creator on 
my small-scale development system and mostly DDT on larger machines.
When your program hangs, you can look into the thread manager's scheduler 
(accessible via the runtime pointer) and into its different pools, where 
thread_map_ holds the HPX threads that are currently alive. This probably 
sounds very confusing right now ... I really need to write that up properly 
at some point ...
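
For the plain gdb route, something along these lines usually works (just a 
rough sketch; the binary name my_hpx_app, the node name and the node count 
are placeholders):

  # launch so that each locality should print its PID and wait for a
  # debugger once an exception is thrown (for a pure hang you can skip
  # the flag and simply attach to the running process):
  srun -N 4 ./my_hpx_app --hpx:attach-debugger=exception

  # in a second shell: find the nodes the job runs on, ssh to one of
  # them and attach gdb to the HPX process there
  squeue -u $USER -o "%i %N"
  ssh <node-name>
  gdb -p $(pgrep -u $USER my_hpx_app)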

> 
> Interestingly, the deadlock has never occurred when simulating multiple
> localities on a single machine by starting multiple processes using
> mpirun -np.  So it's probably a race condition influenced by additional
> network latency on the cluster, I guess.

Could be. Distributed, asynchronous communication is hard and very tricky to 
debug. First of all, you should try to figure out which part of your code 
blocks (cout debugging, printing the necessary information: which locality am 
I on, where did I come from, and so on).
Timing is usually crucial for those deadlocks to appear ...

Sometimes, running with --hpx:debug-hpx-log helps as well.
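
For example (again just a sketch; the binary name and the node/rank counts are 
placeholders), keeping one output file per rank makes it easier to see what 
the hanging locality was doing last:

  srun -N 2 -n 2 --output=hpx_log.%t.txt ./my_hpx_app --hpx:debug-hpx-log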

> 
> However, I was wondering:  Are there any known issues which might cause
> a (remote) action invocation to stall/deadlock?  I don't know if
> that's anything special, but let's say we have a few hundred (or thousand)
> components per locality, which can all communicate
> wildly/asynchronously.  Are there any HPX implementation/concurrency
> limits we might reach?

Unlikely. It might just mean that your code is very slow and eventually 
finishes (maybe let it run overnight?).

> 
> 
> 
> On a side note:  When running on a single node, sometimes I get the
> following error for specific extreme cases in my benchmarks:
> 
> {what}: mmap() failed to allocate thread stack due to insufficient
> resources, increase /proc/sys/vm/max_map_count or add
> -Ihpx.stacks.use_guard_pages=0 to the command line: HPX(unhandled_exception)
> 
> I assume it means out of memory. I'm just wondering, because usually
> Slurm kills my job if I exceed the reserved memory amount.

It's not exactly out of memory. If you pass the suggested command line 
parameter you should be fine (note to self: add documentation about this).
It's mostly a limitation of the Linux kernel, which puts a rather small limit 
on the number of memory mappings a process may have (that is what 
vm.max_map_count controls); the guard pages around each HPX thread stack 
require additional mappings.
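
If you prefer to raise the kernel limit instead of disabling the guard pages, 
something like this should do (sketch only; the new value and the binary name 
are just examples):

  cat /proc/sys/vm/max_map_count           # show the current limit
  sudo sysctl -w vm.max_map_count=1048576  # raise it (needs root)

  # or, as the error message suggests, avoid the extra guard-page mappings:
  srun ./my_hpx_app -Ihpx.stacks.use_guard_pages=0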

> 
> 
> Best,
> Tim

