On Wednesday, 22 February 2017 22:02:57 CET Tim Biedert wrote:
> > We were having trouble when running with multiple localities up until
> > recently. If you update to today's top of master, at least the problems
> > coming from HPX directly should go away.
> 
> Hi, thanks for the feedback!  Unfortunately, the deadlock remains.
> Any tips on how to debug that?  (Admittedly, I haven't yet figured out
> how to interactively attach gdb to a running Slurm multi-node batch job.)
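
As for attaching gdb: a rough, non-HPX-specific sketch (node name, binary
name, and PID are placeholders, and it assumes your cluster allows ssh to
nodes of your allocation) is to find the compute node with squeue, ssh
into it, and attach to the running process:

  squeue -u $USER -o "%i %N"     # job id and node list of your job
  ssh <nodename>
  pgrep -u $USER my_hpx_app      # find the PID(s) of your application
  gdb -p <pid>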

Please also make sure you are not affected by this:
https://github.com/STEllAR-GROUP/hpx/issues/2517
To check, run your code with -Ihpx.parcel.message_handlers=0 (see the
sketch below).  I only got bitten by this recently, at a scale of 256.

If you are running with at least 256 localities (or are using
hpx::lcos::barrier with more than 256 participating threads), please also
be aware of:
https://github.com/STEllAR-GROUP/hpx/issues/2516
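
A minimal sketch of that check (the application name and launcher
arguments are placeholders; use whatever you normally launch with, e.g.
mpirun -np as in your single-node runs, or srun in the batch job):

  # run with parcel message handlers disabled to check for issue 2517
  mpirun -np 4 ./my_hpx_app -Ihpx.parcel.message_handlers=0
  # or, on the cluster:
  srun -N 4 ./my_hpx_app -Ihpx.parcel.message_handlers=0

If the deadlock goes away with that setting, 2517 is the likely culprit.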

> 
> Interestingly, the deadlock has never occurred when simulating multiple
> localities on a single machine by starting multiple processes via
> mpirun -np.  So I guess it's a race condition influenced by the
> additional network latency on the cluster.
> 
> However, I was wondering: are there any known issues which might cause
> a (remote) action invocation to stall or deadlock?  I don't know if
> that's anything special, but let's say we have a few hundred (or
> thousand) components per locality, which can all communicate
> wildly/asynchronously.  Are there any HPX implementation or concurrency
> limits we might reach?
> 
> 
> 
> On a side note:  When running on a single node, sometimes I get the
> following error for specific extreme cases in my benchmarks:
> 
> {what}: mmap() failed to allocate thread stack due to insufficient
> resources, increase /proc/sys/vm/max_map_count or add
> -Ihpx.stacks.use_guard_pages=0 to the command line: HPX(unhandled_exception)
> 
> I assume it means I'm out of memory.  I'm just wondering, because Slurm
> usually kills my job if I exceed the reserved amount of memory.
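
Regarding the mmap() error: that is usually not about total memory but
about the number of memory mappings.  With guard pages enabled, every HPX
thread stack adds extra mappings, and /proc/sys/vm/max_map_count caps how
many mappings a single process may have, so you can run into it well
before Slurm's memory limit.  A sketch of the two remedies named in the
error message (the max_map_count value and application name are
placeholders):

  # raise the per-process mapping limit (needs root on the node)
  sudo sysctl -w vm.max_map_count=1048576

  # or run without guard pages on the thread stacks
  ./my_hpx_app -Ihpx.stacks.use_guard_pages=0
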
> 
> 
> Best,
> Tim

