Hi Michael,

On 01/12/2016 07:54 PM, Michael Levine wrote:
> Thanks Hartmut,
> I've re-built using the master branch and it's significantly better than it 
> had been before.  I am able to successfully run jobs on multiple nodes and 
> with multiple localities, although certain invocations still result in a 
> segmentation fault.
> 
> I haven't had too much time to fully experiment with different combinations 
> -- different number of localities, running on certain nodes and not on 
> others.  My suspicion is that the crash occurs when running on a particular 
> node, although I need to confirm whether or not that is the case.
> 
> I get 2 different error messages, although it is not clear to me yet when 
> and/or why this happens.  Messages follow:
> 
> shmuel@ssh01:~
>> srun -n5 -N2 1d_stencil_7
> src/tcmalloc.cc:278] Attempt to free invalid pointer 0xfffffffffd658be8
> srun: error: hpc02: task 4: Aborted
> src/tcmalloc.cc:278] Attempt to free invalid pointer 0xfffffffffeaecc28
> src/tcmalloc.cc:278] Attempt to free invalid pointer 0xfffffffffed16c28
> src/tcmalloc.cc:278] Attempt to free invalid pointer 0xfffffffffd98ec28
> src/tcmalloc.cc:278] Attempt to free invalid pointer 0xfffffffffe6a8c28
> srun: error: hpc01: tasks 0-3: Aborted
> 
> * * *
> 
> shmuel@ssh01:~
>> srun -n5 -N2 1d_stencil_7
> src/tcmalloc.cc:278] Attempt to free invalid pointer 0xfffffffffe192c28
> srun: error: hpc02: task 4: Aborted
> {stack-trace}: 13 frames:
> 0x7f0244b11c19  : hpx::termination_handler(int) + 0x159 in 
> /usr/local/lib/libhpx.so.0
> 0x7f02416828d0  : ??? + 0x7f02416828d0 in 
> /lib/x86_64-linux-gnu/libpthread.so.0
> 0x7f024578162d  : 
> hpx::util::batch_environments::slurm_environment::retrieve_number_of_localities(bool)
>  + 0x9fd in /usr/local/lib/libhpx.so.0
> 0x7f024577f36c  : 
> hpx::util::batch_environments::slurm_environment::slurm_environment(std::vector<std::string,
>  std::allocator<std::string> >&, bool) + 0x5cc in /usr/local/lib/libhpx.so.0
> 0x7f02457ea283  : 
> hpx::util::batch_environment::batch_environment(std::vector<std::string, 
> std::allocator<std::string> >&, hpx::util::runtime_configuration const&, 
> bool, bool) + 0xf3 in /usr/local/lib/libhpx.so.0
> 0x7f0245870486  : ??? + 0x7f0245870486 in /usr/local/lib/libhpx.so.0
> 0x7f024586ae8d  : ??? + 0x7f024586ae8d in /usr/local/lib/libhpx.so.0
> 0x7f0244b3c23f  : hpx::detail::run_or_start(hpx::util::function<int 
> (boost::program_options::variables_map&), false> const&, 
> boost::program_options::options_description const&, int, char**, 
> std::vector<std::string, std::allocator<std::string> >&&, 
> hpx::util::function<void (), false> const&, hpx::util::function<void (), 
> false> const&, hpx::runtime_mode, bool) + 0x25f in /usr/local/lib/libhpx.so.0
> 0x52d985        : ??? + 0x52d985 in /usr/local/bin/1d_stencil_7
> 0x417cf8        : ??? + 0x417cf8 in /usr/local/bin/1d_stencil_7
> 0x7f023f2a4b45  : __libc_start_main + 0xf5 in /lib/x86_64-linux-gnu/libc.so.6
> 0x417679        : ??? + 0x417679 in /usr/local/bin/1d_stencil_7
> {what}: Segmentation fault
<snip>

This looks odd. The stack trace points to a problem at startup, while
parsing the SLURM environment. I have never run into that issue. It
sounds like some strange hiccup between different allocators. Sometimes
CMake messes up the installation. Could you please remove the existing
installation, rebuild from a fresh build directory, and try again?

> 
> * * *
> 
> 
> Further to that point, could you please help me to understand how to attach 
> and use a debugger with the code? 
> 
> shmuel@ssh01:~
>> srun -n6 1d_stencil_7 --hpx:attach-debugger
> PID: 19307 on ssh01.thelevines.ca ready for attaching debugger. Once attached 
> set i = 1 and continue
> PID: 19305 on ssh01.thelevines.ca ready for attaching debugger. Once attached 
> set i = 1 and continue
> PID: 19310 on ssh01.thelevines.ca ready for attaching debugger. Once attached 
> set i = 1 and continue
> PID: 19306 on ssh01.thelevines.ca ready for attaching debugger. Once attached 
> set i = 1 and continue
> PID: 19309 on ssh01.thelevines.ca ready for attaching debugger. Once attached 
> set i = 1 and continue
> PID: 19308 on ssh01.thelevines.ca ready for attaching debugger. Once attached 
> set i = 1 and continue
> 
> I can attach gdb to any of the above processes, but there is no variable 'I' 
> and I can't figure out how to get the code to continue along.  Sorry if this 
> is a stupid/obvious question, but I'm a little stuck and cannot figure out 
> how to move along with it.

There are no stupid questions ;)
Debugging HPX applications is difficult. Once you have attached gdb to
one of those processes, you most likely won't see the offending code
right away.
The thread that ran into an exception (or similar) is waiting in
"nanosleep", which is part of libc. The first thing to do is to type
"info threads". This gives you a list of all operating system threads in
the process. One (or more) of them is sitting in nanosleep. Switch to
that thread with "thread N", where N is the thread number shown by "info
threads". Then move up the stack with "up" until you reach the frame of
the handle_attach_debugger function. The variable i (lowercase) lives in
that frame. Set it to 1 there (for example with "set var i = 1") and
then "continue".

I hope this helps.

> 
> Thanks again for all your assistance,
> Michael
> 
> 
> _______________________________________________
> hpx-users mailing list
> [email protected]
> https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
> 


-- 
Thomas Heller
Friedrich-Alexander-Universität Erlangen-Nürnberg
Department Informatik - Lehrstuhl Rechnerarchitektur
Martensstr. 3
91058 Erlangen
Tel.: 09131/85-27018
Fax:  09131/85-27912
Email: [email protected]