Hi,
I've been experimenting with hpx for a hobby project on a small virtualized cluster of 3 debian 8.2 machines running on an esxi server. When I run any hpx code on a single locality, it appears to work; however, whenever I try to use more than one locality, I invariably get a segmentation fault, regardless of which code I am running. I first encountered the problem with my own code, but it also happens with any of the example apps. I am somewhat new to all of this, and I cannot figure out how to attach a debugger to try to identify the cause of these errors.
I'm using hpx 0.9.11 on my small cluster with the latest version of slurm. (I chose slurm because it appears to support Intel Phi nodes running applications in native mode.) To the best of my knowledge, slurm is configured correctly, although it is certainly possible that I have made a mistake in configuring it.
I have tried using boost 1.58, 1.59, and 1.60. I have tried with clang 3.7 and with Intel C++ 16 and 16 update 1. In all cases, I get the same segmentation fault whenever I try to run on more than a single locality. I have also experimented with single vs. multiple network interfaces, single vs. multiple networks, etc.
Most recently, I rebuilt boost using the Intel compiler to rule out any issue caused by hpx and boost having been compiled with different compilers. I have been troubleshooting with the example code only, rather than my own code, so that I can be confident that the problems are not caused by my own bugs.
I know there is a command-line option to attach a debugger, but I cannot figure out how to use it.
I’ve attached a copy of my slurm.conf for reference, along with the output of --hpx:dump-config and --hpx:debug-clp.
The HPX stack trace and complete error message are copied below.
I’m really stuck here and honestly have no idea how to resolve this issue. I greatly appreciate any help that you can offer. Furthermore, I’d really appreciate some guidance as to how to use a debugger to debug my own hpx code to identify and resolve issues with that code. Please let me know if there’s any additional information that I should provide.
Thank you very much in advance,
Shmuel
shmuel@ssh01:/usr/local/lib
> srun -n1 -N1 1d_stencil_7
Localities,OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
1, 1, 0.093138849, 10, 10, 45
shmuel@ssh01:~
> srun -n2 -N1 1d_stencil_7
{stack-trace}: 4 frames:
0x7f09a45e9840 : hpx::detail::backtrace(unsigned long) + 0x80 in /usr/local/lib/libhpx.so.0
0x7f09a45eeced : boost::exception_ptr hpx::detail::get_exception<hpx::exception>(hpx::exception const&, std::string const&, std::string const&, long, std::string const&) + 0x23d in /usr/local/lib/libhpx.so.0
0x7f09a45ee8bc : void hpx::detail::throw_exception<hpx::exception>(hpx::exception const&, std::string const&, std::string const&, long) + 0x10c in /usr/local/lib/libhpx.so.0
0x7f09a4a0de3d : hpx::agas::server::primary_namespace::resolve_free_list(boost::unique_lock<hpx::lcos::local::spinlock>&, std::list<std::_Rb_tree_iterator<std::pair<hpx::naming::gid_type const, long> >, std::allocator<std::_Rb_tree_iterator<std::pair<hpx::naming::gid_type const, long> > > > const&, std::list<hpx::agas::server::primary_namespace::free_entry, std::allocator<hpx::agas::server::primary_namespace::free_entry> >&, hpx::naming::gid_type const&, hpx::naming::gid_type const&, hpx::error_code&) + 0x137d in /usr/local/lib/libhpx.so.0
{env}: 85 entries:
ALTERNATE_EDITOR=
CPATH=/opt/intel/compilers_and_libraries_2016.1.150/linux/mkl/include
CXX=clang++
DIRHISTORY_SIZE=30
DISPLAY=localhost:10.0
EDITOR=/usr/bin/vim
FPATH=/home/shmuel/.oh-my-zsh/plugins/wd:/home/shmuel/.oh-my-zsh/plugins/tmux:/home/shmuel/.oh-my-zsh/plugins/dirhistory:/home/shmuel/.oh-my-zsh/plugins/colorize:/home/shmuel/.oh-my-zsh/plugins/history:/home/shmuel/.oh-my-zsh/plugins/sudo:/home/shmuel/.oh-my-zsh/plugins/command-not-found:/home/shmuel/.oh-my-zsh/plugins/tmux:/home/shmuel/.oh-my-zsh/plugins/mosh:/home/shmuel/.oh-my-zsh/plugins/git-extras:/home/shmuel/.oh-my-zsh/plugins/battery:/home/shmuel/.oh-my-zsh/plugins/git-flow-avh:/home/shmuel/.oh-my-zsh/plugins/git:/home/shmuel/.oh-my-zsh/functions:/home/shmuel/.oh-my-zsh/completions:/usr/local/share/zsh/site-functions:/usr/share/zsh/vendor-functions:/usr/share/zsh/vendor-completions:/usr/share/zsh/functions/Calendar:/usr/share/zsh/functions/Chpwd:/usr/share/zsh/functions/Completion:/usr/share/zsh/functions/Completion/AIX:/usr/share/zsh/functions/Completion/BSD:/usr/share/zsh/functions/Completion/Base:/usr/share/zsh/functions/Completion/Cygwin:/usr/share/zsh/functions/Completion/Darwin:/usr/share/zsh/functions/Completion/Debian:/usr/share/zsh/functions/Completion/Linux:/usr/share/zsh/functions/Completion/Mandriva:/usr/share/zsh/functions/Completion/Redhat:/usr/share/zsh/functions/Completion/Solaris:/usr/share/zsh/functions/Completion/Unix:/usr/share/zsh/functions/Completion/X:/usr/share/zsh/functions/Completion/Zsh:/usr/share/zsh/functions/Completion/openSUSE:/usr/share/zsh/functions/Exceptions:/usr/share/zsh/functions/MIME:/usr/share/zsh/functions/Misc:/usr/share/zsh/functions/Newuser:/usr/share/zsh/functions/Prompts:/usr/share/zsh/functions/TCP:/usr/share/zsh/functions/VCS_Info:/usr/share/zsh/functions/VCS_Info/Backends:/usr/share/zsh/functions/Zftp:/usr/share/zsh/functions/Zle:/home/shmuel/bin/funcs
GDBSERVER_MIC=/opt/intel/debugger_2016/gdb/targets/mic/bin/gdbserver
GDB_CROSS=/opt/intel/debugger_2016/gdb/intel64_mic/bin/gdb-mic
HOME=/home/shmuel
INFOPATH=/opt/intel/documentation_2016/en/debugger//gdb-ia/info/:/opt/intel/documentation_2016/en/debugger//gdb-mic/info/:/opt/intel/documentation_2016/en/debugger//gdb-igfx/info/
INTEL_LICENSE_FILE=/opt/intel/compilers_and_libraries_2016.1.150/linux/licenses:/opt/intel/licenses:/home/shmuel/intel/licenses
INTEL_PYTHONHOME=/opt/intel/debugger_2016/python/intel64/
I_MPI_ROOT=/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi
LANG=en_US.utf8
LANGUAGE=en_CA:en
LC_ALL=en_CA.UTF-8
LC_CTYPE=en_CA.UTF-8
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2016.1.150/linux/tbb/lib/intel64/gcc4.4:/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64:/opt/intel/debugger_2016/libipt/intel64/lib:/home/shmuel/src/fx/lib/:
LESS=-R
LIBRARY_PATH=/opt/intel/compilers_and_libraries_2016.1.150/linux/tbb/lib/intel64/gcc4.4:/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64
LOGNAME=shmuel
LSCOLORS=Gxfxcxdxbxegedabagacad
MAIL=/var/mail/shmuel
MANPATH=/opt/intel/man/common:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/man:/opt/intel/compilers_and_libraries_2016.1.150/linux/man/en_US:/opt/intel/documentation_2016/en/debugger//gdb-ia/man/:/opt/intel/documentation_2016/en/debugger//gdb-mic/man/:/opt/intel/documentation_2016/en/debugger//gdb-igfx/man/::/home/shmuel/src/tup/
MIC_LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2016.1.150/linux/tbb/lib/mic:/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2016.1.150/linux/mkl/lib/mic
MIC_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/mic/lib
MKLROOT=/opt/intel/compilers_and_libraries_2016.1.150/linux/mkl
MPM_LAUNCHER=/opt/intel/debugger_2016/mpm/mic/bin/start_mpm.sh
NLSPATH=/opt/intel/compilers_and_libraries_2016.1.150/linux/compiler/lib/intel64/locale/%l_%t/%N:/opt/intel/compilers_and_libraries_2016.1.150/linux/mkl/lib/intel64/locale/%l_%t/%N:/opt/intel/debugger_2016/gdb/intel64_mic/share/locale/%l_%t/%N:/opt/intel/debugger_2016/gdb/intel64/share/locale/%l_%t/%N
OLDPWD=/home/shmuel
PAGER=less
PATH=/opt/intel/compilers_and_libraries_2016.1.150/linux/bin/intel64:/opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/bin:/opt/intel/debugger_2016/gdb/intel64_mic/bin:/usr/local/texlive/2014/bin/x86_64-linux:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/sbin:/usr/local/games:/home/shmuel/src/tup/
PWD=/home/shmuel
REPORTTIME=2
SHELL=/bin/zsh
SHLVL=1
SLURMD_NODENAME=hpc02
SLURM_CHECKPOINT_IMAGE_DIR=/var/lib/slurm-llnl/checkpoint
SLURM_CLUSTER_NAME=cluster
SLURM_CPUS_ON_NODE=2
SLURM_DISTRIBUTION=block
SLURM_GTIDS=0,1
SLURM_JOBID=2165
SLURM_JOB_CPUS_PER_NODE=2
SLURM_JOB_ID=2165
SLURM_JOB_NAME=1d_stencil_7
SLURM_JOB_NODELIST=hpc02
SLURM_JOB_NUM_NODES=1
SLURM_JOB_PARTITION=debug
SLURM_JOB_UID=1000
SLURM_JOB_USER=shmuel
SLURM_LAUNCH_NODE_IPADDR=192.168.1.125
SLURM_LOCALID=1
SLURM_NNODES=1
SLURM_NODEID=0
SLURM_NODELIST=hpc02
SLURM_NPROCS=2
SLURM_NTASKS=2
SLURM_PRIO_PROCESS=0
SLURM_PROCID=1
SLURM_SRUN_COMM_HOST=192.168.1.125
SLURM_SRUN_COMM_PORT=45712
SLURM_STEPID=0
SLURM_STEP_ID=0
SLURM_STEP_LAUNCHER_PORT=45712
SLURM_STEP_NODELIST=hpc02
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=2
SLURM_STEP_TASKS_PER_NODE=2
SLURM_SUBMIT_DIR=/home/shmuel
SLURM_SUBMIT_HOST=ssh01.thelevines.ca
SLURM_TASKS_PER_NODE=2
SLURM_TASK_PID=18855
SLURM_TOPOLOGY_ADDR=hpc02
SLURM_TOPOLOGY_ADDR_PATTERN=node
SRUN_DEBUG=3
SSH_CLIENT=193.90.12.86 38280 22
SSH_CONNECTION=193.90.12.86 38280 192.168.1.125 22
SSH_TTY=/dev/pts/0
TERM=xterm
USER=shmuel
ZSH_TMUX_TERM=screen
_=/usr/local/bin/srun
_ZSH_TMUX_FIXED_CONFIG=/home/shmuel/.oh-my-zsh/plugins/tmux/tmux.only.conf
{locality-id}: 1
{hostname}: [ (tcp:192.168.1.72:7911) ]
{process-id}: 18855
{function}: primary_namespace::resolve_free_list
{file}: /usr/src/hpx/src/runtime/agas/server/primary_namespace_server.cpp
{line}: 1021
{os-thread}: 0, worker-thread#0
{thread-id}: 00000000020813c0
{thread-description}: <unknown>
{state}: state_running
{auxinfo}:
{config}:
HPX_HAVE_NATIVE_TLS=ON
HPX_HAVE_STACKTRACES=ON
HPX_HAVE_COMPRESSION_BZIP2=OFF
HPX_HAVE_COMPRESSION_SNAPPY=OFF
HPX_HAVE_COMPRESSION_ZLIB=OFF
HPX_HAVE_PARCEL_COALESCING=ON
HPX_HAVE_PARCELPORT_TCP=ON
HPX_HAVE_PARCELPORT_MPI=OFF
HPX_HAVE_PARCELPORT_IPC=OFF
HPX_HAVE_PARCELPORT_IBVERBS=OFF
HPX_HAVE_VERIFY_LOCKS=OFF
HPX_HAVE_HWLOC=ON
HPX_HAVE_ITTNOTIFY=OFF
HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
HPX_LIMIT=5
HPX_PARCEL_MAX_CONNECTIONS=512
HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
HPX_HAVE_MALLOC=tcmalloc
HPX_PREFIX (configured)=/usr/local
HPX_PREFIX=/usr/local
{version}: V0.9.11 (AGAS: V3.0), Git: 4c96a9b3b3
{boost}: V1.60.0
{build-type}: release
{date}: Jan 3 2016 23:53:54
{platform}: linux
{compiler}: Intel C++ C++0x mode version 1600
{stdlib}: GNU libstdc++ version 20141220
{what}: primary_namespace::resolve_free_list, failed to resolve gid, gid({0000000200000001, 0000000000001002}): HPX(internal_server_error)
{stack-trace}: 2 frames:
0x7f09a4670d79 : hpx::termination_handler(int) + 0x159 in /usr/local/lib/libhpx.so.0
0x7f09a11e78d0 : ??? + 0x7f09a11e78d0 in /lib/x86_64-linux-gnu/libpthread.so.0
{what}: Segmentation fault
{config}:
HPX_HAVE_NATIVE_TLS=ON
HPX_HAVE_STACKTRACES=ON
HPX_HAVE_COMPRESSION_BZIP2=OFF
HPX_HAVE_COMPRESSION_SNAPPY=OFF
HPX_HAVE_COMPRESSION_ZLIB=OFF
HPX_HAVE_PARCEL_COALESCING=ON
HPX_HAVE_PARCELPORT_TCP=ON
HPX_HAVE_PARCELPORT_MPI=OFF
HPX_HAVE_PARCELPORT_IPC=OFF
HPX_HAVE_PARCELPORT_IBVERBS=OFF
HPX_HAVE_VERIFY_LOCKS=OFF
HPX_HAVE_HWLOC=ON
HPX_HAVE_ITTNOTIFY=OFF
HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
HPX_LIMIT=5
HPX_PARCEL_MAX_CONNECTIONS=512
HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
HPX_HAVE_MALLOC=tcmalloc
HPX_PREFIX (configured)=/usr/local
HPX_PREFIX=/usr/local
{version}: V0.9.11 (AGAS: V3.0), Git: 4c96a9b3b3
{boost}: V1.60.0
{build-type}: release
{date}: Jan 3 2016 23:53:54
{platform}: linux
{compiler}: Intel C++ C++0x mode version 1600
{stdlib}: GNU libstdc++ version 20141220
srun: error: hpc02: task 1: Aborted