Quick update-

> 
> This looks odd. It looks like a problem at startup when parsing the SLURM
> environment. I never ran into that issue... This sounds like some strange
> hiccup between different allocators etc. Sometimes, cmake messes up the
> installation. Could you please try the same out of a fresh build directory
> after you removed the installation?

I've completely removed all traces of the old installation from each of the
3 machines and rebuilt the code on all 3 systems, with Intel compiler v16
update 1.  The only difference from previous builds was changing the install
prefix from /usr/local to a different directory, to make it easier to
isolate all files in the future.  I also built the code on all 3 systems
rather than copying the binary files from 1 system to the others, just to
try and eliminate any possible causes for the errors.

For reference, the build command used was as follows:
> cmake -DHPX_WITH_MALLOC=tcmalloc -DCMAKE_C_COMPILER=$(which icc)
-DCMAKE_CXX_COMPILER=$(which icpc) -DHPX_WTIH_TESTS=NO
-DHPX_WITH_EXAMPLES=YES -DHPX_WITH_TESTS_REGRESSIONS=NO
-DHPX_WITH_TESTS_UNIT=NO -DHPX_WITH_TESTS_EXTERNAL_BUILD=NO
-DBOOST_ROOT=/usr/src/boost_1_60_0 -DHPX_WITH_COMPRESSION_SNAPPY=TRUE
-DHPX_WITH_PARCELPORT_MPI=TRUE -DHPX_WITH_PARCELPORT_TCP=TRUE
-DHPX_WITH_ITTNOTIFY=TRUE -DMPI_C_COMPILER=$(which mpicc)
-DMPI_CXX_COMPILER=$(which mpicxx) -DCMAKE_INSTALL_PREFIX=/opt/hpx/0.9.12
../ 

Running the code gives me the following results:
1. Running the example on each individual system works fine
2. Using SLURM, running the example on a single node works fine, regardless
of the number of tasks requested.  I've tested this on each of the 3 nodes:
srun -N1 -nX 1d_stencil_8, where X = {1..4} for hpc01 and hpc02 and X =
{1..8} for ssh01.
3. Using SLURM, running the example on a single node works fine, regardless
of the number of CPUs per task requested.  I've tested this on each of the
3 nodes:
srun -N1 -cX 1d_stencil_8, where X = {1..4} for hpc01 and hpc02 and X =
{1..8} for ssh01.
4. Running the code on multiple nodes works fine if the number of tasks (-n)
equals the number of nodes (-N), i.e. n == N; otherwise it fails with either
a segmentation fault and stack trace or "src/tcmalloc.cc:278] Attempt to
free invalid pointer 0xfffffffffd8a8be8".
5. When n == N, the code runs successfully for any valid value of -c (ssh01
has 8 vCPUs; hpc0[1,2] have 4 vCPUs each).
6. If I try to allow overcommitting of resources (-O argument to srun), hpx
immediately fails with a floating point exception whenever the number of
tasks (-n) > 1.  Even in cases where the example would otherwise run
successfully (for example, on a single node), I always get the floating
point exception.  Diagnostic output follows:

shmuel@ssh01:/tmp
> srun -N1 -n2 1d_stencil_8
Localities,OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
2,     2,     0.1433835, 10,                   10,                   45

shmuel@ssh01:/tmp
> srun -N1 -n2 -O 1d_stencil_8
{stack-trace}: {stack-trace}: 15 frames:
0x7fd94c021809  : hpx::termination_handler(int) + 0x159 in
/opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd948ad28d0  : ??? + 0x7fd948ad28d0 in
/lib/x86_64-linux-gnu/libpthread.so.0
0x7fd94c0f1de5  : ??? + 0x7fd94c0f1de5 in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd94c5dbb7b  : ??? + 0x7fd94c5dbb7b in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd94c5f01cb  :
hpx::threads::detail::thread_pool<hpx::threads::policies::local_priority_que
ue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_lifo>
>::create_thread(hpx::threads::thread_init_data&,
boost::intrusive_ptr<hpx::threads::thread_data>&,
hpx::threads::thread_state_enum, bool, hpx::error_code&) + 0x5b in
/opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd94c07cccc  :
hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boo
st::mutex, hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_lifo> >::start(hpx::util::function<int (),
false> const&, bool) + 0x3ac in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd94c080bae  :
hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boo
st::mutex, hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_lifo> >::run(hpx::util::function<int (),
false> const&) + 0xe in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd94befd3e4  : ??? + 0x7fd94befd3e4 in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd94bef84d3  : ??? + 0x7fd94bef84d3 in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7fd94bef4ed2  : hpx::detail::run_or_start(hpx::util::function<int
(boost::program_options::variables_map&), false> const&,
boost::program_options::options_description const&, int, char**,
std::vector<std::string, std::allocator<std::string> >&&,
hpx::util::function<void (), false> const&, hpx::util::function<void (),
false> const&, hpx::runtime_mode, bool) + 0x442 in
/opt/hpx/0.9.12/lib/libhpx.so.0
0x57ad66        : ??? + 0x57ad66 in /opt/hpx/0.9.12/bin/1d_stencil_8
0x41dbc1        : ??? + 0x41dbc1 in /opt/hpx/0.9.12/bin/1d_stencil_8
0x7fd9466f4b45  : __libc_start_main + 0xf5 in
/lib/x86_64-linux-gnu/libc.so.6
0x41d5d9        : ??? + 0x41d5d9 in /opt/hpx/0.9.12/bin/1d_stencil_8
{what}: Floating point exception
15 frames:
0x7f9a0404e809  : hpx::termination_handler(int) + 0x159 in
/opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a00aff8d0  : ??? + 0x7f9a00aff8d0 in
/lib/x86_64-linux-gnu/libpthread.so.0
0x7f9a0411ede5  : ??? + 0x7f9a0411ede5 in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a04608b7b  : ??? + 0x7f9a04608b7b in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a0461d1cb  :
hpx::threads::detail::thread_pool<hpx::threads::policies::local_priority_que
ue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_lifo>
>::create_thread(hpx::threads::thread_init_data&,
boost::intrusive_ptr<hpx::threads::thread_data>&,
hpx::threads::thread_state_enum, bool, hpx::error_code&) + 0x5b in
/opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a040a9ccc  :
hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boo
st::mutex, hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_lifo> >::start(hpx::util::function<int (),
false> const&, bool) + 0x3ac in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a040adbae  :
hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boo
st::mutex, hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_fifo,
hpx::threads::policies::lockfree_lifo> >::run(hpx::util::function<int (),
false> const&) + 0xe in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a03f2a3e4  : ??? + 0x7f9a03f2a3e4 in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a03f254d3  : ??? + 0x7f9a03f254d3 in /opt/hpx/0.9.12/lib/libhpx.so.0
0x7f9a03f21ed2  : hpx::detail::run_or_start(hpx::util::function<int
(boost::program_options::variables_map&), false> const&,
boost::program_options::options_description const&, int, char**,
std::vector<std::string, std::allocator<std::string> >&&,
hpx::util::function<void (), false> const&, hpx::util::function<void (),
false> const&, hpx::runtime_mode, bool) + 0x442 in
/opt/hpx/0.9.12/lib/libhpx.so.0
0x57ad66        : ??? + 0x57ad66 in /opt/hpx/0.9.12/bin/1d_stencil_8
0x41dbc1        : ??? + 0x41dbc1 in /opt/hpx/0.9.12/bin/1d_stencil_8
0x7f99fe721b45  : __libc_start_main + 0xf5 in
/lib/x86_64-linux-gnu/libc.so.6
0x41d5d9        : ??? + 0x41d5d9 in /opt/hpx/0.9.12/bin/1d_stencil_8
{what}: Floating point exception
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=ON
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_HAVE_PARCELPORT_TCP=ON
  HPX_HAVE_PARCELPORT_MPI=ON (MPICH V3.1.2, MPI V3.0)
  HPX_HAVE_PARCELPORT_IPC=OFF
  HPX_HAVE_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=OFF
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
  HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
  HPX_HAVE_MALLOC=tcmalloc
  HPX_PREFIX (configured)=/opt/hpx/0.9.12
  HPX_PREFIX=/opt/hpx/0.9.12
{version}: V0.9.12-trunk (AGAS: V3.0), Git: bbe65bbd48
{boost}: V1.60.0
{build-type}: release
{date}: Jan 14 2016 20:16:12
{platform}: linux
{compiler}: Intel C++ C++0x mode version 1600
{stdlib}: GNU libstdc++ version 20141220
{config}:
  HPX_HAVE_NATIVE_TLS=ON
  HPX_HAVE_STACKTRACES=ON
  HPX_HAVE_COMPRESSION_BZIP2=OFF
  HPX_HAVE_COMPRESSION_SNAPPY=ON
  HPX_HAVE_COMPRESSION_ZLIB=OFF
  HPX_HAVE_PARCEL_COALESCING=ON
  HPX_HAVE_PARCELPORT_TCP=ON
  HPX_HAVE_PARCELPORT_MPI=ON (MPICH V3.1.2, MPI V3.0)
  HPX_HAVE_PARCELPORT_IPC=OFF
  HPX_HAVE_PARCELPORT_IBVERBS=OFF
  HPX_HAVE_VERIFY_LOCKS=OFF
  HPX_HAVE_HWLOC=ON
  HPX_HAVE_ITTNOTIFY=OFF
  HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
  HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
  HPX_HAVE_MALLOC=tcmalloc
  HPX_PREFIX (configured)=/opt/hpx/0.9.12
  HPX_PREFIX=/opt/hpx/0.9.12
{version}: V0.9.12-trunk (AGAS: V3.0), Git: bbe65bbd48
{boost}: V1.60.0
{build-type}: release
{date}: Jan 14 2016 20:16:12
{platform}: linux
{compiler}: Intel C++ C++0x mode version 1600
{stdlib}: GNU libstdc++ version 20141220
srun: error: hpc01: tasks 0-1: Aborted
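
In case it helps anyone reproduce the single-node cases above, here is a
minimal sketch of a loop over the srun invocations.  The names srun_matrix
and DRY_RUN are hypothetical (they are not part of SLURM or HPX), and the
sketch does not pin the job to a particular node.  By default it only prints
the commands; setting DRY_RUN=0 would actually run them.

```shell
#!/bin/sh
# Hypothetical helper: print (or run) "srun -N1 <flag>X 1d_stencil_8"
# for X = 1..max. DRY_RUN=1 (the default) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

srun_matrix() {
    flag=$1     # -n (tasks) or -c (cpus per task)
    max=$2      # largest value to try, e.g. 4 for hpc01/hpc02, 8 for ssh01
    extra=$3    # optional extra srun argument, e.g. -O
    x=1
    while [ "$x" -le "$max" ]; do
        cmd="srun -N1 ${flag}${x}${extra:+ $extra} 1d_stencil_8"
        if [ "$DRY_RUN" = 1 ]; then
            echo "$cmd"
        else
            echo "== $cmd"
            # Keep going after a failure so the whole matrix is covered.
            $cmd || echo "FAILED: $cmd"
        fi
        x=$((x + 1))
    done
}
```

For example, "srun_matrix -n 4" lists the task-count cases for hpc01, and
"srun_matrix -n 4 -O" lists the same cases with overcommit enabled.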


I'm totally at a loss here.  I wouldn't put it past me to have
misconfigured slurm, although it appears to me that other applications
work fine with slurm (such as the slurm test suite, as well as simple
commands such as hostname).  I can run the following command: "srun -N3 -n80
-O hostname", without any issues at all.  I'm not sure that proves anything,
though: hpx is orders of magnitude more complex than 'hostname'.
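
One way to get a little more out of the hostname check is to count how many
tasks land on each node.  This is only a sketch: count_per_host is a
hypothetical helper name, and it simply reads hostnames on stdin.

```shell
# Hypothetical helper: read one hostname per line on stdin and print
# "<host>: <number of lines seen for that host>", sorted by hostname.
count_per_host() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

It would be used as "srun -N3 -n80 -O hostname | count_per_host", which
shows how the 80 tasks were spread across the 3 nodes.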

If it might help to solve my trouble, I can provide access to the cluster.


Incidentally, cmake completes with a warning that:

  Manually-specified variables were not used by the project:

    HPX_WITH_ITTNOTIFY
    HPX_WITH_TESTS_EXTERNAL_BUILD
    HPX_WTIH_TESTS

These variables are all noted in the current documentation.

Thanks again for all of your help.

Best regards,
Michael


> --
> Thomas Heller
> Friedrich-Alexander-Universität Erlangen-Nürnberg Department Informatik -
> Lehrstuhl Rechnerarchitektur Martensstr. 3
> 91058 Erlangen
> Tel.: 09131/85-27018
> Fax:  09131/85-27912
> Email: [email protected]
> _______________________________________________
> hpx-users mailing list
> [email protected]
> https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
