Michael,

That shouldn't happen. Could you please create a ticket
(https://github.com/STEllAR-GROUP/hpx/issues) detailing the steps to
reproduce the problem? Generally, any segfault is bad; we should report
problems properly.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu


> -----Original Message-----
> From: [email protected] [mailto:hpx-users-
> [email protected]] On Behalf Of Michael Levine
> Sent: Monday, January 18, 2016 12:24 AM
> To: [email protected]
> Subject: Re: [hpx-users] hpx 0.9.11 segmentation fault running on multiple
> localities
> 
> Quick update-
> 
> >
> > This looks odd. It looks like a problem at startup when parsing the SLURM
> > environment. I never ran into that issue... This sounds like some strange
> > hiccup between different allocators etc. Sometimes, cmake messes up the
> > installation. Could you please try the same out of a fresh build directory
> > after you removed the installation?
> 
> I've completely removed all traces of the old installation from each of the
> 3 machines and rebuilt the code on all 3 systems, with Intel compiler v16
> update 1.  The only difference from previous builds was changing the install
> prefix from /usr/local to a different directory, to make it easier to
> isolate all files in the future.  I also built the code on all 3 systems
> rather than copying the binary files from 1 system to the others, just to
> try and eliminate any possible causes for the errors.
> 
> For reference, the build command used was as follows:
> > cmake -DHPX_WITH_MALLOC=tcmalloc -DCMAKE_C_COMPILER=$(which icc)
> -DCMAKE_CXX_COMPILER=$(which icpc) -DHPX_WTIH_TESTS=NO
> -DHPX_WITH_EXAMPLES=YES -DHPX_WITH_TESTS_REGRESSIONS=NO
> -DHPX_WITH_TESTS_UNIT=NO -DHPX_WITH_TESTS_EXTERNAL_BUILD=NO
> -DBOOST_ROOT=/usr/src/boost_1_60_0 -DHPX_WITH_COMPRESSION_SNAPPY=TRUE
> -DHPX_WITH_PARCELPORT_MPI=TRUE -DHPX_WITH_PARCELPORT_TCP=TRUE
> -DHPX_WITH_ITTNOTIFY=TRUE -DMPI_C_COMPILER=$(which mpicc)
> -DMPI_CXX_COMPILER=$(which mpicxx) -DCMAKE_INSTALL_PREFIX=/opt/hpx/0.9.12
> ../
> 
> Running the code gives me the following results:
> 1. Running the example on each individual system works fine
> 2. Using SLURM, running the example on a single node works fine, regardless
> of quantity of 'tasks' to create.  I've tested this for each of the 3 nodes:
> srun -N1 -nX 1d_stencil_8, where X = {1..4} for hpc01 and hpc02 and X =
> {1..8} for ssh01.
> 3. Using SLURM, running the example on a single node works fine, regardless
> of quantity of 'cpus per task' to create.  I've tested this for each of the
> 3 nodes:
> srun -N1 -cX 1d_stencil_8, where X = {1..4} for hpc01 and hpc02 and X =
> {1..8} for ssh01.
> 4. Running the code on multiple nodes: works fine if n == N, otherwise fails
> with either a segmentation fault and stack trace or
> " src/tcmalloc.cc:278] Attempt to free invalid pointer 0xfffffffffd8a8be8"
> 5. When n == N, the code runs successfully for any valid value of -c (i.e.
> ssh01 has 8 vCPUs, hpc0[1,2] has 4 vCPUs)
> 6. If I try to allow overcommitting of resources (-O argument to srun), hpx
> immediately fails with a floating point exception for any case where number
> of tasks (-n) > 1.  Even when the example would otherwise successfully run
> (for example, when run on a single node), I always get the floating point
> exception.  Diagnostic output follows:
> 
> shmuel@ssh01:/tmp
> > srun -N1 -n2 1d_stencil_8
> Localities,OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
> 2,     2,     0.1433835, 10,                   10,                   45
> 
> shmuel@ssh01:/tmp
> > srun -N1 -n2 -O 1d_stencil_8
> {stack-trace}: {stack-trace}: 15 frames:
> 0x7fd94c021809  : hpx::termination_handler(int) + 0x159 in
> /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd948ad28d0  : ??? + 0x7fd948ad28d0 in
> /lib/x86_64-linux-gnu/libpthread.so.0
> 0x7fd94c0f1de5  : ??? + 0x7fd94c0f1de5 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c5dbb7b  : ??? + 0x7fd94c5dbb7b in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c5f01cb  :
> hpx::threads::detail::thread_pool<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_lifo>
> >::create_thread(hpx::threads::thread_init_data&,
> boost::intrusive_ptr<hpx::threads::thread_data>&,
> hpx::threads::thread_state_enum, bool, hpx::error_code&) + 0x5b in
> /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c07cccc  :
> hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_lifo> >::start(hpx::util::function<int (),
> false> const&, bool) + 0x3ac in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c080bae  :
> hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_lifo> >::run(hpx::util::function<int (),
> false> const&) + 0xe in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94befd3e4  : ??? + 0x7fd94befd3e4 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94bef84d3  : ??? + 0x7fd94bef84d3 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94bef4ed2  : hpx::detail::run_or_start(hpx::util::function<int
> (boost::program_options::variables_map&), false> const&,
> boost::program_options::options_description const&, int, char**,
> std::vector<std::string, std::allocator<std::string> >&&,
> hpx::util::function<void (), false> const&, hpx::util::function<void (),
> false> const&, hpx::runtime_mode, bool) + 0x442 in
> /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x57ad66        : ??? + 0x57ad66 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x41dbc1        : ??? + 0x41dbc1 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x7fd9466f4b45  : __libc_start_main + 0xf5 in
> /lib/x86_64-linux-gnu/libc.so.6
> 0x41d5d9        : ??? + 0x41d5d9 in /opt/hpx/0.9.12/bin/1d_stencil_8
> {what}: Floating point exception
> 15 frames:
> 0x7f9a0404e809  : hpx::termination_handler(int) + 0x159 in
> /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a00aff8d0  : ??? + 0x7f9a00aff8d0 in
> /lib/x86_64-linux-gnu/libpthread.so.0
> 0x7f9a0411ede5  : ??? + 0x7f9a0411ede5 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a04608b7b  : ??? + 0x7f9a04608b7b in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a0461d1cb  :
> hpx::threads::detail::thread_pool<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_lifo>
> >::create_thread(hpx::threads::thread_init_data&,
> boost::intrusive_ptr<hpx::threads::thread_data>&,
> hpx::threads::thread_state_enum, bool, hpx::error_code&) + 0x5b in
> /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a040a9ccc  :
> hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_lifo> >::start(hpx::util::function<int (),
> false> const&, bool) + 0x3ac in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a040adbae  :
> hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_fifo,
> hpx::threads::policies::lockfree_lifo> >::run(hpx::util::function<int (),
> false> const&) + 0xe in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a03f2a3e4  : ??? + 0x7f9a03f2a3e4 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a03f254d3  : ??? + 0x7f9a03f254d3 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a03f21ed2  : hpx::detail::run_or_start(hpx::util::function<int
> (boost::program_options::variables_map&), false> const&,
> boost::program_options::options_description const&, int, char**,
> std::vector<std::string, std::allocator<std::string> >&&,
> hpx::util::function<void (), false> const&, hpx::util::function<void (),
> false> const&, hpx::runtime_mode, bool) + 0x442 in
> /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x57ad66        : ??? + 0x57ad66 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x41dbc1        : ??? + 0x41dbc1 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x7f99fe721b45  : __libc_start_main + 0xf5 in
> /lib/x86_64-linux-gnu/libc.so.6
> 0x41d5d9        : ??? + 0x41d5d9 in /opt/hpx/0.9.12/bin/1d_stencil_8
> {what}: Floating point exception
> {config}:
>   HPX_HAVE_NATIVE_TLS=ON
>   HPX_HAVE_STACKTRACES=ON
>   HPX_HAVE_COMPRESSION_BZIP2=OFF
>   HPX_HAVE_COMPRESSION_SNAPPY=ON
>   HPX_HAVE_COMPRESSION_ZLIB=OFF
>   HPX_HAVE_PARCEL_COALESCING=ON
>   HPX_HAVE_PARCELPORT_TCP=ON
>   HPX_HAVE_PARCELPORT_MPI=ON (MPICH V3.1.2, MPI V3.0)
>   HPX_HAVE_PARCELPORT_IPC=OFF
>   HPX_HAVE_PARCELPORT_IBVERBS=OFF
>   HPX_HAVE_VERIFY_LOCKS=OFF
>   HPX_HAVE_HWLOC=ON
>   HPX_HAVE_ITTNOTIFY=OFF
>   HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
>   HPX_PARCEL_MAX_CONNECTIONS=512
>   HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
>   HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
>   HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
>   HPX_HAVE_MALLOC=tcmalloc
>   HPX_PREFIX (configured)=/opt/hpx/0.9.12
>   HPX_PREFIX=/opt/hpx/0.9.12
> {version}: V0.9.12-trunk (AGAS: V3.0), Git: bbe65bbd48
> {boost}: V1.60.0
> {build-type}: release
> {date}: Jan 14 2016 20:16:12
> {platform}: linux
> {compiler}: Intel C++ C++0x mode version 1600
> {stdlib}: GNU libstdc++ version 20141220
> {config}:
>   HPX_HAVE_NATIVE_TLS=ON
>   HPX_HAVE_STACKTRACES=ON
>   HPX_HAVE_COMPRESSION_BZIP2=OFF
>   HPX_HAVE_COMPRESSION_SNAPPY=ON
>   HPX_HAVE_COMPRESSION_ZLIB=OFF
>   HPX_HAVE_PARCEL_COALESCING=ON
>   HPX_HAVE_PARCELPORT_TCP=ON
>   HPX_HAVE_PARCELPORT_MPI=ON (MPICH V3.1.2, MPI V3.0)
>   HPX_HAVE_PARCELPORT_IPC=OFF
>   HPX_HAVE_PARCELPORT_IBVERBS=OFF
>   HPX_HAVE_VERIFY_LOCKS=OFF
>   HPX_HAVE_HWLOC=ON
>   HPX_HAVE_ITTNOTIFY=OFF
>   HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
>   HPX_PARCEL_MAX_CONNECTIONS=512
>   HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
>   HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
>   HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
>   HPX_HAVE_MALLOC=tcmalloc
>   HPX_PREFIX (configured)=/opt/hpx/0.9.12
>   HPX_PREFIX=/opt/hpx/0.9.12
> {version}: V0.9.12-trunk (AGAS: V3.0), Git: bbe65bbd48
> {boost}: V1.60.0
> {build-type}: release
> {date}: Jan 14 2016 20:16:12
> {platform}: linux
> {compiler}: Intel C++ C++0x mode version 1600
> {stdlib}: GNU libstdc++ version 20141220
> srun: error: hpc01: tasks 0-1: Aborted
> 
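
For the ticket, the node/task combinations described in points 2 to 5 above could be captured with a small script. The following is only a sketch under stated assumptions: run_matrix and the LAUNCHER variable are hypothetical names invented here, the inner loop starts at n = N on the assumption that fewer tasks than nodes is not of interest, and the only srun options used are the -N and -n flags already shown in this thread.

```shell
# Sketch: run one example for every (nodes, tasks) pair with tasks >= nodes
# and print the exit status of each run. Output of each run goes to a log
# file in the current directory.
# LAUNCHER is a hypothetical override; it defaults to srun.
run_matrix() {
    max_nodes=$1
    max_tasks=$2
    app=$3
    N=1
    while [ "$N" -le "$max_nodes" ]; do
        n=$N
        while [ "$n" -le "$max_tasks" ]; do
            # Same invocation style as in the report: srun -N<nodes> -n<tasks> <app>
            "${LAUNCHER:-srun}" "-N$N" "-n$n" "$app" > "run_N${N}_n${n}.log" 2>&1
            echo "N=$N n=$n exit=$?"
            n=$((n + 1))
        done
        N=$((N + 1))
    done
}
```

Calling, for example, run_matrix 3 8 1d_stencil_8 would print one "N=... n=... exit=..." line per combination; the lines with a non-zero exit status, together with the matching log files, are what would be most useful to attach to the ticket.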
> 
> I'm totally at a loss here.  I wouldn't put it past me to have
> mis-configured slurm, although it appears to me that other applications/code
> work fine with slurm (such as the slurm test set, as well as simple
> commands such as hostname).  I can run the following command:
> "srun -N3 -n80 -O hostname", without any issues at all.  I'm not sure that
> proves anything -- hpx is orders of magnitude more complex than 'hostname'.
> 
> If it might help to solve my trouble, I can provide access to the cluster.
> 
> 
> Incidentally, cmake completes with a warning that:
> 
>   Manually-specified variables were not used by the project:
> 
>     HPX_WITH_ITTNOTIFY
>     HPX_WITH_TESTS_EXTERNAL_BUILD
>     HPX_WTIH_TESTS
> 
> These variables are all noted in the current documentation.
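
One of the three, HPX_WTIH_TESTS, is spelled with "WTIH" in the cmake command quoted above, whereas every other HPX option there uses the HPX_WITH_ prefix. A quick way to catch that kind of slip before configuring might look like the sketch below. check_hpx_options is a hypothetical helper, and the rule it applies (every -DHPX_ option should start with HPX_WITH_) is an assumption drawn only from the options in this thread; it says nothing about why the other two variables were reported as unused.

```shell
# Sketch: print every -D option whose name starts with HPX_ but not with
# HPX_WITH_. Options that do not start with HPX_ are ignored.
check_hpx_options() {
    for arg in "$@"; do
        case $arg in
            -DHPX_WITH_*) ;;                            # expected prefix, nothing to report
            -DHPX_*) echo "suspicious: ${arg%%=*}" ;;   # strip "=value", report the name
        esac
    done
}
```

Run against the options from the build command, for example check_hpx_options -DHPX_WITH_MALLOC=tcmalloc -DHPX_WTIH_TESTS=NO, it prints only the misspelled option.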
> 
> Thanks again for all of your help.
> 
> Best regards,
> Michael
> 
> 
> > --
> > Thomas Heller
> > Friedrich-Alexander-Universität Erlangen-Nürnberg
> > Department Informatik - Lehrstuhl Rechnerarchitektur
> > Martensstr. 3
> > 91058 Erlangen
> > Tel.: 09131/85-27018
> > Fax:  09131/85-27912
> > Email: [email protected]
> > _______________________________________________
> > hpx-users mailing list
> > [email protected]
> > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
> 
