Michael,

That shouldn't happen. Could you please create a ticket (https://github.com/STEllAR-GROUP/hpx/issues) detailing the steps to reproduce the problem? Generally, any segfault is bad; we should be reporting problems properly instead.
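
If it helps when writing up the ticket, the reproduction steps could be captured with something like the sketch below. It only records the exit status of each node/task combination. The function name `run_matrix`, the `LAUNCHER` variable, and the ranges passed to it are hypothetical placeholders, not part of HPX or SLURM; only the `srun -N... -n...` form and the `1d_stencil_8` example are taken from the report.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPX or SLURM): run one example over a grid
# of node and task counts and print the exit status of every run.

# Assumption: jobs are launched with srun; override LAUNCHER to use another one.
LAUNCHER="${LAUNCHER:-srun}"

# run_matrix EXAMPLE MAX_NODES MAX_TASKS
# Prints one line per combination: "N=<nodes> n=<tasks> exit=<status>".
run_matrix() {
    example=$1
    max_nodes=$2
    max_tasks=$3
    nodes=1
    while [ "$nodes" -le "$max_nodes" ]; do
        # Start at n == N, the case reported to work, then add more tasks.
        tasks=$nodes
        while [ "$tasks" -le "$max_tasks" ]; do
            # Program output is discarded; only the exit status is recorded.
            $LAUNCHER -N"$nodes" -n"$tasks" "$example" >/dev/null 2>&1
            echo "N=$nodes n=$tasks exit=$?"
            tasks=$((tasks + 1))
        done
        nodes=$((nodes + 1))
    done
}
```

A call such as `run_matrix 1d_stencil_8 3 8` would cover up to 3 nodes and 8 tasks. A non-zero exit status marks the combinations worth listing in the ticket, together with the stack trace for one of them.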

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

> -----Original Message-----
> From: [email protected] [mailto:hpx-users-[email protected]] On Behalf Of Michael Levine
> Sent: Monday, January 18, 2016 12:24 AM
> To: [email protected]
> Subject: Re: [hpx-users] hpx 0.9.11 segmentation fault running on multiple localities
>
> Quick update-
>
> > This looks odd. It looks like a problem at startup when parsing the SLURM
> > environment. I never ran into that issue... This sounds like some strange
> > hiccup between different allocators etc. Sometimes, cmake messes up the
> > installation. Could you please try the same out of a fresh build directory
> > after you removed the installation?
>
> I've completely removed all traces of the old installation from each of the
> 3 machines and rebuilt the code on all 3 systems, with Intel compiler v16
> update 1. The only difference from previous builds was changing the install
> prefix from /usr/local to a different directory, to make it easier to
> isolate all files in the future. I also built the code on all 3 systems
> rather than copying the binary files from 1 system to the others, just to
> try and eliminate any possible causes for the errors.
>
> For reference, the build command used was as follows:
>
> cmake -DHPX_WITH_MALLOC=tcmalloc
>   -DCMAKE_C_COMPILER=$(which icc)
>   -DCMAKE_CXX_COMPILER=$(which icpc)
>   -DHPX_WTIH_TESTS=NO
>   -DHPX_WITH_EXAMPLES=YES
>   -DHPX_WITH_TESTS_REGRESSIONS=NO
>   -DHPX_WITH_TESTS_UNIT=NO
>   -DHPX_WITH_TESTS_EXTERNAL_BUILD=NO
>   -DBOOST_ROOT=/usr/src/boost_1_60_0
>   -DHPX_WITH_COMPRESSION_SNAPPY=TRUE
>   -DHPX_WITH_PARCELPORT_MPI=TRUE
>   -DHPX_WITH_PARCELPORT_TCP=TRUE
>   -DHPX_WITH_ITTNOTIFY=TRUE
>   -DMPI_C_COMPILER=$(which mpicc)
>   -DMPI_CXX_COMPILER=$(which mpicxx)
>   -DCMAKE_INSTALL_PREFIX=/opt/hpx/0.9.12
>   ../
>
> Running the code gives me the following results:
> 1. Running the example on each individual system works fine.
> 2. Using SLURM, running the example on a single node works fine, regardless
>    of the quantity of 'tasks' to create. I've tested this for each of the 3
>    nodes: srun -N1 -nX 1d_stencil_8, where X = {1..4} for hpc01 and hpc02
>    and X = {1..8} for ssh01.
> 3. Using SLURM, running the example on a single node works fine, regardless
>    of the quantity of 'cpus per task' to create. I've tested this for each
>    of the 3 nodes: srun -N1 -cX 1d_stencil_8, where X = {1..4} for hpc01 and
>    hpc02 and X = {1..8} for ssh01.
> 4. Running the code on multiple nodes: works fine if n == N, otherwise fails
>    with either a segmentation fault and stack trace or " src/tcmalloc.cc:278]
>    Attempt to free invalid pointer 0xfffffffffd8a8be8"
> 5. When n == N, the code runs successfully for any valid value of -c (i.e.
>    ssh01 has 8 vCPUs, hpc0[1,2] has 4 vCPUs).
> 6. If I try to allow overcommitting of resources (-O argument to srun), hpx
>    immediately fails with a floating point exception for any case where the
>    number of tasks (-n) > 1. Even when the example would otherwise run
>    successfully (for example, when run on a single node), I always get the
>    floating point exception. Diagnostic output follows:
>
> shmuel@ssh01:/tmp > srun -N1 -n2 1d_stencil_8
> Localities,OS_Threads,Execution_Time_sec,Points_per_Partition,Partitions,Time_Steps
> 2, 2, 0.1433835, 10, 10, 45
>
> shmuel@ssh01:/tmp > srun -N1 -n2 -O 1d_stencil_8
> {stack-trace}: {stack-trace}: 15 frames:
> 0x7fd94c021809 : hpx::termination_handler(int) + 0x159 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd948ad28d0 : ??? + 0x7fd948ad28d0 in /lib/x86_64-linux-gnu/libpthread.so.0
> 0x7fd94c0f1de5 : ??? + 0x7fd94c0f1de5 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c5dbb7b : ??? + 0x7fd94c5dbb7b in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c5f01cb : hpx::threads::detail::thread_pool<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::create_thread(hpx::threads::thread_init_data&, boost::intrusive_ptr<hpx::threads::thread_data>&, hpx::threads::thread_state_enum, bool, hpx::error_code&) + 0x5b in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c07cccc : hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::start(hpx::util::function<int (), false> const&, bool) + 0x3ac in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94c080bae : hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::run(hpx::util::function<int (), false> const&) + 0xe in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94befd3e4 : ??? + 0x7fd94befd3e4 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94bef84d3 : ??? + 0x7fd94bef84d3 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7fd94bef4ed2 : hpx::detail::run_or_start(hpx::util::function<int (boost::program_options::variables_map&), false> const&, boost::program_options::options_description const&, int, char**, std::vector<std::string, std::allocator<std::string> >&&, hpx::util::function<void (), false> const&, hpx::util::function<void (), false> const&, hpx::runtime_mode, bool) + 0x442 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x57ad66 : ??? + 0x57ad66 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x41dbc1 : ??? + 0x41dbc1 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x7fd9466f4b45 : __libc_start_main + 0xf5 in /lib/x86_64-linux-gnu/libc.so.6
> 0x41d5d9 : ??? + 0x41d5d9 in /opt/hpx/0.9.12/bin/1d_stencil_8
> {what}: Floating point exception
> 15 frames:
> 0x7f9a0404e809 : hpx::termination_handler(int) + 0x159 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a00aff8d0 : ??? + 0x7f9a00aff8d0 in /lib/x86_64-linux-gnu/libpthread.so.0
> 0x7f9a0411ede5 : ??? + 0x7f9a0411ede5 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a04608b7b : ??? + 0x7f9a04608b7b in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a0461d1cb : hpx::threads::detail::thread_pool<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::create_thread(hpx::threads::thread_init_data&, boost::intrusive_ptr<hpx::threads::thread_data>&, hpx::threads::thread_state_enum, bool, hpx::error_code&) + 0x5b in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a040a9ccc : hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::start(hpx::util::function<int (), false> const&, bool) + 0x3ac in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a040adbae : hpx::runtime_impl<hpx::threads::policies::local_priority_queue_scheduler<boost::mutex, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_fifo, hpx::threads::policies::lockfree_lifo> >::run(hpx::util::function<int (), false> const&) + 0xe in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a03f2a3e4 : ??? + 0x7f9a03f2a3e4 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a03f254d3 : ??? + 0x7f9a03f254d3 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x7f9a03f21ed2 : hpx::detail::run_or_start(hpx::util::function<int (boost::program_options::variables_map&), false> const&, boost::program_options::options_description const&, int, char**, std::vector<std::string, std::allocator<std::string> >&&, hpx::util::function<void (), false> const&, hpx::util::function<void (), false> const&, hpx::runtime_mode, bool) + 0x442 in /opt/hpx/0.9.12/lib/libhpx.so.0
> 0x57ad66 : ??? + 0x57ad66 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x41dbc1 : ??? + 0x41dbc1 in /opt/hpx/0.9.12/bin/1d_stencil_8
> 0x7f99fe721b45 : __libc_start_main + 0xf5 in /lib/x86_64-linux-gnu/libc.so.6
> 0x41d5d9 : ??? + 0x41d5d9 in /opt/hpx/0.9.12/bin/1d_stencil_8
> {what}: Floating point exception
> {config}:
> HPX_HAVE_NATIVE_TLS=ON
> HPX_HAVE_STACKTRACES=ON
> HPX_HAVE_COMPRESSION_BZIP2=OFF
> HPX_HAVE_COMPRESSION_SNAPPY=ON
> HPX_HAVE_COMPRESSION_ZLIB=OFF
> HPX_HAVE_PARCEL_COALESCING=ON
> HPX_HAVE_PARCELPORT_TCP=ON
> HPX_HAVE_PARCELPORT_MPI=ON (MPICH V3.1.2, MPI V3.0)
> HPX_HAVE_PARCELPORT_IPC=OFF
> HPX_HAVE_PARCELPORT_IBVERBS=OFF
> HPX_HAVE_VERIFY_LOCKS=OFF
> HPX_HAVE_HWLOC=ON
> HPX_HAVE_ITTNOTIFY=OFF
> HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
> HPX_PARCEL_MAX_CONNECTIONS=512
> HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
> HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
> HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
> HPX_HAVE_MALLOC=tcmalloc
> HPX_PREFIX (configured)=/opt/hpx/0.9.12
> HPX_PREFIX=/opt/hpx/0.9.12
> {version}: V0.9.12-trunk (AGAS: V3.0), Git: bbe65bbd48
> {boost}: V1.60.0
> {build-type}: release
> {date}: Jan 14 2016 20:16:12
> {platform}: linux
> {compiler}: Intel C++ C++0x mode version 1600
> {stdlib}: GNU libstdc++ version 20141220
> {config}:
> HPX_HAVE_NATIVE_TLS=ON
> HPX_HAVE_STACKTRACES=ON
> HPX_HAVE_COMPRESSION_BZIP2=OFF
> HPX_HAVE_COMPRESSION_SNAPPY=ON
> HPX_HAVE_COMPRESSION_ZLIB=OFF
> HPX_HAVE_PARCEL_COALESCING=ON
> HPX_HAVE_PARCELPORT_TCP=ON
> HPX_HAVE_PARCELPORT_MPI=ON (MPICH V3.1.2, MPI V3.0)
> HPX_HAVE_PARCELPORT_IPC=OFF
> HPX_HAVE_PARCELPORT_IBVERBS=OFF
> HPX_HAVE_VERIFY_LOCKS=OFF
> HPX_HAVE_HWLOC=ON
> HPX_HAVE_ITTNOTIFY=OFF
> HPX_HAVE_RUN_MAIN_EVERYWHERE=OFF
> HPX_PARCEL_MAX_CONNECTIONS=512
> HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
> HPX_INITIAL_AGAS_LOCAL_CACHE_SIZE=256
> HPX_AGAS_LOCAL_CACHE_SIZE_PER_THREAD=32
> HPX_HAVE_MALLOC=tcmalloc
> HPX_PREFIX (configured)=/opt/hpx/0.9.12
> HPX_PREFIX=/opt/hpx/0.9.12
> {version}: V0.9.12-trunk (AGAS: V3.0), Git: bbe65bbd48
> {boost}: V1.60.0
> {build-type}: release
> {date}: Jan 14 2016 20:16:12
> {platform}: linux
> {compiler}: Intel C++ C++0x mode version 1600
> {stdlib}: GNU libstdc++ version 20141220
> srun: error: hpc01: tasks 0-1: Aborted
>
>
> I'm totally at a loss here. I wouldn't put it past me to have
> mis-configured slurm, although it appears to me that other
> applications/code work fine with slurm (such as the slurm test set, as
> well as simple commands such as hostname). I can run the following command:
> "srun -N3 -n80 -O hostname", without any issues at all. I'm not sure that
> proves anything -- hpx is orders of magnitude more complex than 'hostname'.
>
> If it might help to solve my trouble, I can provide access to the cluster.
>
>
> Incidentally, cmake completes with a warning that:
>
> Manually-specified variables were not used by the project:
>
> HPX_WITH_ITTNOTIFY
> HPX_WITH_TESTS_EXTERNAL_BUILD
> HPX_WTIH_TESTS
>
> These variables are all noted in the current documentation.
>
> Thanks again for all of your help.
>
> Best regards,
> Michael
>
>
> > --
> > Thomas Heller
> > Friedrich-Alexander-Universität Erlangen-Nürnberg
> > Department Informatik - Lehrstuhl Rechnerarchitektur
> > Martensstr. 3
> > 91058 Erlangen
> > Tel.: 09131/85-27018
> > Fax: 09131/85-27912
> > Email: [email protected]
> > _______________________________________________
> > hpx-users mailing list
> > [email protected]
> > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
