Hi,

From the stack trace, this is a segfault coming directly out of MPI, which
is then caught by our signal handlers.
In theory, there shouldn't be any problem with having multiple MPI
libraries running within HPX. The HPX parcelport tries to be a good citizen
and creates its own communicator. The problematic part, however, might be
that you either have multiple calls to MPI_Init (HPX itself should handle
that correctly), or that the MPI implementation you are using is not thread
safe. HPX drives MPI from all of its worker threads. To keep
non-thread-safe MPI implementations from crashing, we use a lock to protect
each and every call into MPI (
https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/plugins/parcelport/mpi/mpi_environment.hpp#L42).
If you wrap your mpi4py calls in that lock, it might just work.
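
For illustration, here is a minimal sketch of what that could look like on
the C++ side (locked_bcast is a hypothetical name, and the sketch assumes
the scoped_lock helper declared in the header linked above; you would still
have to expose such a wrapper to your Python layer yourself, e.g. via
Boost.Python):

    #include <hpx/plugins/parcelport/mpi/mpi_environment.hpp>
    #include <mpi.h>

    // Hypothetical wrapper, not part of HPX: serializes a raw MPI call
    // against HPX's own MPI traffic by taking the same lock the MPI
    // parcelport uses internally.
    void locked_bcast(void* buf, int count, MPI_Datatype type, int root,
        MPI_Comm comm)
    {
        // The lock is held for the lifetime of this object and released
        // automatically on scope exit.
        hpx::util::mpi_environment::scoped_lock lk;
        MPI_Bcast(buf, count, type, root, comm);
    }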

The suspension of the runtime should work as well. As soon as all worker
threads are suspended, there won't be any calls to MPI anymore. There still
might be incoming messages from other localities, but that shouldn't be a
problem.
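
If you end up trying that, the usage would be roughly the following (again
just a sketch: hpx::suspend() and hpx::resume() live in
<hpx/hpx_suspend.hpp>, and run_mpi_phase is an illustrative name):

    #include <hpx/hpx_suspend.hpp>

    // Quiesce all HPX worker threads before handing MPI over to another
    // library, then wake them up again afterwards.
    void run_mpi_phase()
    {
        hpx::suspend();   // returns once the worker threads are asleep
        // ... perform the mpi4py / plain-MPI communication here ...
        hpx::resume();    // restart the HPX worker threads
    }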

I hope that sheds some light on the problem.


On Tue, Oct 23, 2018 at 11:37 PM Simberg Mikael <simbe...@cscs.ch> wrote:

> Hi,
>
> hopefully someone else can chime in on the MPI and Python side of things,
> but thought I'd comment shortly on the runtime suspension since I
> implemented it.
>
> The reason for requiring only a single locality for runtime suspension
> is simply that I never tested it with multiple localities. It may very well
> already work with multiple localities, but I didn't want users to get the
> impression that it's a well-tested feature. So if this is indeed useful for
> you, you could try removing the check (you probably already found it; let me
> know if that's not the case) and rebuilding HPX.
>
> I suspect though that runtime suspension won't help you here since it
> doesn't actually disable MPI or anything else. All it does is put the HPX
> worker threads to sleep once all work is completed.
>
> In this case there might be a problem with our MPI parcelport interfering
> with mpi4py. It's not entirely clear to me if you want to use the
> networking features of HPX in addition to MPI. If not, you can also build
> HPX with HPX_WITH_NETWORKING=OFF, which will... disable networking. This
> branch is also meant to disable some networking related features at runtime
> if you're only using one locality:
> https://github.com/STEllAR-GROUP/hpx/pull/3486.
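>
> (If you go that route, passing -DHPX_WITH_NETWORKING=OFF to cmake when
> configuring HPX and rebuilding should be all that's needed.)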
>
> Kind regards,
> Mikael
> ------------------------------
> *From:* hpx-users-boun...@stellar.cct.lsu.edu [
> hpx-users-boun...@stellar.cct.lsu.edu] on behalf of Vance, James [
> va...@uni-mainz.de]
> *Sent:* Tuesday, October 23, 2018 4:38 PM
> *To:* hpx-users@stellar.cct.lsu.edu
> *Subject:* [hpx-users] Segmentation fault with mpi4py
>
> Hi everyone,
>
> I am trying to gradually port the molecular dynamics code Espresso++ from
> its current pure-MPI form to one that uses HPX for the critical parts of
> the code. It consists of a C++ and MPI-based shared library that can be
> imported into Python using the Boost.Python library, a collection of Python
> modules, and an mpi4py-based library for communication among the Python
> processes.
>
> I was able to properly initialize and terminate the HPX runtime
> environment from Python using the methods
> in hpx/examples/quickstart/init_globally.cpp
> and phylanx/python/src/init_hpx.cpp. However, when I use mpi4py to perform
> MPI-based communication from within a Python script that also runs HPX, I
> encounter a segmentation fault with the following trace:
>
> ---------------------------------
> {stack-trace}: 21 frames:
> 0x2abc616b08f2  : ??? + 0x2abc616b08f2 in
> /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
> 0x2abc616ad06c  : hpx::termination_handler(int) + 0x15c in
> /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
> 0x2abc5979b370  : ??? + 0x2abc5979b370 in /lib64/libpthread.so.0
> 0x2abc62755a76  : mca_pml_cm_recv_request_completion + 0xb6 in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
> 0x2abc626f4ac9  : ompi_mtl_psm2_progress + 0x59 in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
> 0x2abc63383eec  : opal_progress + 0x3c in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20
> 0x2abc62630a75  : ompi_request_default_wait + 0x105 in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
> 0x2abc6267be92  : ompi_coll_base_bcast_intra_generic + 0x5b2 in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
> 0x2abc6267c262  : ompi_coll_base_bcast_intra_binomial + 0xb2 in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
> 0x2abc6268803b  : ompi_coll_tuned_bcast_intra_dec_fixed + 0xcb in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
> 0x2abc62642bc0  : PMPI_Bcast + 0x1a0 in
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
> 0x2abc64cea17f  : ??? + 0x2abc64cea17f in
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so
> 0x2abc59176f9b  : PyEval_EvalFrameEx + 0x923b in
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
> 0x2abc5917879a  : PyEval_EvalCodeEx + 0x87a in
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
> 0x2abc59178ba9  : PyEval_EvalCode + 0x19 in
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
> 0x2abc5919cb4a  : PyRun_FileExFlags + 0x8a in
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
> 0x2abc5919df25  : PyRun_SimpleFileExFlags + 0xd5 in
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
> 0x2abc591b44e1  : Py_Main + 0xc61 in
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
> 0x2abc59bccb35  : __libc_start_main + 0xf5 in /lib64/libc.so.6
> 0x40071e        : ??? + 0x40071e in python
> {what}: Segmentation fault
> {config}:
>   HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF
>   HPX_WITH_APEX=OFF
>   HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF
>   HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON
>   HPX_WITH_CXX14_RETURN_TYPE_DEDUCTION=TRUE
>   HPX_WITH_DEPRECATION_WARNINGS=ON
>   HPX_WITH_GOOGLE_PERFTOOLS=OFF
>   HPX_WITH_INCLUSIVE_SCAN_COMPATIBILITY=ON
>   HPX_WITH_IO_COUNTERS=ON
>   HPX_WITH_IO_POOL=ON
>   HPX_WITH_ITTNOTIFY=OFF
>   HPX_WITH_LOGGING=ON
>   HPX_WITH_MORE_THAN_64_THREADS=OFF
>   HPX_WITH_NATIVE_TLS=ON
>   HPX_WITH_NETWORKING=ON
>   HPX_WITH_PAPI=OFF
>   HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF
>   HPX_WITH_PARCELPORT_LIBFABRIC=OFF
>   HPX_WITH_PARCELPORT_MPI=ON
>   HPX_WITH_PARCELPORT_MPI_MULTITHREADED=ON
>   HPX_WITH_PARCELPORT_TCP=ON
>   HPX_WITH_PARCELPORT_VERBS=OFF
>   HPX_WITH_PARCEL_COALESCING=ON
>   HPX_WITH_PARCEL_PROFILING=OFF
>   HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF
>   HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF
>   HPX_WITH_STACKTRACES=ON
>   HPX_WITH_SWAP_CONTEXT_EMULATION=OFF
>   HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF
>   HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF
>   HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON
>   HPX_WITH_THREAD_DEBUG_INFO=OFF
>   HPX_WITH_THREAD_DESCRIPTION_FULL=OFF
>   HPX_WITH_THREAD_GUARD_PAGE=ON
>   HPX_WITH_THREAD_IDLE_RATES=ON
>   HPX_WITH_THREAD_LOCAL_STORAGE=OFF
>   HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON
>   HPX_WITH_THREAD_QUEUE_WAITTIME=OFF
>   HPX_WITH_THREAD_STACK_MMAP=ON
>   HPX_WITH_THREAD_STEALING_COUNTS=ON
>   HPX_WITH_THREAD_TARGET_ADDRESS=OFF
>   HPX_WITH_TIMER_POOL=ON
>   HPX_WITH_TUPLE_RVALUE_SWAP=ON
>   HPX_WITH_UNWRAPPED_COMPATIBILITY=ON
>   HPX_WITH_VALGRIND=OFF
>   HPX_WITH_VERIFY_LOCKS=OFF
>   HPX_WITH_VERIFY_LOCKS_BACKTRACE=OFF
>   HPX_WITH_VERIFY_LOCKS_GLOBALLY=OFF
>
>   HPX_PARCEL_MAX_CONNECTIONS=512
>   HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
>   HPX_AGAS_LOCAL_CACHE_SIZE=4096
>   HPX_HAVE_MALLOC=JEMALLOC
>   HPX_PREFIX
> (configured)=/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install
>
> HPX_PREFIX=/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install
> {version}: V1.1.0-rc1 (AGAS: V3.0), Git: unknown
> {boost}: V1.65.1
> {build-type}: release
> {date}: Sep 25 2018 11:01:34
> {platform}: linux
> {compiler}: GNU C++ version 6.3.0
> {stdlib}: GNU libstdc++ version 20161221
> [login21:18535] *** Process received signal ***
> [login21:18535] Signal: Aborted (6)
> [login21:18535] Signal code:  (-6)
> [login21:18535] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2abc5979b370]
> [login21:18535] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2abc59be01d7]
> [login21:18535] [ 2] /lib64/libc.so.6(abort+0x148)[0x2abc59be18c8]
> [login21:18535] [ 3]
> /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1(_ZN3hpx19termination_handlerEi+0x213)[0x2abc616ad123]
> [login21:18535] [ 4] /lib64/libpthread.so.0(+0xf370)[0x2abc5979b370]
> [login21:18535] [ 5]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(mca_pml_cm_recv_request_completion+0xb6)[0x2abc62755a76]
> [login21:18535] [ 6]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_mtl_psm2_progress+0x59)[0x2abc626f4ac9]
> [login21:18535] [ 7]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20(opal_progress+0x3c)[0x2abc63383eec]
> [login21:18535] [ 8]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_request_default_wait+0x105)[0x2abc62630a75]
> [login21:18535] [ 9]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x5b2)[0x2abc6267be92]
> [login21:18535] [10]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0xb2)[0x2abc6267c262]
> [login21:18535] [11]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_tuned_bcast_intra_dec_fixed+0xcb)[0x2abc6268803b]
> [login21:18535] [12]
> /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(PMPI_Bcast+0x1a0)[0x2abc62642bc0]
> [login21:18535] [13]
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so(+0xa517f)[0x2abc64cea17f]
> [login21:18535] [14]
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x923b)[0x2abc59176f9b]
> [login21:18535] [15]
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x87a)[0x2abc5917879a]
> [login21:18535] [16]
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x2abc59178ba9]
> [login21:18535] [17]
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x8a)[0x2abc5919cb4a]
> [login21:18535] [18]
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xd5)[0x2abc5919df25]
> [login21:18535] [19]
> /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(Py_Main+0xc61)[0x2abc591b44e1]
> [login21:18535] [20]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x2abc59bccb35]
> [login21:18535] [21] python[0x40071e]
> [login21:18535] *** End of error message ***
> ---------------------------------
>
> I think this error is related to
> https://github.com/STEllAR-GROUP/hpx/issues/949 and
> https://github.com/STEllAR-GROUP/hpx/pull/3129, so maybe the suspend and
> resume functions could be used. However, the documentation says this can
> only be done with one locality.
>
> Does anyone know of a way to still perform interprocess communication
> from Python, separately from the communication layer provided by HPX?
> Thanks!
>
> Best Regards,
>
> James Vance
>
>
_______________________________________________
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users