Oops, sent this to the wrong list, forwarding here...

---------- Forwarded message ----------
From: Lisandro Dalcin <dalc...@gmail.com>
Date: Jul 11, 2007 8:58 PM
Subject: failures running mpi4py testsuite, perhaps Comm.Split()
To: Open MPI <b...@open-mpi.org>
Hello all, after a long time I'm here again. I am improving mpi4py in order to support MPI threads, and I've found some problems with the latest version, 1.2.3.

I configured with:

$ ./configure --prefix /usr/local/openmpi/1.2.3 --enable-mpi-threads --disable-dependency-tracking

However, for the failure below, MPI_Init_thread() was not used.

The failing test creates an intercommunicator by using Comm.Split() followed by Intracomm.Create_intercomm() (a rough sketch of what the test does follows the trace below). When running on two or more procs (for one proc this test is skipped), I sometimes get the following trace:

[trantor:06601] *** Process received signal ***
[trantor:06601] Signal: Segmentation fault (11)
[trantor:06601] Signal code: Address not mapped (1)
[trantor:06601] Failing at address: 0xa8
[trantor:06601] [ 0] [0x958440]
[trantor:06601] [ 1] /usr/local/openmpi/1.2.3/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x1483) [0x995553]
[trantor:06601] [ 2] /usr/local/openmpi/1.2.3/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x36) [0x645d06]
[trantor:06601] [ 3] /usr/local/openmpi/1.2.3/lib/libopen-pal.so.0(opal_progress+0x58) [0x1a2c88]
[trantor:06601] [ 4] /usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_request_wait_all+0xea) [0x140a8a]
[trantor:06601] [ 5] /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0xc8) [0x22d6e8]
[trantor:06601] [ 6] /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_bruck+0xf2) [0x231ca2]
[trantor:06601] [ 7] /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_dec_fixed+0x8b) [0x22db7b]
[trantor:06601] [ 8] /usr/local/openmpi/1.2.3/lib/libmpi.so.0(ompi_comm_split+0x9d) [0x12d92d]
[trantor:06601] [ 9] /usr/local/openmpi/1.2.3/lib/libmpi.so.0(MPI_Comm_split+0xad) [0x15a53d]
[trantor:06601] [10] /u/dalcinl/lib/python/mpi4py/_mpi.so [0x508500]
[trantor:06601] [11] /usr/local/lib/libpython2.5.so.1.0(PyCFunction_Call+0x14d) [0xe150ad]
[trantor:06601] [12] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x64af) [0xe626bf]
[trantor:06601] [13] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [14] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x5a43) [0xe61c53]
[trantor:06601] [15] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x6130) [0xe62340]
[trantor:06601] [16] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [17] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] [18] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [19] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x42eb) [0xe604fb]
[trantor:06601] [20] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [21] /usr/local/lib/libpython2.5.so.1.0 [0xe0137a]
[trantor:06601] [22] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [23] /usr/local/lib/libpython2.5.so.1.0 [0xde6de5]
[trantor:06601] [24] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [25] /usr/local/lib/libpython2.5.so.1.0 [0xe2abc9]
[trantor:06601] [26] /usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x37) [0xddf5c7]
[trantor:06601] [27] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x1481) [0xe5d691]
[trantor:06601] [28] /usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x7c4) [0xe63814]
[trantor:06601] [29] /usr/local/lib/libpython2.5.so.1.0 [0xe01450]
[trantor:06601] *** End of error message ***
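For reference, here is a minimal sketch of roughly what that failing test does. This is my reconstruction, not the literal testsuite code; the variable names and, in particular, the choice of local/remote leaders are assumptions.

from mpi4py import MPI

# sketch of the Split + Create_intercomm pattern (needs at least 2 procs)
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

# split the world into two halves
if rank < size // 2:
    color = 0
else:
    color = 1
intracomm = comm.Split(color, key=rank)

# rank 0 within each half acts as local leader; the remote leader is the
# first world rank of the other half (an assumed, typical choice)
local_leader = 0
if color == 0:
    remote_leader = size // 2
else:
    remote_leader = 0

intercomm = intracomm.Create_intercomm(local_leader, comm, remote_leader, tag=0)
print 'intercomm created'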
As the problem seems to originate in Comm.Split(), I've written a small Python script to test it:

from mpi4py import MPI

# true MPI_COMM_WORLD handle
BASECOMM = MPI.__COMM_WORLD__

BASE_SIZE = BASECOMM.Get_size()
BASE_RANK = BASECOMM.Get_rank()

if BASE_RANK < (BASE_SIZE // 2):
    COLOR = 0
else:
    COLOR = 1

INTRACOMM = BASECOMM.Split(COLOR, key=0)

print 'Done!!!'

This always seems to work, but running it under valgrind (note that 'valgrind-py' below is just an alias adding a suppression file for Python), I get the following:

$ mpiexec -n 3 valgrind-py python test.py
==6727== Warning: set address range perms: large range 134217728 (defined)
==6727== Source and destination overlap in memcpy(0x4C93EA0, 0x4C93EA8, 16)
==6727==    at 0x4006CE6: memcpy (mc_replace_strmem.c:116)
==6727==    by 0x46C59CA: ompi_ddt_copy_content_same_ddt (in /usr/local/openmpi/1.2.3/lib/libmpi.so.0.0.0)
==6727==    by 0x4BADDCE: ompi_coll_tuned_allgather_intra_bruck (in /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so)
==6727==    by 0x4BA9B7A: ompi_coll_tuned_allgather_intra_dec_fixed (in /usr/local/openmpi/1.2.3/lib/openmpi/mca_coll_tuned.so)
==6727==    by 0x46A692C: ompi_comm_split (in /usr/local/openmpi/1.2.3/lib/libmpi.so.0.0.0)
==6727==    by 0x46D353C: PMPI_Comm_split (in /usr/local/openmpi/1.2.3/lib/libmpi.so.0.0.0)
==6727==    by 0x46754FF: comm_split (in /u/dalcinl/lib/python/mpi4py/_mpi.so)
==6727==    by 0x407D0AC: PyCFunction_Call (methodobject.c:108)
==6727==    by 0x40CA6BE: PyEval_EvalFrameEx (ceval.c:3564)
==6727==    by 0x40CB813: PyEval_EvalCodeEx (ceval.c:2831)
==6727==    by 0x40C9C52: PyEval_EvalFrameEx (ceval.c:3660)
==6727==    by 0x40CB813: PyEval_EvalCodeEx (ceval.c:2831)
Done!!!
Done!!!
Done!!!

I hope you can figure out what is going on. If you need additional info or tests, let me know. I have other issues, but those are for tomorrow.

Regards,

--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594