I have seen that ROMIO breaks with the fix for #2014: a lot of the ROMIO tests in ompi/mca/io/romio/romio/test/ are failing.

For example, with noncontig_coll2:

[inti15:28259] *** Process received signal ***
[inti15:28259] Signal: Segmentation fault (11)
[inti15:28259] Signal code: Address not mapped (1)
[inti15:28259] Failing at address: (nil)
[inti15:28259] [ 0] /lib64/libpthread.so.0 [0x3f19c0e4c0]
[inti15:28259] [ 1] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2b6640c74d79]
[inti15:28259] [ 2] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_rml_oob.so [0x2b663e2e6e92]
[inti15:28259] [ 3] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2b663e4f8e63]
[inti15:28259] [ 4] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2b663e4ff485]
[inti15:28259] [ 5] /home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_event_loop+0x5df) [0x2b663d3d92ff]
[inti15:28259] [ 6] /home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_progress+0x5e) [0x2b663d3ba33e]
[inti15:28259] [ 7] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0 [0x2b663ce26624]
[inti15:28259] [ 8] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2b664217fda2]
[inti15:28259] [ 9] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2b6642179966]
[inti15:28259] [10] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_Alltoall+0x6f) [0x2b663ce352ef]
[inti15:28259] [11] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_Calc_others_req+0x65) [0x2aaab1cfc525]
[inti15:28259] [12] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_GEN_WriteStridedColl+0x433) [0x2aaab1cf0ac3]
[inti15:28259] [13] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(MPIOI_File_write_all+0xc0) [0x2aaab1d0a8f0]
[inti15:28259] [14] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_write_all+0x23) [0x2aaab1d0a823]
[inti15:28259] [15] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so [0x2aaab1cedce9]
[inti15:28259] [16] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_File_write_all+0x4e) [0x2b663ce64f9e]
[inti15:28259] [17] ./noncontig_coll2(test_file+0x32b) [0x4034bb]
[inti15:28259] [18] ./noncontig_coll2(main+0x58b) [0x402d03]
[inti15:28259] [19] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f1901d974]
[inti15:28259] [20] ./noncontig_coll2 [0x4026c9]
[inti15:28259] *** End of error message ***

All the ROMIO tests pass without this fix.

Is there a problem in ROMIO with the datatype interface?

Pascal

Here is the export of the corresponding patch:

hg export 16301
# HG changeset patch
# User rusraink
# Date 1251912841 0
# Node ID eefd4bd4551969dc7454e63c2f42871cc9376a8f
# Parent 8aab76743e58474f1341be6f9d0ac9ae338507f1
- This fixes #2014:
  As noted in
  http://www.open-mpi.org/community/lists/devel/2009/08/6741.php,
  we do not correctly free a dupped predefined datatype.
  The fix is a bit more involving. See ticket for details.

  Tested with ibm tests and mpi_test_suite (though there's two "old" failures
  zero5.c and zero6.c)

  Thanks to Lisandro Dalcin for bringing this up.

diff -r 8aab76743e58 -r eefd4bd45519 ompi/datatype/ompi_datatype.h
--- a/ompi/datatype/ompi_datatype.h Wed Sep 02 11:23:54 2009 +0000
+++ b/ompi/datatype/ompi_datatype.h Wed Sep 02 17:34:01 2009 +0000
@@ -202,11 +202,14 @@
     }
     opal_datatype_clone ( &oldType->super, &new_ompi_datatype->super);
 
+    new_ompi_datatype->super.flags &= (~OMPI_DATATYPE_FLAG_PREDEFINED);
+
     /* Set the keyhash to NULL -- copying attributes is *only* done at
        the top level (specifically, MPI_TYPE_DUP). */
     new_ompi_datatype->d_keyhash = NULL;
     new_ompi_datatype->args = NULL;
-    strncpy (new_ompi_datatype->name, oldType->name, MPI_MAX_OBJECT_NAME);
+    snprintf (new_ompi_datatype->name, MPI_MAX_OBJECT_NAME, "Dup %s",
+              oldType->name);
 
     return OMPI_SUCCESS;
 }
diff -r 8aab76743e58 -r eefd4bd45519 opal/datatype/opal_datatype_clone.c
--- a/opal/datatype/opal_datatype_clone.c Wed Sep 02 11:23:54 2009 +0000
+++ b/opal/datatype/opal_datatype_clone.c Wed Sep 02 17:34:01 2009 +0000
@@ -33,9 +33,13 @@
 int32_t opal_datatype_clone( const opal_datatype_t * src_type, opal_datatype_t * dest_type )
 {
     int32_t desc_length = src_type->desc.used + 1; /* +1 because of the fake OPAL_DATATYPE_END_LOOP entry */
-    dt_elem_desc_t* temp = dest_type->desc.desc; /* temporary copy of the desc pointer */
+    dt_elem_desc_t* temp = dest_type->desc.desc; /* temporary copy of the desc pointer */
 
-    memcpy( dest_type, src_type, sizeof(opal_datatype_t) );
+    /* copy _excluding_ the super object, we want to keep the cls_destruct_array */
+    memcpy( dest_type+sizeof(opal_object_t),
+            src_type+sizeof(opal_object_t),
+            sizeof(opal_datatype_t)-sizeof(opal_object_t) );
+    dest_type->super.obj_reference_count = 1;
     dest_type->desc.desc = temp;
     dest_type->flags &= (~OPAL_DATATYPE_FLAG_PREDEFINED);
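
One thing that may be worth checking in the opal_datatype_clone.c hunk: dest_type and src_type are pointers to opal_datatype_t, so in C the expression dest_type+sizeof(opal_object_t) advances by sizeof(opal_object_t) whole opal_datatype_t elements, not by that many bytes. If that is what happens here, the memcpy would read and write well outside both objects, which might explain a crash like the one above; I have not verified that this is the actual cause. The sketch below uses hypothetical stand-in structs (fake_object_t and fake_datatype_t, not the real Open MPI definitions) and hypothetical helper names to show the two offsets, plus a copy that skips the leading super member by casting to char * first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the real opal_object_t / opal_datatype_t. */
typedef struct { void *cls; int refcount; } fake_object_t;
typedef struct { fake_object_t super; int flags; int size; } fake_datatype_t;

/* Byte distance covered by (fake_datatype_t *)p + sizeof(fake_object_t):
 * pointer arithmetic scales by the size of the pointed-to type. */
static size_t typed_pointer_offset_in_bytes(void)
{
    return sizeof(fake_object_t) * sizeof(fake_datatype_t);
}

/* Byte distance covered by (char *)p + sizeof(fake_object_t). */
static size_t byte_pointer_offset_in_bytes(void)
{
    return sizeof(fake_object_t);
}

/* Copy everything that follows the leading 'super' member, leaving the
 * destination's own 'super' (class pointer, reference count) untouched. */
static void clone_excluding_super(const fake_datatype_t *src,
                                  fake_datatype_t *dst)
{
    assert(src != NULL && dst != NULL);
    memcpy((char *)dst + sizeof(fake_object_t),
           (const char *)src + sizeof(fake_object_t),
           sizeof(fake_datatype_t) - sizeof(fake_object_t));
}
```

With these stand-ins, calling clone_excluding_super(&src, &dst) leaves dst.super as it was and copies flags and size from src, while the typed-pointer offset is sizeof(fake_datatype_t) times larger than the intended byte offset.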