I have seen that ROMIO breaks with the fix for #2014: many of the ROMIO
tests in ompi/mca/io/romio/romio/test/ now fail.
For example, noncontig_coll2 crashes with:

[inti15:28259] *** Process received signal ***
[inti15:28259] Signal: Segmentation fault (11)
[inti15:28259] Signal code: Address not mapped (1)
[inti15:28259] Failing at address: (nil)
[inti15:28259] [ 0] /lib64/libpthread.so.0 [0x3f19c0e4c0]
[inti15:28259] [ 1] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2b6640c74d79]
[inti15:28259] [ 2] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_rml_oob.so [0x2b663e2e6e92]
[inti15:28259] [ 3] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2b663e4f8e63]
[inti15:28259] [ 4] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2b663e4ff485]
[inti15:28259] [ 5] /home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_event_loop+0x5df) [0x2b663d3d92ff]
[inti15:28259] [ 6] /home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_progress+0x5e) [0x2b663d3ba33e]
[inti15:28259] [ 7] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0 [0x2b663ce26624]
[inti15:28259] [ 8] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2b664217fda2]
[inti15:28259] [ 9] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2b6642179966]
[inti15:28259] [10] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_Alltoall+0x6f) [0x2b663ce352ef]
[inti15:28259] [11] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_Calc_others_req+0x65) [0x2aaab1cfc525]
[inti15:28259] [12] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_GEN_WriteStridedColl+0x433) [0x2aaab1cf0ac3]
[inti15:28259] [13] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(MPIOI_File_write_all+0xc0) [0x2aaab1d0a8f0]
[inti15:28259] [14] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_write_all+0x23) [0x2aaab1d0a823]
[inti15:28259] [15] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so [0x2aaab1cedce9]
[inti15:28259] [16] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_File_write_all+0x4e) [0x2b663ce64f9e]
[inti15:28259] [17] ./noncontig_coll2(test_file+0x32b) [0x4034bb]
[inti15:28259] [18] ./noncontig_coll2(main+0x58b) [0x402d03]
[inti15:28259] [19] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f1901d974]
[inti15:28259] [20] ./noncontig_coll2 [0x4026c9]
[inti15:28259] *** End of error message ***

All the ROMIO tests pass without this fix.

Is there a problem in ROMIO with the datatype interface?

Pascal

Here is the export of the corresponding patch:

hg export 16301
# HG changeset patch
# User rusraink
# Date 1251912841 0
# Node ID eefd4bd4551969dc7454e63c2f42871cc9376a8f
# Parent  8aab76743e58474f1341be6f9d0ac9ae338507f1
 - This fixes #2014:
   As noted in
http://www.open-mpi.org/community/lists/devel/2009/08/6741.php,
   we do not correctly free a dupped predefined datatype.
   The fix is a bit more involving. See ticket for details.
   Tested with ibm tests and mpi_test_suite (though there's two "old" failures
   zero5.c and zero6.c)

   Thanks to Lisandro Dalcin for bringing this up.

diff -r 8aab76743e58 -r eefd4bd45519 ompi/datatype/ompi_datatype.h
--- a/ompi/datatype/ompi_datatype.h     Wed Sep 02 11:23:54 2009 +0000
+++ b/ompi/datatype/ompi_datatype.h     Wed Sep 02 17:34:01 2009 +0000
@@ -202,11 +202,14 @@
     }
     opal_datatype_clone ( &oldType->super, &new_ompi_datatype->super);

+    new_ompi_datatype->super.flags &= (~OMPI_DATATYPE_FLAG_PREDEFINED);
+
     /* Set the keyhash to NULL -- copying attributes is *only* done at
        the top level (specifically, MPI_TYPE_DUP). */
     new_ompi_datatype->d_keyhash = NULL;
     new_ompi_datatype->args = NULL;
-    strncpy (new_ompi_datatype->name, oldType->name, MPI_MAX_OBJECT_NAME);
+    snprintf (new_ompi_datatype->name, MPI_MAX_OBJECT_NAME, "Dup %s",
+              oldType->name);

     return OMPI_SUCCESS;
 }
diff -r 8aab76743e58 -r eefd4bd45519 opal/datatype/opal_datatype_clone.c
--- a/opal/datatype/opal_datatype_clone.c       Wed Sep 02 11:23:54 2009 +0000
+++ b/opal/datatype/opal_datatype_clone.c       Wed Sep 02 17:34:01 2009 +0000
@@ -33,9 +33,13 @@
 int32_t opal_datatype_clone( const opal_datatype_t * src_type, opal_datatype_t * dest_type )
 {
     int32_t desc_length = src_type->desc.used + 1;  /* +1 because of the fake OPAL_DATATYPE_END_LOOP entry */
-    dt_elem_desc_t* temp = dest_type->desc.desc; /* temporary copy of the desc pointer */
+    dt_elem_desc_t* temp = dest_type->desc.desc;    /* temporary copy of the desc pointer */

-    memcpy( dest_type, src_type, sizeof(opal_datatype_t) );
+    /* copy _excluding_ the super object, we want to keep the cls_destruct_array */
+    memcpy( dest_type+sizeof(opal_object_t),
+            src_type+sizeof(opal_object_t),
+            sizeof(opal_datatype_t)-sizeof(opal_object_t) );
+
     dest_type->super.obj_reference_count = 1;
     dest_type->desc.desc = temp;
     dest_type->flags &= (~OPAL_DATATYPE_FLAG_PREDEFINED);
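
For reference, the "copy everything except the leading object header"
idea described in the patch comment can be sketched in isolation as
below. This is only an illustrative sketch: object_t, datatype_t and
clone_excluding_super are hypothetical stand-ins, not the Open MPI types
or functions. It computes the offsets on char * pointers, because in C
adding an integer to a typed pointer advances by that many elements
rather than by that many bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an object header such as opal_object_t. */
typedef struct {
    void *cls;              /* class pointer the destination must keep */
    int   reference_count;
} object_t;

/* Hypothetical stand-in for a type that embeds the header first. */
typedef struct {
    object_t super;         /* must be the first member */
    int      flags;
    size_t   size;
    char     name[16];
} datatype_t;

/* Copy every byte after the leading object_t header and leave
 * dest->super untouched. The casts to char * make the offset a byte
 * offset; without them the pointer would advance by
 * sizeof(object_t) whole datatype_t elements. */
static void clone_excluding_super(const datatype_t *src, datatype_t *dest)
{
    assert(src != NULL && dest != NULL);
    memcpy((char *)dest + sizeof(object_t),
           (const char *)src + sizeof(object_t),
           sizeof(datatype_t) - sizeof(object_t));
}
```

Under these assumptions, cloning a source into a destination copies
flags, size and name while the destination's super member keeps its own
class pointer and reference count.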


