Ralph,

The cxx_win_attr issue is being addressed in https://github.com/open-mpi/ompi/pull/1473

IIRC, only big-endian systems and/or configurations where sizeof(Fortran INTEGER) > sizeof(int) are affected.
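
For context, here is a minimal C sketch of the pitfall behind that kind of bug (my own illustration, not the actual test or the fix; the failing test itself uses the C++ bindings). Per the MPI standard, MPI_Win_get_attr returns the MPI_WIN_DISP_UNIT value through a pointer to int in C, so an implementation that stores the value internally in a wider integer type must convert it rather than just pointer-cast it; a plain cast happens to work on little-endian machines but reads the wrong (zero) bytes on big-endian ones:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Win win;
        int buf, *disp_unit, flag;

        MPI_Init(&argc, &argv);
        MPI_Win_create(&buf, sizeof(buf), (int)sizeof(buf) /* disp unit */,
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* The standard says attribute_val receives a pointer to int here;
         * a buggy implementation returning a pointer into wider internal
         * storage makes this dereference endian-dependent. */
        MPI_Win_get_attr(win, MPI_WIN_DISP_UNIT, &disp_unit, &flag);
        if (flag && *disp_unit != (int)sizeof(buf)) {
            fprintf(stderr, "Got wrong value for disp unit: %d\n", *disp_unit);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }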


The second error seems a bit weirder.

Once in a while, MPI_File_open fails, and when it does, it always fails silently.

In that case (MPI_File_open has failed), if --mca mpi_param_check true is set, then subsequent MPI-IO calls also fail silently.

If --mca mpi_param_check false is set (or Open MPI was configure'd with --without-mpi-param-check),

then something goes wrong in MPI_File_close.
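
That "silently" is actually what the MPI standard prescribes: errors raised by MPI_File_open go to the error handler attached to MPI_FILE_NULL, which defaults to MPI_ERRORS_RETURN, whereas communicators default to MPI_ERRORS_ARE_FATAL. A minimal sketch of how a program can make such failures fatal instead (my illustration; the file name is arbitrary):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;

        MPI_Init(&argc, &argv);

        /* MPI_File_open errors are raised on the error handler attached to
         * MPI_FILE_NULL, which defaults to MPI_ERRORS_RETURN; overriding it
         * makes a failed open abort instead of failing silently. */
        MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_ARE_FATAL);

        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }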



That raises several questions ...

- why is the default MPI-IO behavior to fail silently?

(point-to-point and collective operations abort by default)

- why does MPI_File_open fail once in a while?

(an Open MPI bug? a ROMIO bug? an intermittent failure caused by the NFS filesystem?)

- is there a bug in the test?

(for example, the program could exit with code 77 (skip) if MPI_File_open fails; a sketch follows)
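
A sketch of that last suggestion, assuming the usual automake convention where exit code 77 marks a test as skipped (the file name and overall structure are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rc;

        MPI_Init(&argc, &argv);

        /* Check the return code instead of relying on the error handler. */
        rc = MPI_File_open(MPI_COMM_WORLD, "testfile",
                           MPI_MODE_CREATE | MPI_MODE_RDWR,
                           MPI_INFO_NULL, &fh);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "MPI_File_open failed, skipping test\n");
            MPI_Finalize();
            exit(77); /* 77 = "skipped" to the test harness */
        }

        /* ... the actual test would go here ... */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }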

Cheers,


Gilles


On 5/26/2016 11:14 PM, Ralph Castain wrote:
I’m seeing three errors in MTT today - of these, I only consider the first two to be of significant concern:

onesided/cxx_win_attr: https://mtt.open-mpi.org/index.php?do_redir=2326
[**ERROR**]: MPI_COMM_WORLD rank 0, file cxx_win_attr.cc:50:
Win::Get_attr: Got wrong value for disp unit
--------------------------------------------------------------------------
datatype/idx_null: https://mtt.open-mpi.org/index.php?do_redir=2327
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x82)[0x2aaaab7ef70a]
[mpi031:06729] [ 2]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x96)[0x2aaaab7ee047]
[mpi031:06729] [ 3]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(+0xd0ed8)[0x2aaaab7eced8]
[mpi031:06729] [ 4]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(ompi_file_close+0x101)[0x2aaaaab2963c]
[mpi031:06729] [ 5]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(PMPI_File_close+0x18)[0x2aaaaab83216]
[mpi031:06729] [ 6] datatype/idx_null[0x400cb2]
[mpi031:06729] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c2f21ed1d]
[mpi031:06729] [ 8] datatype/idx_null[0x400a89]
[mpi031:06729] *** End of error message ***
[mpi031:06732] *** Process received signal ***
[mpi031:06732] Signal: Segmentation fault (11)
[mpi031:06732] Signal code: Address not mapped (1)
[mpi031:06732] Failing at address: 0x2ab2aba3cea0
[mpi031:06732] [ 0] /lib64/libpthread.so.0[0x3c2f60f710]
[mpi031:06732] [ 1]
dynamic/loop_spawn: https://mtt.open-mpi.org/index.php?do_redir=2328
[p10a601:159913] too many retries sending message to 0x000b:0x00427ad6, giving up
-------------------------------------------------------
Child job 8 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------


