Ralph,
the cxx_win_attr issue is dealt with at
https://github.com/open-mpi/ompi/pull/1473
iirc, only big endian platforms and/or those where sizeof(Fortran INTEGER) > sizeof(int) are
impacted.
the second error seems a bit weirder.
once in a while, MPI_File_open fails, and when it fails, it always fails
silently.
in this case (MPI_File_open failed), if --mca mpi_param_check true, then
subsequent MPI-IO calls will also fail silently.
if --mca mpi_param_check false (or Open MPI was configure'd with
--without-mpi-param-check),
then something will go wrong in MPI_File_close
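to make the two behaviors easy to compare, here is how the test could be run both ways (binary path is illustrative, taken from the MTT report below):

```shell
# subsequent MPI-IO calls fail silently after the failed MPI_File_open
mpirun -np 2 --mca mpi_param_check true  ./datatype/idx_null

# without parameter checking, the run crashes later in MPI_File_close
mpirun -np 2 --mca mpi_param_check false ./datatype/idx_null
```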
that raises several questions ...
- why is the MPI-IO default behavior to fail silently ?
(point-to-point or collective operations abort by default)
- why does MPI_File_open fail once in a while ?
(an Open MPI bug ? a ROMIO bug ? an intermittent failure caused by the NFS
filesystem ?)
- is there a bug in the test ?
for example, the program could abort with error code 77 (skip) if
MPI_File_open fails
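for reference, the MPI standard makes MPI_ERRORS_RETURN the default error handler for files (unlike communicators, which default to MPI_ERRORS_ARE_FATAL), which is why a failed MPI_File_open goes unnoticed unless the return code is checked. a minimal sketch of what the test could do (filename and open flags are illustrative, not taken from the actual test):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_File fh;
    int rc;

    MPI_Init(&argc, &argv);

    /* files default to MPI_ERRORS_RETURN, so the return code of
     * MPI_File_open must be checked explicitly */
    rc = MPI_File_open(MPI_COMM_WORLD, "testfile",
                       MPI_MODE_CREATE | MPI_MODE_RDWR,
                       MPI_INFO_NULL, &fh);
    if (MPI_SUCCESS != rc) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_File_open failed: %s\n", msg);
        MPI_Finalize();
        return 77; /* automake convention: report the test as skipped */
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```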
Cheers,
Gilles
On 5/26/2016 11:14 PM, Ralph Castain wrote:
I’m seeing three errors in MTT today - of these, I only consider the
first two to be of significant concern:
onesided/cxx_win_attr :https://mtt.open-mpi.org/index.php?do_redir=2326
[**ERROR**]: MPI_COMM_WORLD rank 0, file cxx_win_attr.cc:50:
Win::Get_attr: Got wrong value for disp unit
--------------------------------------------------------------------------
datatype/idx_null :https://mtt.open-mpi.org/index.php?do_redir=2327
home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x82)[0x2aaaab7ef70a]
[mpi031:06729] [ 2] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x96)[0x2aaaab7ee047]
[mpi031:06729] [ 3] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(+0xd0ed8)[0x2aaaab7eced8]
[mpi031:06729] [ 4] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(ompi_file_close+0x101)[0x2aaaaab2963c]
[mpi031:06729] [ 5] /home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(PMPI_File_close+0x18)[0x2aaaaab83216]
[mpi031:06729] [ 6] datatype/idx_null[0x400cb2]
[mpi031:06729] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c2f21ed1d]
[mpi031:06729] [ 8] datatype/idx_null[0x400a89]
[mpi031:06729] *** End of error message ***
[mpi031:06732] *** Process received signal ***
[mpi031:06732] Signal: Segmentation fault (11)
[mpi031:06732] Signal code: Address not mapped (1)
[mpi031:06732] Failing at address: 0x2ab2aba3cea0
[mpi031:06732] [ 0] /lib64/libpthread.so.0[0x3c2f60f710]
[mpi031:06732] [ 1]
dynamic/loop_spawn :https://mtt.open-mpi.org/index.php?do_redir=2328
[p10a601:159913] too many retries sending message to 0x000b:0x00427ad6, giving up
-------------------------------------------------------
Child job 8 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
---------------------------------------------------------------------------------------------------------------------------------
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post:
http://www.open-mpi.org/community/lists/devel/2016/05/19037.php