Ralph,

Well, my analysis was a bit too superficial ...

ROMIO uses UFS instead of NFS, very likely because of a recent change I made :-(

Please expect a PR soon.


Cheers,


Gilles


On 5/27/2016 12:25 PM, Ralph Castain wrote:
Thanks for analyzing this, Gilles - I guess this is a question for Edgar or someone who cares about MPI-IO. Should we worry about this for 1.10?

I’m inclined not to delay 1.10.3 over this one, but am open to contrary opinions.


On May 26, 2016, at 7:22 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

In my environment, the root cause of MPI_File_open failing seems to be NFS.

MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &lFile);


If the file does not already exist, rank 0 creates it, calls MPI_Bcast(), and then every rank opens the file.

That works fine when all the tasks run on the same node as rank 0, but tasks on other nodes fail when opening the file.
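
For reference, here is a minimal sketch of that pattern (illustrative only; the file name and structure are my assumptions, not the actual test source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, created = 0;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* rank 0 creates the file if it does not already exist */
        FILE *f = fopen("temp", "a");
        if (f != NULL) { fclose(f); created = 1; }
    }
    /* tell the other ranks the file now exists */
    MPI_Bcast(&created, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* every rank then opens the file collectively; on NFS, ranks on
     * other nodes may still see the cached "file does not exist"
     * lookup and fail here */
    MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}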


I ran some more tests and observed fairly consistent behavior:

on n1:

nc -l 6666 && touch temp

on n0:

echo "" | nc n1 6666 ; while true; do date ; ls -l temp && break ; sleep 1; done


On n0, the temp file is found immediately; no problem so far.


Now, if I run

on n1:

nc -l 6666 && touch temp2

on n0:

ls -l temp2; echo "" | nc n1 6666 ; while true; do date ; ls -l temp2 && break ; sleep 1; done


it takes a few iterations before n0 finds temp2.

The only difference is that n0 looked up this file before it was created, and it somehow caches this information (i.e., that the file does not exist); it takes a while before the cache gets updated (i.e., the file now exists).


I cannot remember whether this is expected behavior for NFS, nor whether it can be changed with appropriate tuning.
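
If the NFS client's negative-lookup / attribute caching is the culprit, mount options might help; this is only a guess at possible tuning (a Linux NFS client is assumed, and the server/export path below is made up):

# do not cache negative lookups ("no such file") and keep attribute
# caching very short; exact option support depends on the kernel
mount -o remount,lookupcache=positive,actimeo=1 n1:/export /mnt/shared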


Cheers,


Gilles


On 5/27/2016 10:32 AM, Gilles Gouaillardet wrote:

Ralph,


The cxx_win_attr issue is dealt with in https://github.com/open-mpi/ompi/pull/1473

IIRC, only big-endian platforms and/or platforms where sizeof(Fortran INTEGER) > sizeof(int) are impacted.


The second error seems a bit stranger.

Once in a while, MPI_File_open fails, and when it does, it always fails silently.

In this case (MPI_File_open failed), if --mca mpi_param_check true is set, then subsequent MPI-IO calls also fail silently.

If --mca mpi_param_check false is set (or Open MPI was configure'd with --without-mpi-param-check), then something goes wrong in MPI_File_close.
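
For reference, this is how the parameter checking can be toggled at run time (the test binary name below is just a placeholder):

mpirun --mca mpi_param_check true  -np 4 ./file_open_test
mpirun --mca mpi_param_check false -np 4 ./file_open_test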



That raises several questions ...

- Why is the default MPI-IO behavior to fail silently?
  (point-to-point or collective operations abort by default)

- Why does MPI_File_open fail once in a while?
  (an Open MPI bug? a ROMIO bug? an intermittent failure caused by the NFS filesystem?)

- Is there a bug in the test?
  For example, the program could exit with error code 77 (skip) if MPI_File_open fails; see the sketch below.
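
Here is a rough sketch of both ideas, making the failure visible and skipping the test when the open fails, under the assumption that the test looks roughly like the earlier snippet (this is not the actual test source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rc;

    MPI_Init(&argc, &argv);

    /* One option: make file errors fatal, like point-to-point and
     * collective errors are by default (the MPI standard makes
     * MPI_ERRORS_RETURN the default error handler on files, which is
     * why MPI-IO failures are silent):
     *     MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_ARE_FATAL);
     * The other option, used below: check the return code and exit
     * with code 77 so the run is reported as skipped. */
    rc = MPI_File_open(MPI_COMM_WORLD, "temp",
                       MPI_MODE_RDWR | MPI_MODE_CREATE,
                       MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "MPI_File_open failed, skipping test\n");
        MPI_Abort(MPI_COMM_WORLD, 77);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}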


Cheers,


Gilles


On 5/26/2016 11:14 PM, Ralph Castain wrote:
I’m seeing three errors in MTT today - of these, I only consider the first two to be of significant concern:

onesided/cxx_win_attr: https://mtt.open-mpi.org/index.php?do_redir=2326
[**ERROR**]: MPI_COMM_WORLD rank 0, file cxx_win_attr.cc:50:
Win::Get_attr: Got wrong value for disp unit
--------------------------------------------------------------------------
datatype/idx_null: https://mtt.open-mpi.org/index.php?do_redir=2327
home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x82)[0x2aaaab7ef70a]
[mpi031:06729] [ 2]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x96)[0x2aaaab7ee047]
[mpi031:06729] [ 3]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(+0xd0ed8)[0x2aaaab7eced8]
[mpi031:06729] [ 4]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(ompi_file_close+0x101)[0x2aaaaab2963c]
[mpi031:06729] [ 5]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(PMPI_File_close+0x18)[0x2aaaaab83216]
[mpi031:06729] [ 6] datatype/idx_null[0x400cb2]
[mpi031:06729] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c2f21ed1d]
[mpi031:06729] [ 8] datatype/idx_null[0x400a89]
[mpi031:06729] *** End of error message ***
[mpi031:06732] *** Process received signal ***
[mpi031:06732] Signal: Segmentation fault (11)
[mpi031:06732] Signal code: Address not mapped (1)
[mpi031:06732] Failing at address: 0x2ab2aba3cea0
[mpi031:06732] [ 0] /lib64/libpthread.so.0[0x3c2f60f710]
[mpi031:06732] [ 1]
dynamic/loop_spawn: https://mtt.open-mpi.org/index.php?do_redir=2328
[p10a601:159913] too many retries sending message to 0x000b:0x00427ad6, giving up
-------------------------------------------------------
Child job 8 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---------------------------------------------------------------------------------------------------------------------------------
mp


