Ralph,
well, my analysis was a bit too superficial ...
ROMIO uses UFS instead of NFS, very likely caused by a recent change i
made :-(
please expect a PR soon
Cheers,
Gilles
On 5/27/2016 12:25 PM, Ralph Castain wrote:
Thanks for analyzing this, Gilles - I guess this is a question for
Edgar or someone who cares about MPI-IO. Should we worry about this
for 1.10?
I’m inclined to not delay 1.10.3 over this one, but am open to
contrary opinions
On May 26, 2016, at 7:22 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
In my environment, the root cause of the MPI_File_open failure seems to
be NFS.
MPI_File_open(MPI_COMM_WORLD, "temp", MPI_MODE_RDWR | MPI_MODE_CREATE,
MPI_INFO_NULL, &lFile);
if the file does not already exist, rank 0 creates the file,
MPI_Bcast()s, and then every rank opens the file.
that works fine when all the tasks run on the same node as rank 0, but
ranks on other nodes fail when opening the file.
i ran some more tests and observed quite consistent behavior:
on n1:
nc -l 6666 && touch temp
on n0:
echo "" | nc n1 6666 ; while true; do date ; ls -l temp && break ;
sleep 1; done
on n0, the temp file is found immediately; no problem so far.
now, if i run
on n1:
nc -l 6666 && touch temp2
on n0:
ls -l temp2; echo "" | nc n1 6666 ; while true; do date ; ls -l temp2
&& break ; sleep 1; done
it takes a few iterations before n0 finds temp2.
the only difference is that n0 looked up the file before it was
created, and it somehow caches that information
(i.e. that the file does not exist); it takes a while before the cache
gets updated (i.e. the file now exists).
i cannot remember whether this is expected NFS behavior (it looks like
client-side attribute caching), nor whether it can be changed with
appropriate mount options or tuning.
Cheers,
Gilles
On 5/27/2016 10:32 AM, Gilles Gouaillardet wrote:
Ralph,
the cxx_win_attr issue is dealt at
https://github.com/open-mpi/ompi/pull/1473
iirc, only big-endian systems and/or systems where
sizeof(Fortran INTEGER) > sizeof(int) are impacted.
the second error seems a bit weirder.
once in a while, MPI_File_open fails, and when it fails, it always
fails silently.
in this case (MPI_File_open failed), if --mca mpi_param_check true,
then subsequent MPI-IO calls will also fail silently.
if --mca mpi_param_check false (or Open MPI was configure'd with
--without-mpi-param-check),
then something goes wrong in MPI_File_close.
that raises several questions ...
- why is the default MPI-IO behavior to fail silently ?
(point-to-point or collective operations abort by default)
- why does MPI_File_open fail once in a while ?
(an Open MPI bug ? a ROMIO bug ? an intermittent failure caused by the
NFS filesystem ?)
- is there a bug in the test ?
for example, the program could abort with error code 77 (skip) if
MPI_File_open fails
Cheers,
Gilles
On 5/26/2016 11:14 PM, Ralph Castain wrote:
I’m seeing three errors in MTT today - of these, I only consider
the first two to be of significant concern:
onesided/cxx_win_attr: https://mtt.open-mpi.org/index.php?do_redir=2326
[**ERROR**]: MPI_COMM_WORLD rank 0, file cxx_win_attr.cc:50:
Win::Get_attr: Got wrong value for disp unit
--------------------------------------------------------------------------
datatype/idx_null: https://mtt.open-mpi.org/index.php?do_redir=2327
home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x82)[0x2aaaab7ef70a]
[mpi031:06729] [ 2]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x96)[0x2aaaab7ee047]
[mpi031:06729] [ 3]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libopen-pal.so.13(+0xd0ed8)[0x2aaaab7eced8]
[mpi031:06729] [ 4]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(ompi_file_close+0x101)[0x2aaaaab2963c]
[mpi031:06729] [ 5]
/home/mpiteam/scratches/community/2016-05-25cron/56jr/installs/i0Lt/install/lib/libmpi.so.12(PMPI_File_close+0x18)[0x2aaaaab83216]
[mpi031:06729] [ 6] datatype/idx_null[0x400cb2]
[mpi031:06729] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c2f21ed1d]
[mpi031:06729] [ 8] datatype/idx_null[0x400a89]
[mpi031:06729] *** End of error message ***
[mpi031:06732] *** Process received signal ***
[mpi031:06732] Signal: Segmentation fault (11)
[mpi031:06732] Signal code: Address not mapped (1)
[mpi031:06732] Failing at address: 0x2ab2aba3cea0
[mpi031:06732] [ 0] /lib64/libpthread.so.0[0x3c2f60f710]
[mpi031:06732] [ 1]
dynamic/loop_spawn: https://mtt.open-mpi.org/index.php?do_redir=2328
[p10a601:159913] too many retries sending message to 0x000b:0x00427ad6, giving
up
-------------------------------------------------------
Child job 8 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
---------------------------------------------------------------------------------------------------------------------------------
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post:
http://www.open-mpi.org/community/lists/devel/2016/05/19037.php