Ralph,

i could not find anything wrong with loop_spawn, unless i am missing
something obvious.

from mtt http://mtt.open-mpi.org/index.php?do_redir=2196

all tests run this month (both trunk and v1.8) failed with a timeout, and
there was no error message such as
dpm_base_disconnect_init: error -12 in isend to process 1

loop_spawn tries to spawn 2000 tasks in 10 minutes.
my system is not fast enough to achieve this, so once the time limit is
reached the iteration count is bumped to the end :
/* if time exceeded, then bump iteration count to the end */

the test would succeed in 10 minutes and a few seconds (the extra time is
required to complete the last spawn and MPI_Finalize())

the slurm timeout is set to 10 minutes exactly, so the job is aborted
before it has time to finish (and i believe it would have finished
successfully)
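
to illustrate the timing described above, here is a minimal sketch that
simulates the time-bounded loop with a fake clock. it is not the actual
loop_spawn.c code : simulate_loop, spawn_cost and the exact bump condition are
assumptions made for illustration, and i assume exactly one more spawn runs
after the bump.

```c
/*
 * Hypothetical simulation of a time-bounded spawn loop (not the real
 * loop_spawn.c). Each iteration costs spawn_cost simulated seconds.
 * Returns the total simulated wall time and stores the number of
 * spawns performed in *spawns_done.
 */
static double simulate_loop(int itermax, double nseconds,
                            double spawn_cost, int *spawns_done)
{
    double elapsed = 0.0;
    int done = 0;

    for (int iter = 0; iter < itermax; ++iter) {
        /* stand-in for one spawn and its cleanup */
        elapsed += spawn_cost;
        ++done;

        /* if time exceeded, then bump iteration count to the end;
         * assumption: exactly one more iteration runs after the bump */
        if (elapsed > nseconds && iter < itermax - 2) {
            iter = itermax - 2;
        }
    }

    *spawns_done = done;
    return elapsed;
}
```

with itermax = 2000, nseconds = 600 and a spawn cost of 0.5 second, the
simulated run ends at 601 seconds, just past a 600 second scheduler limit.
with nseconds = 570 it ends at 571 seconds, which leaves room for the last
spawn and MPI_Finalize().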

you can increase the slurm timeout (10min30s looks good to me),
decrease nseconds (570 looks good to me) in loop_spawn.c, or run
mpirun ... dynamic/loop_spawn <nseconds>
where nseconds is "a bit less" than 600 seconds (once again, 570 looks good
to me)
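
for reference, an optional <nseconds> argument like the one above could be
read along these lines. this is only a sketch : parse_nseconds and its
fallback behaviour are hypothetical, and the real loop_spawn.c may handle its
argument differently.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read the time budget (in seconds) from argv[1].
 * Returns fallback when the argument is missing, not a number,
 * not strictly positive, or too large to fit in an int.
 */
static int parse_nseconds(int argc, char *argv[], int fallback)
{
    char *end = NULL;
    long value;

    if (argc < 2) {
        return fallback;
    }

    errno = 0;
    value = strtol(argv[1], &end, 10);
    if (errno != 0 || end == argv[1] || *end != '\0' ||
        value <= 0 || value > INT_MAX) {
        return fallback;
    }
    return (int) value;
}
```

called with a fallback of 600, an argument of "570" yields 570, while a
missing or malformed argument keeps the 600 second default.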

did i miss something ?

Cheers,

Gilles


On Wed, May 28, 2014 at 12:53 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
>
> On 2014/05/28 12:10, Ralph Castain wrote:
> > my understanding is that there are two ways of seeing things :
> > a) the "R-way" : the problem is the parent should not try to communicate
> to already exited processes
> > b) the "J-way" : the problem is the children should have waited either
> in MPI_Comm_free() or MPI_Finalize()
> > I don't think you can use option (b) - we can't have the children
> lingering around for the parent to call finalize, if I'm understanding you
> correctly.
> you understood me correctly.
>
> once again, i did not start investigating loop_spawn.
>
> in the case of intercomm_create, we would not run into this if the
> application had explicitly called MPI_Comm_free in the parent.
> so in this case *only*, and as explained by Jeff, b) could be an option
> to make OpenMPI happy.
> (to be blunt : if the user is not happy with children lingering around,
> he can explicitly call MPI_Comm_free before calling MPI_Comm_disconnect)
>
> i will start investigating loop_spawn from now
>
> Cheers,
>
> Gilles
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14879.php
>
