MPI_Comm_connect can always fail, and there is a specific error class for this: MPI_ERR_PORT. We can detect that the port is no longer available (whatever the reason) simply by relying on the TCP timeout on the connection. It's the best we can do, and it gives us a simple way of handling things ...
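
For illustration only, a minimal client-side sketch of that error path, assuming the port string was obtained earlier (e.g. from MPI_Lookup_name); MPI_ERRORS_RETURN is set so the failure is reported to the caller instead of aborting the job. The helper name is hypothetical:

#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: try to connect and report a stale/unreachable port. */
int try_connect(char *port_name, MPI_Comm *newcomm)
{
    int rc, eclass;

    /* Return errors to the caller instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    rc = MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, newcomm);
    if (MPI_SUCCESS != rc) {
        MPI_Error_class(rc, &eclass);
        if (MPI_ERR_PORT == eclass) {
            fprintf(stderr, "port is no longer available\n");
        }
        /* any other class: the connect failed or timed out for another reason */
    }
    return rc;
}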

  george.

On Apr 25, 2008, at 7:52 PM, Ralph Castain wrote:

On 4/25/08 5:38 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:

To bounce off George's last remark: currently, when a job dies without unpublishing a port via Unpublish_name (due to poor user programming, a failure, or an abort), ompi-server keeps the reference forever, so a new application cannot publish under the same name again. So I think this is a good argument for correctly cleaning up all published/opened ports when the application ends (for whatever reason).
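
As a sketch of the calls in question (the service name is illustrative): a well-behaved server pairs its publish with an unpublish and closes the port itself; the point above is that a crash skips exactly the last two calls, which is why the runtime should also perform them on the job's behalf.

char port[MPI_MAX_PORT_NAME];

MPI_Open_port(MPI_INFO_NULL, port);
MPI_Publish_name("my_service", MPI_INFO_NULL, port);
/* ... accept and serve clients ... */

/* The cleanup a crashed or aborted job never reaches -- today this leaves
 * "my_service" registered with ompi-server forever: */
MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
MPI_Close_port(port);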

That's a good point - in my other note, all I had addressed was closing my local port. We should ensure that the pubsub framework does an unpublish of anything we put out there. I'll have to create a command to do that since pubsub doesn't actually track what it was asked to publish - we'll need
something that tells both local and global data servers to "unpublish
anything that came from me".


Another cool feature would be to have mpirun behave as an ompi-server and publish a suitable URI if requested to do so (if the URI file does not exist yet?). I know from the source code that mpirun already includes everything needed to offer this feature, except the ability to provide a suitable URI.

Just to be sure I understand, since I think this is doable: mpirun already serves as your "ompi-server" for any job it spawns - that is the purpose
of the MPI_Info flag "local" instead of "global" when you publish
information. You can always publish/lookup against your own mpirun.
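
For concreteness, a hedged sketch of how a program selects that scope when publishing; the info key name "ompi_global_scope" is an assumption written from memory and should be verified against the MPI_Publish_name man page of your Open MPI version:

char port[MPI_MAX_PORT_NAME];
MPI_Info info;

MPI_Open_port(MPI_INFO_NULL, port);
MPI_Info_create(&info);
/* "true" publishes to the system-level ompi-server (global scope);
 * "false" publishes only within this mpirun (local scope).
 * Key name assumed -- check your Open MPI man pages. */
MPI_Info_set(info, "ompi_global_scope", "true");
MPI_Publish_name("my_service", info, port);
MPI_Info_free(&info);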

What you are suggesting here is that we have each mpirun put its local data
server port info somewhere that another job can find it, either in the
already existing contact_info file, or perhaps in a separate "data server
uri" file?

The only reason for concern here is the obvious race condition. Since mpirun only exists during the time a job is running, you could lookup its contact info and attempt to publish/lookup to that mpirun, only to find it doesn't
respond because it either is already dead or on its way out. Hence the
notion of restricting inter-job operations to the system-level ompi-server.

If we can think of a way to deal with the race condition, I'm certainly willing to publish the contact info. I'm just concerned that you may find yourself "hung" if that mpirun goes away unexpectedly - say right in the
middle of a publish/lookup operation.

Ralph


  Aurelien

On Apr 25, 2008, at 7:19 PM, George Bosilca wrote:

Ralph,

Thanks for your concern regarding the level of compliance of our implementation with the MPI standard. I don't know who the MPI gurus you talked with about this issue were, but I can tell you that, for once, the MPI standard is pretty clear about this.

As stated by Aurelien in his last email, the use of the plural in several sentences strongly suggests that the status of the port should not be implicitly modified by MPI_Comm_accept or MPI_Comm_connect. Moreover, at the beginning of the chapter, the MPI standard specifies that connect/accept work exactly as in TCP. In other words, once the port is opened it stays open until the user explicitly closes it.

However, not all corner cases are addressed by the MPI standard. What happens on MPI_Finalize ... that's a good question. Personally, I think we should stick with the TCP analogy: at that point the port should be not only closed but also unpublished. That would solve all the issues with people trying to look up a port once the originator is gone.
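
A sketch of what the client side would then see, assuming Finalize also removes the published name: the lookup fails cleanly with the MPI_ERR_NAME error class instead of returning a dangling port (the service name is illustrative).

char port[MPI_MAX_PORT_NAME];
int rc, eclass;

/* Errors from name-service calls are raised on MPI_COMM_WORLD. */
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

rc = MPI_Lookup_name("my_service", MPI_INFO_NULL, port);
if (MPI_SUCCESS != rc) {
    MPI_Error_class(rc, &eclass);
    if (MPI_ERR_NAME == eclass) {
        /* the publisher is gone and its name was cleaned up */
    }
}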

george.

On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:

As I said, it makes no difference to me. I just want to ensure that everyone agrees on the interpretation of the MPI standard. We have had these discussions in the past, with differing views. My guess here is that the port was left open mostly because the person who wrote the C binding forgot to close it. ;-)

So, you MPI folks: do we allow multiple connections against a single port, and leave the port open until explicitly closed? If so, then do we generate an error if someone calls MPI_Finalize without first closing the port? Or do we automatically close any open ports when finalize is called?

Or do we automatically close the port after the connect/accept is completed?

Thanks
Ralph



On 4/25/08 3:13 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:

Actually, the port was still left open forever before the change. The bug damaged the port string, leaving it unusable not only in subsequent Comm_accept calls, but also in Close_port or Unpublish_name.

To answer your open-port concern more specifically: if the user does not want to keep an open port, he should explicitly call MPI_Close_port and not rely on MPI_Comm_accept to close it. Actually, the standard suggests the exact contrary: section 5.4.2 states "it must call MPI_Open_port to establish a port [...] it must call MPI_Comm_accept to accept connections from clients". Because there are multiple clients AND multiple connections in that sentence, I assume the port can be used in multiple accepts.
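
That reading corresponds to the usual TCP-like server loop, sketched here under that assumption (loop count and service name are illustrative):

char port[MPI_MAX_PORT_NAME];
MPI_Comm clients[3];
int i;

MPI_Open_port(MPI_INFO_NULL, port);
MPI_Publish_name("my_service", MPI_INFO_NULL, port);

/* The same port string is reused for each incoming client. */
for (i = 0; i < 3; i++) {
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &clients[i]);
}

for (i = 0; i < 3; i++) {
    MPI_Comm_disconnect(&clients[i]);
}
MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
MPI_Close_port(port);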

Aurelien

On Apr 25, 2008, at 4:53 PM, Ralph Castain wrote:

Hmmm... just to clarify, this wasn't a "bug". It was my understanding per the MPI folks that a separate, unique port had to be created for every invocation of Comm_accept. They didn't want a port hanging around open, and their plan was to close the port immediately after the connection was established.

So dpm_orte was written to that specification. When I reorganized the code, I left the logic as it had been written - which was actually done by the MPI side of the house, not me.

I have no problem with making the change. However, since the specification was created on the MPI side, I just want to make sure that the MPI folks all realize this has now been changed. Obviously, if this change in spec is adopted, someone needs to make sure that the C and Fortran bindings do not close that port any more!

Ralph



On 4/25/08 2:41 PM, "boute...@osl.iu.edu" <boute...@osl.iu.edu> wrote:

Author: bouteill
Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
New Revision: 18303
URL: https://svn.open-mpi.org/trac/ompi/changeset/18303

Log:
Fix a bug that prevented using the same port (as returned by Open_port) for several Comm_accept calls.


Text files modified:
trunk/ompi/mca/dpm/orte/dpm_orte.c |    19 ++++++++++---------
1 files changed, 10 insertions(+), 9 deletions(-)

Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
==============================================================================
--- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
+++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
@@ -848,8 +848,14 @@
{
 char *tmp_string, *ptr;

+    /* copy the RML uri so we can return a malloc'd value
+     * that can later be free'd
+     */
+    tmp_string = strdup(port_name);
+
 /* find the ':' demarking the RML tag we added to the end */
-    if (NULL == (ptr = strrchr(port_name, ':'))) {
+    if (NULL == (ptr = strrchr(tmp_string, ':'))) {
+        free(tmp_string);
     return NULL;
 }

@@ -863,15 +869,10 @@
 /* see if the length of the RML uri is too long - if so,
  * truncate it
  */
-    if (strlen(port_name) > MPI_MAX_PORT_NAME) {
-        port_name[MPI_MAX_PORT_NAME] = '\0';
+    if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
+        tmp_string[MPI_MAX_PORT_NAME] = '\0';
 }
-
-    /* copy the RML uri so we can return a malloc'd value
-     * that can later be free'd
-     */
-    tmp_string = strdup(port_name);
-
+
 return tmp_string;
}

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel