True - and I'm all for simple! Unless someone objects, let's just leave it that way for now.
I'll put it on my list to look at this later - maybe count how many publishes we do vs. unpublishes, and if there is a residual at finalize, then send the "unpublish all" message. Still leaves a race condition, though, so the fail on timeout is always going to have to be there anyway. Will ponder and revisit later...

Thanks!
Ralph

On 4/25/08 7:26 PM, "George Bosilca" <bosi...@eecs.utk.edu> wrote:

> We always have the possibility to fail the MPI_Comm_connect. There is a specific error for this, MPI_ERR_PORT. We can detect that the port is not available anymore (whatever the reason is) by simply using the TCP timeout on the connection. It's the best we can do, and this will give us a simplified way of handling things ...
>
> george.
>
> On Apr 25, 2008, at 7:52 PM, Ralph Castain wrote:
>
>> On 4/25/08 5:38 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
>>
>>> To bounce on George's last remark: currently, when a job dies without unsubscribing a port with Unpublish (due to poor user programming, failure, or abort), ompi-server keeps the reference forever and a new application can therefore not publish under the same name again. So I guess this is a good point to correctly clean up all published/opened ports when the application ends (for whatever reason).
>>
>> That's a good point - in my other note, all I had addressed was closing my local port. We should ensure that the pubsub framework does an unpublish of anything we put out there. I'll have to create a command to do that since pubsub doesn't actually track what it was asked to publish - we'll need something that tells both local and global data servers to "unpublish anything that came from me".
>>
>>> Another cool feature could be to have mpirun behave as an ompi-server, and publish a suitable URI if requested to do so (if the urifile does not exist yet?). I know from the source code that mpirun already includes everything needed to offer this feature, except the ability to provide a suitable URI.
>>
>> Just to be sure I understand, since I think this is doable. Mpirun already does serve as your "ompi-server" for any job it spawns - that is the purpose of the MPI_Info flag "local" instead of "global" when you publish information. You can always publish/lookup against your own mpirun.
>>
>> What you are suggesting here is that we have each mpirun put its local data server port info somewhere that another job can find it, either in the already existing contact_info file, or perhaps in a separate "data server uri" file?
>>
>> The only reason for concern here is the obvious race condition. Since mpirun only exists during the time a job is running, you could look up its contact info and attempt to publish/lookup to that mpirun, only to find it doesn't respond because it either is already dead or on its way out. Hence the notion of restricting inter-job operations to the system-level ompi-server.
>>
>> If we can think of a way to deal with the race condition, I'm certainly willing to publish the contact info. I'm just concerned that you may find yourself "hung" if that mpirun goes away unexpectedly - say right in the middle of a publish/lookup operation.
>>
>> Ralph
>>
>>> Aurelien
>>>
>>> On Apr 25, 2008, at 7:19 PM, George Bosilca wrote:
>>>
>>>> Ralph,
>>>>
>>>> Thanks for your concern regarding the level of compliance of our implementation of the MPI standard. I don't know who the MPI gurus you talked with about this issue were, but I can tell you that for once the MPI standard is pretty clear about this.
>>>>
>>>> As stated by Aurelien in his last email, the use of the plural in several sentences strongly suggests that the status of the port should not be implicitly modified by MPI_Comm_accept or MPI_Comm_connect. Moreover, at the beginning of the chapter in the MPI standard, it is specified that connect/accept work exactly as in TCP. In other words, once the port is opened it stays open until the user explicitly closes it.
>>>>
>>>> However, not all corner cases are addressed by the MPI standard. What happens on MPI_Finalize ... it's a good question. Personally, I think we should stick with the TCP similarities. The port should be not only closed but unpublished. This will solve all issues with people trying to look up a port once the originator is gone.
>>>>
>>>> george.
>>>>
>>>> On Apr 25, 2008, at 5:25 PM, Ralph Castain wrote:
>>>>
>>>>> As I said, it makes no difference to me. I just want to ensure that everyone agrees on the interpretation of the MPI standard. We have had these discussions in the past, with differing views. My guess here is that the port was left open mostly because the person who wrote the C binding forgot to close it. ;-)
>>>>>
>>>>> So, you MPI folks: do we allow multiple connections against a single port, and leave the port open until explicitly closed? If so, then do we generate an error if someone calls MPI_Finalize without first closing the port? Or do we automatically close any open ports when finalize is called?
>>>>>
>>>>> Or do we automatically close the port after the connect/accept is completed?
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On 4/25/08 3:13 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
>>>>>
>>>>>> Actually, the port was still left open forever before the change. The bug damaged the port string, and it was not usable anymore, not only in subsequent Comm_accept, but also in Close_port or Unpublish_name.
>>>>>>
>>>>>> To answer your open-port concern more specifically: if the user does not want to have an open port anymore, he should specifically call MPI_Close_port and not rely on MPI_Comm_accept to close it. Actually, the standard suggests the exact contrary: section 5.4.2 states "it must call MPI_Open_port to establish a port [...] it must call MPI_Comm_accept to accept connections from clients". Because there are multiple clients AND multiple connections in that sentence, I assume the port can be used in multiple accepts.
>>>>>>
>>>>>> Aurelien
>>>>>>
>>>>>> On Apr 25, 2008, at 4:53 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Hmmm... just to clarify, this wasn't a "bug". It was my understanding per the MPI folks that a separate, unique port had to be created for every invocation of Comm_accept. They didn't want a port hanging around open, and their plan was to close the port immediately after the connection was established.
>>>>>>>
>>>>>>> So dpm_orte was written to that specification. When I reorganized the code, I left the logic as it had been written - which was actually done by the MPI side of the house, not me.
>>>>>>>
>>>>>>> I have no problem with making the change. However, since the specification was created on the MPI side, I just want to make sure that the MPI folks all realize this has now been changed. Obviously, if this change in spec is adopted, someone needs to make sure that the C and Fortran bindings -do not- close that port any more!
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>> On 4/25/08 2:41 PM, "boute...@osl.iu.edu" <boute...@osl.iu.edu> wrote:
>>>>>>>
>>>>>>>> Author: bouteill
>>>>>>>> Date: 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
>>>>>>>> New Revision: 18303
>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/18303
>>>>>>>>
>>>>>>>> Log:
>>>>>>>> Fix a bug that prevented using the same port (as returned by Open_port) for several Comm_accept calls.
>>>>>>>>
>>>>>>>> Text files modified:
>>>>>>>>    trunk/ompi/mca/dpm/orte/dpm_orte.c | 19 ++++++++++---------
>>>>>>>>    1 files changed, 10 insertions(+), 9 deletions(-)
>>>>>>>>
>>>>>>>> Modified: trunk/ompi/mca/dpm/orte/dpm_orte.c
>>>>>>>> ==============================================================================
>>>>>>>> --- trunk/ompi/mca/dpm/orte/dpm_orte.c (original)
>>>>>>>> +++ trunk/ompi/mca/dpm/orte/dpm_orte.c 2008-04-25 16:41:44 EDT (Fri, 25 Apr 2008)
>>>>>>>> @@ -848,8 +848,14 @@
>>>>>>>>  {
>>>>>>>>      char *tmp_string, *ptr;
>>>>>>>>
>>>>>>>> +    /* copy the RML uri so we can return a malloc'd value
>>>>>>>> +     * that can later be free'd
>>>>>>>> +     */
>>>>>>>> +    tmp_string = strdup(port_name);
>>>>>>>> +
>>>>>>>>      /* find the ':' demarking the RML tag we added to the end */
>>>>>>>> -    if (NULL == (ptr = strrchr(port_name, ':'))) {
>>>>>>>> +    if (NULL == (ptr = strrchr(tmp_string, ':'))) {
>>>>>>>> +        free(tmp_string);
>>>>>>>>          return NULL;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> @@ -863,15 +869,10 @@
>>>>>>>>      /* see if the length of the RML uri is too long - if so,
>>>>>>>>       * truncate it
>>>>>>>>       */
>>>>>>>> -    if (strlen(port_name) > MPI_MAX_PORT_NAME) {
>>>>>>>> -        port_name[MPI_MAX_PORT_NAME] = '\0';
>>>>>>>> +    if (strlen(tmp_string) > MPI_MAX_PORT_NAME) {
>>>>>>>> +        tmp_string[MPI_MAX_PORT_NAME] = '\0';
>>>>>>>>      }
>>>>>>>> -
>>>>>>>> -    /* copy the RML uri so we can return a malloc'd value
>>>>>>>> -     * that can later be free'd
>>>>>>>> -     */
>>>>>>>> -    tmp_string = strdup(port_name);
>>>>>>>> -
>>>>>>>> +
>>>>>>>>      return tmp_string;
>>>>>>>>  }
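For illustration, here is a minimal server-side sketch of the pattern this change (plus the "port stays open until explicitly closed" reading discussed above) is meant to support. The service name "ocean" and the fixed count of two clients are invented for the example, and error checking is omitted:

#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;
    int i;

    MPI_Init(&argc, &argv);

    /* open one port and advertise it; with r18303 the same port string
     * stays valid across several accepts */
    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("ocean", MPI_INFO_NULL, port);

    for (i = 0; i < 2; i++) {
        /* reuse the same port for each incoming client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
        /* ... exchange data with this client ... */
        MPI_Comm_disconnect(&client);
    }

    /* explicit cleanup before MPI_Finalize: unpublish first so no stale
     * name lingers in ompi-server, then close the port */
    MPI_Unpublish_name("ocean", MPI_INFO_NULL, port);
    MPI_Close_port(port);

    MPI_Finalize();
    return 0;
}

A client would pair this with MPI_Lookup_name and MPI_Comm_connect; if it switches the communicator's error handler to MPI_ERRORS_RETURN, a stale or vanished port can surface as an MPI_ERR_PORT (or lookup failure) return code rather than an abort, which matches the fail-on-timeout behavior George describes above.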