Works for me :-)

On Jul 9, 2011, at 5:43 AM, Jeff Squyres wrote:

> Leaving out many details, I think the arguments can be summarized as:
>
> 1. Ralph's argument is that per convention of our other 2 layers,
> "<foo>_finalize" should unconditionally finalize the layer. Just do it.
> It's also weird that opal_finalize() may actually do *nothing* (vs.
> finalizing at least all of its stuff but leaving OPAL util stuff initialized)
> -- this is not symmetric.
>
> 2. George's argument is that for API symmetry, if you call opal_init_util,
> then opal shouldn't be finalized until opal_finalize_util is invoked. Plus,
> we may want to use OPAL utils after opal_finalize someday (note that we don't
> do this today).
>
> How about a compromise?
>
> - Take what is (essentially) in opal_init() today and rename it to be
> opal_init_frameworks() -- because it's (mostly) initializing the OPAL MCA
> frameworks.
>
> - Take what is (essentially) in opal_finalize() today and rename it to be
> opal_finalize_frameworks() -- because it's (mostly) finalizing the OPAL MCA
> frameworks. Remove the call to opal_finalize_util() from this function.
>
> - Remove all use of counters; calling opal_init*() will initialize (unless it
> has already been initialized), and calling opal_finalize*() will finalize
> (unless it has already been finalized).
>
> - Create a new opal_init() function that is a wrapper around opal_init_util()
> and opal_init_frameworks(). Create a new opal_finalize() function that is a
> wrapper around opal_finalize_util() and opal_finalize_frameworks().
>
> - orte_finalize() will call opal_finalize() -- i.e., it will unconditionally
> shut down all of OPAL. This will remove the need for opal_finalize_util() in
> the MPI layer.
>
> This seems to give all desired behaviors:
>
> - All <foo>_finalize() functions will be unconditional. The Law of Least
> Surprise is preserved.
>
> - There are paths for split init and split finalize and combined init and
> combined finalize.
> They can even be combined (e.g., split init and combined
> finalize -- which will be a common case, actually).
>
> If we ever want to use OPAL utility behavior after orte_finalize() someday,
> we can. E.g., we can pass a flag to orte_finalize() saying "use
> opal_finalize_frameworks() instead of opal_finalize()", or perhaps even
> "don't finalize OPAL at all."
>
> On Jul 8, 2011, at 11:57 AM, George Bosilca wrote:
>
>> On Jul 8, 2011, at 16:15 , Ralph Castain wrote:
>>
>>>> So we have opal_init * 1 and opal_util * 2. Clearly the opal util is not a
>>>> simple ON/OFF thing. With Ralph's patch the OPAL utilities will disappear as
>>>> soon as the OMPI layer calls orte_fini. Luckily, today there is nothing
>>>> between the call to orte_fini and opal_finalize_util, so we're safe from a
>>>> segfault.
>>>
>>> The point is that you shouldn't be calling opal_finalize_util separately.
>>> We do so now only because of the counter - there is no reason for doing it
>>> separately otherwise.
>>
>> Absolutely not, we do so for consistency. If a software layer has to
>> explicitly call the opal util initialization function (in order to access
>> some features), then it should __explicitly__ state when it doesn't need it
>> anymore (instead of relying on some other layer to do the right thing for
>> it).
>>
>>> In other words, we created a counter, and then modified the code to make
>>> the counter work. There is no reason for it to exist as there is no use of
>>> the opal utilities following the call to orte_finalize.
>>
>> It happens today that this is not the case, which doesn't mean that 1) nobody
>> will ever do it; 2) it is correct to just assume you can release it
>> somewhere else; 3) a bool is equivalent to a counter.
>>
>>>> Moreover, from a software engineering point of view there are two choices
>>>> for allowing library composition (ORTE using OPAL, OMPI using ORTE and
>>>> OPAL, something else using OMPI and ORTE and OPAL).
>>>> Either you do the
>>>> management at the lowest level using counters, or you provide accessors to
>>>> check the init/fini state of the library and do the management at the
>>>> upper level (similar to the MPI library). In Open MPI, for the
>>>> last 7 years, we have chosen the first approach. And so far there has been no
>>>> compelling case to switch.
>>>
>>> Yes there was - we just never checked it. None of the tools were calling
>>> opal_finalize multiple times. There was an inherent understanding that
>>> calling orte_finalize would shut everything down. This wasn't the case
>>> because this hidden counter wasn't getting zero'd, and so opal_finalize
>>> never actually executed.
>>
>> I don't get it. Why does a tool have to call the opal_finalize function multiple
>> times? Instead, each layer should call it as many times as it called the
>> corresponding initialization function, and because each layer is supposed to
>> get initialized and finalized an equal number of times everything will
>> just work.
>>
>> The modification in your commit created two different behaviors: one for
>> software using ORTE (which can safely assume everything was torn down
>> after orte_fini and can avoid calling opal_finalize_util) and one for
>> every other software that doesn't use ORTE and therefore has to call
>> opal_finalize_util as many times as it called the corresponding init
>> function.
>>
>>> Now imagine there is an abnormal termination. You can't know for sure where
>>> it occurs - did we increment the counter already, or not? So how many times
>>> do I have to call opal_finalize and opal_finalize_util to get them to
>>> actually execute?
>>
>> First I'll say that if it's only for abnormal termination, I don't really
>> care about not having memory leaks. Now let's assume we do care about
>> memory leaks.
>> First, there is a lot of process data left around: the job map, the
>> modex info, countless other things that are significantly more difficult to
>> clean up than the opal util. And then, as I said before, each layer should
>> call the fini function exactly the same number of times it called the
>> corresponding init.
>>
>>> The way things sat, I could only loop over opal_finalize and
>>> opal_finalize_util until I got back an error indicating it had finally
>>> executed. That is plain ugly.
>>>
>>> It isn't a big deal, but creates a hidden 'gotcha' that results in some
>>> ugly code to compensate if you want to cleanly terminate under all
>>> conditions. If you have a compelling case where someone needs to access the
>>> opal utils -after- having called orte_finalize or opal_finalize, then I
>>> would welcome hearing about it.
>>
>> We did not have to do any of this in the MPI layer, and we did have a
>> correct handling of this issue.
>>
>> george.
>>
>> PS: Small reminder in case we decide to withdraw this change: r24862 and
>> r24864 are now related.
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
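
For reference, here is a rough sketch of how the counter-free compromise proposed in this thread might look in C. It is only an illustration under stated assumptions: the sketch_ prefix, the argument-free signatures, the SKETCH_SUCCESS return code, and the two boolean flags are all hypothetical and are not taken from the Open MPI source tree. The order used inside the combined finalize (frameworks first, then utilities) is also an assumption; the thread does not specify it.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the functions named in this thread. The
 * sketch_ prefix and the argument-free signatures are assumptions made
 * for illustration; they are not Open MPI's real declarations. */
#define SKETCH_SUCCESS 0

static bool util_up = false;        /* utility code initialized?   */
static bool frameworks_up = false;  /* MCA frameworks initialized? */

/* Initialize the utility code unless it is already initialized. */
int sketch_opal_init_util(void)
{
    if (!util_up) {
        /* real utility setup would go here */
        util_up = true;
    }
    return SKETCH_SUCCESS;
}

/* Initialize the frameworks unless they are already initialized. */
int sketch_opal_init_frameworks(void)
{
    if (!frameworks_up) {
        /* real framework open/select calls would go here */
        frameworks_up = true;
    }
    return SKETCH_SUCCESS;
}

/* Finalize the frameworks unless already finalized. Note that this
 * no longer calls the util finalize, as the proposal requests. */
int sketch_opal_finalize_frameworks(void)
{
    if (frameworks_up) {
        /* real framework close calls would go here */
        frameworks_up = false;
    }
    return SKETCH_SUCCESS;
}

/* Finalize the utility code unless already finalized. */
int sketch_opal_finalize_util(void)
{
    if (util_up) {
        /* real utility teardown would go here */
        util_up = false;
    }
    return SKETCH_SUCCESS;
}

/* Combined init: a thin wrapper around the two split init calls. */
int sketch_opal_init(void)
{
    int rc = sketch_opal_init_util();
    if (rc != SKETCH_SUCCESS) {
        return rc;
    }
    return sketch_opal_init_frameworks();
}

/* Combined finalize: unconditionally shuts everything down.
 * Assumed order: frameworks first, then the utilities. */
int sketch_opal_finalize(void)
{
    int rc = sketch_opal_finalize_frameworks();
    if (rc != SKETCH_SUCCESS) {
        return rc;
    }
    return sketch_opal_finalize_util();
}
```

With this shape, a layer that did a split init (sketch_opal_init_util() followed later by sketch_opal_init_frameworks()) can still be torn down by a single sketch_opal_finalize(), and calling any finalize a second time is a harmless no-op. What it gives up is the counter semantics argued for earlier in the thread: the first finalize wins, regardless of how many callers initialized.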