Leaving out many details, I think the arguments can be summarized as:

1. Ralph's argument is that, per the convention of our other 2 layers,
"<foo>_finalize" should unconditionally finalize the layer.  Just do it.  It's
also weird that opal_finalize() may actually do *nothing* (vs. finalizing at
least all of its own stuff while leaving the OPAL util stuff initialized) --
this is not symmetric.

2. George's argument is that for API symmetry, if you call opal_init_util, then 
opal shouldn't be finalized until opal_finalize_util is invoked.  Plus, we may 
want to use OPAL utils after opal_finalize someday (note that we don't do this 
today).

How about a compromise?

- Take what is (essentially) in opal_init() today and rename it to be 
opal_init_frameworks() -- because it's (mostly) initializing the OPAL MCA 
frameworks.

- Take what is (essentially) in opal_finalize() today and rename it to be 
opal_finalize_frameworks() -- because it's (mostly) finalizing the OPAL MCA 
frameworks.  Remove the call to opal_finalize_util() from this function.

- Remove all use of counters; calling opal_init*() will initialize (unless it 
has already been initialized), and calling opal_finalize*() will finalize 
(unless it has already been finalized).

- Create a new opal_init() function that is a wrapper around opal_init_util() 
and opal_init_frameworks().  Create a new opal_finalize() function that is a 
wrapper around opal_finalize_util() and opal_finalize_frameworks().

- orte_finalize() will call opal_finalize() -- i.e., it will unconditionally 
shut down all of OPAL.  This will remove the need for opal_finalize_util() in 
the MPI layer.
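
To make the proposal above concrete, here is a minimal sketch in C.  It is not
the real OPAL code: every name carries a hypothetical "sketch_" prefix, the two
booleans stand in for whatever state OPAL actually tracks, and tearing down the
frameworks before the utilities (the reverse of init order) is my assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for OPAL's internal state; booleans, not counters. */
static bool util_initialized = false;
static bool frameworks_initialized = false;

/* Each call initializes unless that has already been done. */
int sketch_opal_init_util(void)
{
    if (util_initialized) return 0;
    util_initialized = true;        /* real code would set up the utilities */
    return 0;
}

int sketch_opal_init_frameworks(void)
{
    if (frameworks_initialized) return 0;
    frameworks_initialized = true;  /* real code would open MCA frameworks */
    return 0;
}

/* Each call finalizes unless that has already been done. */
int sketch_opal_finalize_frameworks(void)
{
    if (!frameworks_initialized) return 0;
    frameworks_initialized = false;
    return 0;
}

int sketch_opal_finalize_util(void)
{
    if (!util_initialized) return 0;
    util_initialized = false;
    return 0;
}

/* The combined entry points are thin wrappers around the split ones. */
int sketch_opal_init(void)
{
    int rc = sketch_opal_init_util();
    if (rc != 0) return rc;
    return sketch_opal_init_frameworks();
}

int sketch_opal_finalize(void)
{
    /* Assumption: frameworks go down first, then the utilities. */
    int rc = sketch_opal_finalize_frameworks();
    if (rc != 0) return rc;
    return sketch_opal_finalize_util();
}

/* Split init followed by a single, unconditional combined finalize. */
void sketch_demo(void)
{
    sketch_opal_init_util();
    sketch_opal_init_frameworks();
    sketch_opal_init();             /* already initialized: does nothing */
    sketch_opal_finalize();         /* one call shuts everything down */
    assert(!util_initialized && !frameworks_initialized);
}
```

Because there is no counter, one finalize call is always enough, no matter how
many init calls came before it; a second finalize call is simply a no-op.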

This seems to give all desired behaviors:

- All <foo>_finalize() functions will be unconditional.  The Law of Least 
Surprise is preserved.

- There are paths for split init and split finalize, and for combined init and 
combined finalize.  They can even be mixed (e.g., split init and combined 
finalize -- which will be a common case, actually).

If we ever want to use OPAL utility behavior after orte_finalize(), we 
can.  E.g., we can pass a flag to orte_finalize() saying "use 
opal_finalize_frameworks() instead of opal_finalize()", or perhaps even "don't 
finalize OPAL at all."
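
A rough sketch of what such a flag could look like, in C.  Nothing like this
exists in ORTE today; the enum, its values, and all of the "sketch_" / "sk_"
names are hypothetical, and the two booleans only stand in for real OPAL state.

```c
#include <stdbool.h>

/* Hypothetical flag telling the runtime how much of OPAL to tear down. */
typedef enum {
    SKETCH_FINALIZE_ALL,         /* default: shut down all of OPAL */
    SKETCH_FINALIZE_FRAMEWORKS,  /* leave the OPAL utilities initialized */
    SKETCH_FINALIZE_NONE         /* do not finalize OPAL at all */
} sketch_opal_teardown_t;

/* Stand-ins for OPAL state and entry points (not the real functions). */
bool sk_util_up = false;
bool sk_frameworks_up = false;

void sk_opal_init(void)                { sk_util_up = true; sk_frameworks_up = true; }
void sk_opal_finalize_frameworks(void) { sk_frameworks_up = false; }
void sk_opal_finalize_util(void)       { sk_util_up = false; }

int sketch_orte_finalize(sketch_opal_teardown_t mode)
{
    /* ... the runtime layer's own teardown would happen here ... */
    switch (mode) {
    case SKETCH_FINALIZE_ALL:
        sk_opal_finalize_frameworks();
        sk_opal_finalize_util();
        break;
    case SKETCH_FINALIZE_FRAMEWORKS:
        sk_opal_finalize_frameworks();
        break;
    case SKETCH_FINALIZE_NONE:
        break;
    }
    return 0;
}
```

The default stays unconditional; a caller that still needs the utilities has to
ask for that explicitly and then call the util finalize itself later.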



On Jul 8, 2011, at 11:57 AM, George Bosilca wrote:

> 
> On Jul 8, 2011, at 16:15 , Ralph Castain wrote:
> 
>>> So we have opal_init * 1 and opal_util * 2. Clearly the opal util is not a 
>>> simple ON/OFF thing. With Ralph's patch the OPAL utilities will disappear as 
>>> soon as the OMPI layer calls orte_fini. Luckily, today there is nothing 
>>> between the call to orte_fini and opal_finalize_util, so we're safe from a 
>>> segfault.
>> 
>> The point is that you shouldn't be calling opal_finalize_util separately. We 
>> do so now only because of the counter - there is no reason for doing it 
>> separately otherwise.
> 
> Absolutely not, we do so for consistency. If a software layer has to 
> explicitly call the opal util initialization function (in order to access 
> some features), then it should __explicitly__ state when it doesn't need it 
> anymore (instead of relying on some other layer to do the right thing for 
> it).
> 
>> In other words, we created a counter, and then modified the code to make the 
>> counter work. There is no reason for it to exist as there is no use of the 
>> opal utilities following the call to orte_finalize.
> 
> It happens that there is no such use today, which doesn't mean that 1) nobody 
> will ever do it; 2) it is correct to just assume you can release it somewhere 
> else; or 3) a bool is equivalent to a counter.
> 
>>> Moreover, from a software engineering point of view there are two choices 
>>> for allowing library composition (ORTE using OPAL, OMPI using ORTE and 
>>> OPAL, something else using OMPI and ORTE and OPAL). Either you do the 
>>> management at the lowest level using counters, or you provide accessors to 
>>> check the init/fini state of the library and do the management at the upper 
>>> level (similar to the MPI library). In Open MPI, for the last 7 years, we 
>>> have chosen the first approach, and so far there has been no compelling 
>>> case to switch.
>> 
>> Yes there was - we just never checked it. None of the tools were calling 
>> opal_finalize multiple times. There was an inherent understanding that 
>> calling orte_finalize would shut everything down. This wasn't the case 
>> because this hidden counter wasn't getting zero'd, and so opal_finalize 
>> never actually executed.
> 
> I don't get it. Why does a tool have to call the opal_finalize function 
> multiple times? Instead, each layer should call it as many times as it called 
> the corresponding initialization function, and because each layer is supposed 
> to get initialized and finalized an equal number of times, everything will 
> just work.
> 
> The modification in your commit created two different behaviors: one for 
> software using ORTE (which can safely assume everything was torn down after 
> orte_fini and can avoid calling opal_finalize_util) and one for every 
> other software that doesn't use ORTE and therefore has to call 
> opal_finalize_util as many times as it called the corresponding init function.
> 
>> Now imagine there is an abnormal termination. You can't know for sure where 
>> it occurs - did we increment the counter already, or not? So how many times 
>> do I have to call opal_finalize and opal_finalize_util to get them to 
>> actually execute?
> 
> First I'll say that if it's only for abnormal termination, I don't really 
> care about memory leaks.  Now let's assume we do care about memory leaks. 
> First, there is a lot of process data left around (the job map, the modex 
> info, countless other things) that is significantly more difficult to 
> clean up than the opal util. And then, as I said before, each layer should 
> call the fini function exactly the same number of times it called the 
> corresponding init.
> 
>> The way things sat, I could only loop over opal_finalize and 
>> opal_finalize_util until I got back an error indicating it had finally 
>> executed. That is plain ugly.
>> 
>> It isn't a big deal, but creates a hidden 'gotcha' that results in some ugly 
>> code to compensate if you want to cleanly terminate under all conditions. If 
>> you have a compelling case where someone needs to access the opal utils 
>> -after- having called orte_finalize or opal_finalize, then I would welcome 
>> hearing about it.
> 
> We did not have to do any of this in the MPI layer, and we did have a correct 
> handling of this issue. 
> 
>  george.
> 
> PS: Small reminder in case we decide to withdraw this change: r24862 and 
> r24864 are now related.
> 
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

