Works for me :-)

On Jul 9, 2011, at 5:43 AM, Jeff Squyres wrote:

> Leaving out many details, I think the arguments can be summarized as:
> 
> 1. Ralph's argument is that per convention of our other 2 layers, 
> "<foo>_finalize" should unconditionally finalize the layer.  Just do it.  
> It's also weird that opal_finalize() may actually do *nothing* (vs. 
> finalizing at least all of its stuff but leave OPAL util stuff initialized) 
> -- this is not symmetric.
> 
> 2. George's argument is that for API symmetry, if you call opal_init_util, 
> then opal shouldn't be finalized until opal_finalize_util is invoked.  Plus, 
> we may want to use OPAL utils after opal_finalize someday (note that we don't 
> do this today).
> 
> How about a compromise?
> 
> - Take what is (essentially) in opal_init() today and rename it to be 
> opal_init_frameworks() -- because it's (mostly) initializing the OPAL MCA 
> frameworks.
> 
> - Take what is (essentially) in opal_finalize() today and rename it to be 
> opal_finalize_frameworks() -- because it's (mostly) finalizing the OPAL MCA 
> frameworks.  Remove the call to opal_finalize_util() from this function.
> 
> - Remove all use of counters; calling opal_init*() will initialize (unless it 
> has already been initialized), and calling opal_finalize*() will finalize 
> (unless it has already been finalized).
> 
> - Create a new opal_init() function that is a wrapper around opal_init_util() 
> and opal_init_frameworks().  Create a new opal_finalize() function that is a 
> wrapper around opal_finalize_util() and opal_finalize_frameworks().
> 
> - orte_finalize() will call opal_finalize() -- i.e., it will unconditionally 
> shut down all of OPAL.  This will remove the need for opal_finalize_util() in 
> the MPI layer.
> 
> This seems to give all desired behaviors:
> 
> - All <foo>_finalize() functions will be unconditional.  The Law of Least 
> Surprise is preserved.
> 
> - There are paths for split init and split finalize and combined init and 
> combined finalize.  They can even be combined (e.g., split init and combined 
> finalize -- which will be a common case, actually).
> 
> If we ever want to use OPAL utility behavior after orte_finalize() someday, 
> we can.  E.g., we can pass a flag to orte_finalize() saying "use 
> opal_finalize_frameworks() instead of opal_finalize()", or perhaps even 
> "don't finalize OPAL at all."
> 
> 
> 
> On Jul 8, 2011, at 11:57 AM, George Bosilca wrote:
> 
>> 
>> On Jul 8, 2011, at 16:15 , Ralph Castain wrote:
>> 
>>>> So we have opal_init * 1 and opal_util * 2. Clearly the opal util is not a 
>>>> simple ON/OFF stuff. With Ralph patch the OPAL utilities will disappear as 
>>>> soon as the OMPI layer call orte_fini. Luckily, today there is nothing 
>>>> between the call to orte_fini and opal_finalize_util, so we're safe from a 
>>>> segfault.
>>> 
>>> The point is that you shouldn't be calling opal_finalize_util separately. 
>>> We do so now only because of the counter - there is no reason for doing it 
>>> separately otherwise.
>> 
>> Absolutely not, we do so for consistency. If as a software layer have to 
>> explicitly call the opal util initialization function (in order to access 
>> some features), then it should __explicitly__ state when it doesn't need it 
>> anymore (instead of relying on some other layer will do the right thing for 
>> me).
>> 
>>> In other words, we created a counter, and then modified the code to make 
>>> the counter work. There is no reason for it to exist as there is no use of 
>>> the opal utilities following the call to orte_finalize.
>> 
>> It happens today that this is not the case, which doesn't means 1) nobody 
>> will ever do it; 2) it is correct to just assume you can release it 
>> somewhere else; 3) assume a bool is equivalent to a counter.
>> 
>>>> Moreover, from a software engineering point of view there are two choices 
>>>> for allowing library composition (ORTE using OPAL, OMPI using ORTE and 
>>>> OPAL, something else using OMPI and ORTE and OPAL). Either you do the 
>>>> management at the lowest level using counters, or you provide accessors to 
>>>> check the init/fini state of the library and do the management at the 
>>>> upper level (similar to the MPI library). In Open MPI and this for the 
>>>> last 7 years we chose the first approach. And so far there was no 
>>>> compelling case to switch.
>>> 
>>> Yes there was - we just never checked it. None of the tools were calling 
>>> opal_finalize multiple times. There was an inherent understanding that 
>>> calling orte_finalize would shut everything down. This wasn't the case 
>>> because this hidden counter wasn't getting zero'd, and so opal_finalize 
>>> never actually executed.
>> 
>> I dont get it. Why do a tool has to call the opal_finalize function multiple 
>> times? Instead, each layer should call it as many time as it called the 
>> corresponding initialization function, and because each layer is supposed to 
>> get initialized and finalized a equivalent number of times everything will 
>> just work.
>> 
>> The modification in your commit created two different behavior, one for 
>> software using ORTE (which can safely assume everything was teared down 
>> after orte_fini and can avoid calling the opal_finalize_util) and one for 
>> every other software that doesn't use ORTE and therefore has to call 
>> opal_finalize_util as many times as it called the corresponding init 
>> function.
>> 
>>> Now imagine there is an abnormal termination. You can't know for sure where 
>>> it occurs - did we increment the counter already, or not? So how many times 
>>> do I have to call opal_finalize and opal_finalize_util to get them to 
>>> actually execute?
>> 
>> First I'll say that if it's only for abnormal termination, I don't really 
>> care about not having memory leaks.   Now let's assume we do care about 
>> memory leaks. First there are many process data left around, the job map the 
>> modex info, countless other things that are significantly more difficult to 
>> cleanup than the opal util. And then, as I saidf before each layer should 
>> call the fini function exactly the same number of times it called the 
>> corresponding init.
>> 
>>> The way things sat, I could only loop over opal_finalize and 
>>> opal_finalize_util until I got back an error indicating it had finally 
>>> executed. That is plain ugly.
>>> 
>>> It isn't a big deal, but creates a hidden 'gotcha' that results in some 
>>> ugly code to compensate if you want to cleanly terminate under all 
>>> conditions. If you have a compelling case where someone needs to access the 
>>> opal utils -after- having called orte_finalize or opal_finalize, then I 
>>> would welcome hearing about it.
>> 
>> We did not have to do any of this in the MPI layer, and we did have a 
>> correct handling of this issue. 
>> 
>> george.
>> 
>> PS: Small reminder in case we decide to withdraw this change: r24862 and 
>> r24864 are now related.
>> 
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to