Sorry for the delay - got tied up. This looks sane to me and should work.
FWIW: I haven't seen any problem with comm_spawn_multiple. That code only
executes if the specified limit is hit, so I suspect this is an issue of scale
and race conditions.
On Jul 12, 2011, at 6:44 PM, Eugene Loh wrote:
> Thanks for the quick and helpful response.
>
> I'd be willing to make the changes if only I were more confident that I
> understood the issues.
>
> What you call the first usage of num_procs_alive seems to count "local procs
> alive or about to be launched". It tests on "child->alive ||
> opal_dss.compare(&job, &(child->name->jobid), ORTE_JOBID)". This is only in
> the !oversubscribed case.
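>
> Roughly speaking, my reading of that computation is the following (just a
> sketch from memory rather than the verbatim code; I'm assuming the usual
> orte_local_children list and the OPAL_EQUAL form of the comparison):
>
>     num_procs_alive = 0;
>     for (item = opal_list_get_first(&orte_local_children);
>          item != opal_list_get_end(&orte_local_children);
>          item = opal_list_get_next(item)) {
>         child = (orte_odls_child_t*)item;
>         /* count procs already running, plus procs belonging to the
>          * job we are about to launch */
>         if (child->alive ||
>             OPAL_EQUAL == opal_dss.compare(&job, &(child->name->jobid), ORTE_JOBID)) {
>             num_procs_alive++;
>         }
>     }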
>
> What you call "the later usage" ("0 < opal_sys_limits.num_procs" or "check
> the system limits") seems to want only the number of alive procs (not
> including any that are about to be launched). So, as I guess you were
> driving at, a fresh computation of num_procs_alive seems to be needed at this
> point, both in v1.5 and in the trunk.
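>
> In other words, right before that check I think we want a freshly computed,
> alive-only count, so that the comparison ends up looking roughly like this
> (only a sketch, reusing the names already in that function; the exact form of
> the test may differ):
>
>     /* num_procs_alive holds only the procs currently alive; the procs we
>      * are about to start come in via app->num_procs */
>     if (0 < opal_sys_limits.num_procs &&
>         opal_sys_limits.num_procs < (int)(num_procs_alive + app->num_procs)) {
>         /* we would exceed the system limit - wait/retry or error out */
>     }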
>
> There is actually a third usage of num_procs_alive ("available file
> descriptors" or "0 < opal_sys_limits.num_files"). I think there are two
> problems here. One is that the number of procs we're about to launch is not
> accounted for. The other, perhaps more important, is that if we "wait a little
> time" and recompute num_procs_alive, we loop over item/child. This is bad
> since we are already inside an item/child loop! (This is actually the origin
> of my foray into this mess... MPI_Comm_spawn_multiple gets screwed up by this
> reuse of variable names.)
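>
> If the recompute-and-retry logic stays where it is, the inner loop at least
> needs its own iterator so the outer item/child survive - something like this
> (again only a sketch; "citem" and "cchild" are placeholder names I made up):
>
>     opal_list_item_t *citem;
>     orte_odls_child_t *cchild;
>
>     num_procs_alive = 0;
>     for (citem = opal_list_get_first(&orte_local_children);
>          citem != opal_list_get_end(&orte_local_children);
>          citem = opal_list_get_next(citem)) {
>         cchild = (orte_odls_child_t*)citem;
>         if (cchild->alive) {
>             num_procs_alive++;
>         }
>     }
>     /* the outer item/child are untouched, so the surrounding launch loop
>      * (and hence comm_spawn_multiple) is not disturbed */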
>
> I'm attaching an original (trunk) file, a proposed version, and the diff
> file. Let me know if it looks sane. If so, I'll try it out and put it back
> and then file a CMR for 1.5.
>
> Some more below...
>
> On 07/12/11 18:19, Ralph Castain wrote:
>
>> Well, yes and no.
>>
>> Yes - the value definitely needs to be computed in all cases since it is
>> used later on.
>>
>> No - the way it is computed is only correct for that one usage. The later
>> usage (in the block that starts with "0 < opal_sys_limits.num_procs") needs a
>> completely different value.
>>
>> The current computation only comes up with the number of procs for the
>> specific job being launched. The latter code requires the total number of
>> local procs that are alive - a rather different value.
>
> The first usage, to compare with num_processors, includes not only procs for
> the job being launched but also already-alive procs.
>
> The second usage, to be compared to opal_sys_limits.num_procs, should only
> have alive procs since app->num_procs will be added later. (Plus, we try to
> keep a running total of this number of alive procs.)
>
> The third usage, to be compared to opal_sys_limits.num_files, should also
> only have alive procs for the same reasons.
>
>> So v1.5 is incorrect in either case. Would you like me to create a patch to
>> correct it, or would you like to do it (covering both cases)?
>
> Trunk also appears to be incorrect.
>
>> On Jul 12, 2011, at 3:35 PM, Eugene Loh wrote:
>>> The function orte_odls_base_default_launch_local() has a variable
>>> num_procs_alive that is basically initialized like this:
>>>
>>> if ( oversubscribed ) {
>>>     ...
>>> } else {
>>>     num_procs_alive = ...;
>>> }
>>>
>>> Specifically, if the "oversubscribed" test passes, the variable is not
>>> initialized.
>>>
>>> (Strictly speaking, this is true only in v1.5. In the trunk, the variable
>>> is set to 0 when it is declared, but I'm not sure that's very helpful.)
>>>
>>> I'm inclined to move the num_procs_alive computation ahead of the "if"
>>> block so that this computation is always performed.
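>>>
>>> i.e., roughly this shape (same sketchy notation as above, not a patch):
>>>
>>>     /* compute this up front so it is set in both branches */
>>>     num_procs_alive = ...;   /* local procs alive or about to be launched */
>>>
>>>     if ( oversubscribed ) {
>>>         ...
>>>     } else {
>>>         /* no longer needs to compute num_procs_alive itself */
>>>         ...
>>>     }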
> <files.tar.gz>