On May 26, 2012, at 10:10 AM, Eugene Loh wrote:

> I'm suspicious of some code, but would like comment from someone who 
> understands it.
> 
> In orte/util/nidmap.c orte_util_decode_pidmap(), one cycles through a buffer. 
>  One cycles through jobs.  For each one, one unpacks num_procs.  One also 
> unpacks all sorts of other stuff like bind_idx.  In particular, there's
> 
>    orte_process_info.bind_idx = bind_idx[ORTE_PROC_MY_NAME->vpid];
> 
> Well, if we spawn a job with more processes than the parent job, we could 
> have vpid >= num_procs and we read garbage which could and I think does lead 
> to some less-than-enjoyable experiences later on.
> 
> Yes/no/fix?

Well, actually it's a bit of all three :-/

First, you have to remember that we do NOT update pidmaps in application procs. 
So procs in the parent job only see the initial pidmap that contains only their 
own job - they never see the pidmap of their children. Thus, their data is 
correct.

The child job will see both pidmaps. However, the values being set in 
orte_process_info are being overwritten each time the code parses the data for 
a job. Since the jobs are recorded (and hence, parsed) in order, and the last 
job is the one a proc actually belongs to, the values being set actually turn 
out to be correct.

Still, the code really isn't right (especially once we begin to update pidmaps, 
which is coming soon) and merited a fix, so I committed one (r26498).
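
For anyone following along, here is a minimal sketch of the kind of guard 
that avoids the problem. It is not necessarily what r26498 does. The types 
and the function name (job_map_t, proc_name_t, lookup_my_bind_idx) are 
hypothetical, simplified stand-ins rather than the real ORTE structures. The 
idea is to take a value only from the job the proc actually belongs to, and 
to check the vpid against that job's num_procs before indexing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are NOT the real ORTE types. */
typedef struct {
    uint32_t jobid;           /* job this map entry describes */
    uint32_t num_procs;       /* number of procs in that job */
    const uint8_t *bind_idx;  /* per-proc bind index, num_procs entries */
} job_map_t;

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} proc_name_t;

/*
 * Look up the caller's bind index across all decoded jobs.
 * Returns 0 and stores the value in *out on success.
 * Returns -1 if the caller's job is absent or its vpid is out of range.
 */
int lookup_my_bind_idx(const job_map_t *jobs, size_t num_jobs,
                       proc_name_t me, uint8_t *out)
{
    assert(jobs != NULL && out != NULL);

    for (size_t i = 0; i < num_jobs; i++) {
        /* Ignore data for jobs we do not belong to (e.g. the parent job). */
        if (jobs[i].jobid != me.jobid) {
            continue;
        }
        /* Never index past the array that was unpacked for this job. */
        if (me.vpid >= jobs[i].num_procs) {
            return -1;
        }
        *out = jobs[i].bind_idx[me.vpid];
        return 0;
    }
    return -1;
}
```

With a guard like this, the result no longer depends on the order in which 
jobs happen to be parsed, and a child proc whose vpid exceeds the parent's 
num_procs never reads past the end of the parent's array.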

Thanks
Ralph

> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

