I can't swear to this because I haven't fully grokked it yet, but I believe the answer is:

1. If the child jobs have completed, it won't hurt. I think the various subsystems clean up their bookkeeping when a job completes, so we could possibly reuse the number. There might be some race conditions we would have to resolve.

2. If the child jobs haven't completed (which is the situation this particular user was attempting), then we would have a problem with jobid confusion. Once we get the procs launched, though, I'm not sure how much of a problem there is; I would have to investigate. It could cause some bookkeeping problems for job completion.
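
To make the two cases concrete, here is a rough sketch of a 16-bit job counter that wraps around and skips any number still held by a live job. It is not ORTE code: job_counter_t, job_counter_alloc, and job_counter_release are hypothetical names, and it assumes each subsystem releases the number when the job completes, which is exactly the part I haven't verified.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_LOCAL_JOBS 65536u   /* 2^16 values in a 16-bit counter */

/* Hypothetical bookkeeping, not the actual ORTE data structure. */
typedef struct {
    uint16_t next;                  /* next counter value to try */
    bool     live[MAX_LOCAL_JOBS];  /* true while a job still owns that value */
} job_counter_t;

/* Hand out the next free value. Returns false if all 65536 are in use. */
static bool job_counter_alloc(job_counter_t *c, uint16_t *out)
{
    for (uint32_t tries = 0; tries < MAX_LOCAL_JOBS; ++tries) {
        uint16_t candidate = c->next++;   /* wraps from 65535 back to 0 */
        if (!c->live[candidate]) {
            c->live[candidate] = true;
            *out = candidate;
            return true;
        }
        /* Case 2: the value belongs to a job that hasn't completed; skip it. */
    }
    return false;
}

/* Case 1: the job has completed, so its value can be reused. */
static void job_counter_release(job_counter_t *c, uint16_t id)
{
    c->live[id] = false;
}
```

Skipping live values avoids the jobid confusion in case 2, but the sketch says nothing about the race between a job completing and its number being handed out again.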

Interesting possibility, though...consider it another option for now.



On Oct 22, 2008, at 12:53 PM, George Bosilca wrote:

What happens if the counter rolls over?

 george.

On Oct 22, 2008, at 2:49 PM, Ralph Castain wrote:

There recently was activity on the mailing lists where someone was attempting to call comm_spawn 100,000 times. Setting aside the threading issues that were the focus of that exchange, the fact is that OMPI currently cannot handle that many comm_spawns.

The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun

2. the lower 16 bits are a running counter identifying the specific job/launch for those procs.

Thus, we are limited to 64k (2^16 = 65,536) comm_spawns per mpirun.
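
A minimal sketch of that layout, for illustration only. The names jobid_t, make_jobid, jobid_mpirun, and jobid_local are made up here and are not the actual ORTE types or macros; the only thing taken from the description above is the 16/16 split of a 32-bit value.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 32-bit orte_jobid_t. */
typedef uint32_t jobid_t;

#define LOCAL_JOBID_BITS 16u
#define MAX_LOCAL_JOBIDS (1u << LOCAL_JOBID_BITS)   /* 65536 distinct counter values */

/* Pack the mpirun identifier into the top 16 bits, the counter into the lower 16. */
static jobid_t make_jobid(uint16_t mpirun_id, uint16_t local_job)
{
    return ((jobid_t)mpirun_id << LOCAL_JOBID_BITS) | local_job;
}

/* Top 16 bits: which mpirun the job belongs to. */
static uint16_t jobid_mpirun(jobid_t j)
{
    return (uint16_t)(j >> LOCAL_JOBID_BITS);
}

/* Lower 16 bits: the running counter for the specific job/launch. */
static uint16_t jobid_local(jobid_t j)
{
    return (uint16_t)(j & (MAX_LOCAL_JOBIDS - 1u));
}
```

With this split, make_jobid(0x1234, 0xFFFF) yields 0x1234FFFF, and there is no room left in the lower field for another launch under the same mpirun identifier.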

Expanding this would require either revamping the entire way we handle jobs (e.g., removing the mpirun identifier, which would be a major effort), or expanding orte_jobid_t from its current 32 bits to 64 bits.

Is this a problem we want to address?
Ralph

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
