I can't swear to this because I haven't fully grokked it yet, but I
believe the answer is:
1. if child jobs have completed, it won't hurt. I think the various
subsystems clean up their bookkeeping when a job completes, so we could
possibly reuse the number. There might be some race conditions we would
have to resolve.
2. if child jobs haven't completed (which is the situation this
particular user was attempting), then we would have a problem with
jobid confusion. Once we get the procs launched, though, I'm not sure
how much of a problem there is - would have to investigate. Could
cause some bookkeeping problems for job completion.
Interesting possibility, though...consider it another option for now.
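To make the rollover idea concrete, here is a rough sketch in C of a
counter that wraps around and skips any number still held by a running
job. Everything in it is hypothetical (the sketch_ names, the in-use
table, the release hook); it is not how ORTE tracks jobs today, and it
ignores the race conditions mentioned in point 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one slot per possible 16-bit job counter value. */
#define SKETCH_MAX_JOBS 65536u

typedef struct {
    bool     in_use[SKETCH_MAX_JOBS]; /* true while that job is still running */
    uint32_t next;                    /* where the next search starts */
} sketch_job_table_t;

/* Hand out the next free counter value, wrapping past 65535 back to 0.
 * Returns 0 on success, -1 if every value is held by a live job. */
static int sketch_alloc_local_job(sketch_job_table_t *t, uint16_t *out)
{
    assert(t != NULL && out != NULL);
    for (uint32_t tried = 0; tried < SKETCH_MAX_JOBS; tried++) {
        uint32_t cand = (t->next + tried) % SKETCH_MAX_JOBS;
        if (!t->in_use[cand]) {
            t->in_use[cand] = true;
            t->next = (cand + 1) % SKETCH_MAX_JOBS;
            *out = (uint16_t)cand;
            return 0;
        }
    }
    return -1;
}

/* Assumed to be called once a job's bookkeeping has been cleaned up. */
static void sketch_release_local_job(sketch_job_table_t *t, uint16_t job)
{
    assert(t != NULL);
    t->in_use[job] = false;
}
```

With something like this, case 1 (completed children) simply recycles
numbers, and case 2 (children still running) never hands out a number
that is in use, at the cost of failing once 64k jobs are alive at once.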
On Oct 22, 2008, at 12:53 PM, George Bosilca wrote:
What happens if we roll around with the counter?
george.
On Oct 22, 2008, at 2:49 PM, Ralph Castain wrote:
There recently was activity on the mailing lists where someone was
attempting to call comm_spawn 100,000 times. Setting aside the
threading issues that were the focus of that exchange, the fact is
that OMPI currently cannot handle that many comm_spawns.
The ORTE jobid is composed of two elements:
1. the top 16 bits are an "identifier" for that mpirun
2. the lower 16 bits are a running counter identifying the specific
job/launch for those procs.
Thus, we are limited to 64k (2^16 = 65,536) jobs per mpirun, and hence
to roughly 64k comm_spawns.
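As an illustration of where the limit comes from, here is a minimal
sketch of that two-field layout in C. The sketch_ names and helper
functions are made up for this example and are not the actual ORTE
definitions; only the 16/16 split is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit jobid: [mpirun id : 16][job counter : 16] */
typedef uint32_t sketch_jobid_t;

#define SKETCH_LOCAL_BITS 16u
#define SKETCH_LOCAL_MASK 0xFFFFu

/* Pack the mpirun identifier (top 16 bits) and job counter (low 16 bits). */
static sketch_jobid_t sketch_make_jobid(uint16_t mpirun_id, uint16_t local_job)
{
    return ((sketch_jobid_t)mpirun_id << SKETCH_LOCAL_BITS) | (sketch_jobid_t)local_job;
}

static uint16_t sketch_mpirun_id(sketch_jobid_t jobid)
{
    return (uint16_t)(jobid >> SKETCH_LOCAL_BITS);
}

static uint16_t sketch_local_job(sketch_jobid_t jobid)
{
    return (uint16_t)(jobid & SKETCH_LOCAL_MASK);
}

/* Hand out the next jobid for one mpirun. The counter is kept in a wider
 * type so exhaustion can be detected instead of silently wrapping.
 * Returns 0 on success, -1 once all 65,536 counter values are used. */
static int sketch_next_jobid(uint16_t mpirun_id, uint32_t *counter, sketch_jobid_t *out)
{
    assert(counter != NULL && out != NULL);
    if (*counter > SKETCH_LOCAL_MASK) {
        return -1;
    }
    *out = sketch_make_jobid(mpirun_id, (uint16_t)*counter);
    (*counter)++;
    return 0;
}
```

In this sketch the 65,537th request fails, which is the ceiling a
100,000-spawn loop would run into.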
Expanding this would require either revamping the entire way we
handle jobs (e.g., removing the mpirun identifier - a major effort),
or expanding orte_jobid_t from its current 32 bits to 64 bits.
Is this a problem we want to address?
Ralph
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel