There was recently some activity on the mailing lists where someone was attempting to call comm_spawn 100,000 times. Setting aside the threading issues that were the focus of that exchange, the fact is that OMPI currently cannot handle that many comm_spawns.

The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun

2. the lower 16 bits are a running counter identifying the specific job/launch for those procs.

Thus, with a 16-bit counter, we are limited to 64k (2^16 = 65,536) comm_spawns per mpirun.
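To make the arithmetic concrete, here is a minimal sketch of the layout described above. It is an illustration only, not the actual ORTE code: the type jobid_sketch_t and the helpers make_jobid, jobid_mpirun, jobid_local, and next_local_job are hypothetical names, and the choice to report exhaustion with a -1 return is an assumption made for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit jobid: upper 16 bits identify the
 * mpirun, lower 16 bits are the running job/launch counter. */
typedef uint32_t jobid_sketch_t;

/* Pack an mpirun identifier and a local job counter into one 32-bit value. */
static jobid_sketch_t make_jobid(uint16_t mpirun_id, uint16_t local_job)
{
    return ((uint32_t)mpirun_id << 16) | (uint32_t)local_job;
}

/* Extract the upper 16 bits (the mpirun identifier). */
static uint16_t jobid_mpirun(jobid_sketch_t jobid)
{
    return (uint16_t)(jobid >> 16);
}

/* Extract the lower 16 bits (the running job counter). */
static uint16_t jobid_local(jobid_sketch_t jobid)
{
    return (uint16_t)(jobid & 0xFFFFu);
}

/* Advance the counter. Returns 0 and stores the next value in *next,
 * or returns -1 once all 16-bit values are used up (assumed behavior
 * for this sketch; a real implementation might handle this differently). */
static int next_local_job(uint16_t current, uint16_t *next)
{
    if (current == UINT16_MAX) {
        return -1;  /* counter exhausted: no room for another launch */
    }
    *next = (uint16_t)(current + 1);
    return 0;
}
```

With this layout, a loop that calls next_local_job once per spawn fails well before a 100,000th call, because the counter has only 65,536 distinct values.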

Expanding this would require either revamping the entire way we handle jobs (e.g., removing the mpirun identifier, which would be a major effort), or expanding the orte_jobid_t from its current 32 bits to 64 bits.

Is this a problem we want to address?
Ralph
