At this week's telecon, we talked about when to branch the 1.3 release. I was asked if I could provide a list of where we stand relative to promised functionality, at least as far as the RTE is concerned.
Here is what I have compiled, in rough grouping by priority as expressed to me:

Promised, and needed
* topo mapper - automated mapping that puts ranks on network-nearest neighbors. Required by several of LANL's more ambitious science projects. I'll hopefully have a prototype in the system before leaving on vacation.
* xml output - required for Eclipse PTP support, desired by several other tools. As per the telecon, there is no way we can get something meaningful in the system before the proposed code branch. However, this is needed by Oct for PTP - more lenient timeframe from the other tools. What we -can- do is get the output framework created before the branch, and then add the xml component during the summer - but that requires a change from our usual policy of no new components in sub-releases. Requires new mpirun cmd line flag: -xml proposed.
* upgrades to the sequential mapper - add ability to provide relative sequencing for automated node allocations, claim multiple slots for a rank. All fits within existing component.
* local orted spawn - ability for remote orted to locally spawn a coprocessor process. Required for hybrid RR where MPI procs are needed on the coprocessor. Basic elements are in system, but need to be completed now that launch system is stabilizing.

Promised, could be delayed
* minimizing HNP sockets - everything we need is in the system. What we need is just to pass to the orteds the nodemap in a manner that they can decode and use during their startup so they don't have to call back to the HNP. The scheme has been designed - just needs to be implemented.
* carto routed - uses the provided network topology to define RML message routing, thus minimizing message hops during startup.
* direct/standalone launch - I believe the basic infrastructure is now present, and indeed at least a couple of systems use standalone launch methods now. Expanding that to additional environments will take new PLM/ESS components, perhaps with supporting utilities.
  Likely not appropriate for a sub-release.
* static ports - basic infrastructure for procs and orteds to use static OOB/TCP ports, but we don't currently take advantage of it. This shouldn't require any API changes or major restructuring of code as everything required is already there.
* add-hostfile, add-host - these were included in the hostfile wiki page description as they had been requested by several users. If not included in 1.3, we need to update the wiki page and include that fact in the FAQ section, at the least, since users were told this would be supported.

Wanted/Requested by various users or developers
* orted sm file - some of our improved behavior depends upon exclusive use of nodes. We can remove that constraint by letting jobs from different users that are colocated on a node have knowledge of each other's existence. It has been proposed that this be accomplished by creating a shared memory area that the procs/orteds can access to find out who else is on a node, what static ports they are using, etc. Design still to be worked out.
* usage reporting - add appropriate mpirun cmd line option to request the orteds to report proc resource usage upon proc termination. Pretty trivial to do. Requested by a few users and a couple of tool developers.
* tool query support - ability for a tool to interactively query process/job status, usage stats, etc. The tool comm library is partially implemented today, but doesn't support the full range of requested functionality.
* support for recursive mpirun calls - this has come up a few times on the user list. Basically, it requires adding a new mpirun cmd line option (--recursive) so mpirun can purge the environment of mca params set during spawn before calling orte_init.

Future improvements
* reduced launch messaging - put launch information in orted's environment (for systems that support it) so that orted can determine and launch its local procs without communicating back to the HNP.
  We have a design for this capability, but have purposely held off implementation until after the 1.3 branch.
* minimized mpirun memory footprint - we currently store a bunch of info to support various debuggers, c/r, etc. This info isn't actually required to be stored for operation of the MPI job and/or ORTE, so it could either be released or simply not created. This plan calls for yet another option(!) that would tell mpirun to minimize its memory footprint. Design has been done - implementation has not started.
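
To make the recursive mpirun item above a bit more concrete, here is a rough sketch of the environment purge, in Python purely for illustration (the real change would live in mpirun's C code). It relies on MCA params being passed through the environment as variables prefixed with OMPI_MCA_. The function name purge_mca_params is made up for this example, and the sketch assumes every such variable was set by the parent mpirun during spawn, which a real implementation would have to decide more carefully.

```python
import os

# Prefix under which MCA params appear in the environment.
MCA_ENV_PREFIX = "OMPI_MCA_"


def purge_mca_params(env):
    """Return a copy of env with all MCA param variables removed.

    Hypothetical helper: it treats every OMPI_MCA_* variable as one
    inherited from the parent mpirun, and leaves the input untouched.
    """
    return {
        name: value
        for name, value in env.items()
        if not name.startswith(MCA_ENV_PREFIX)
    }


if __name__ == "__main__":
    # Show which inherited variables a nested mpirun would drop.
    inherited = dict(os.environ)
    cleaned = purge_mca_params(inherited)
    for name in sorted(set(inherited) - set(cleaned)):
        print("would purge", name)
```

In mpirun this step would only run when --recursive is given, and it would have to happen before orte_init so the nested run starts from a clean set of params.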