Hi all - LANL had an internal meeting yesterday to classify a number of issues we're having with the Open MPI run-time environment and to decide how best to prioritize team resources. We thought it would be good both to share the list (with priorities) with the group and to ask whether there are other issues that need to be addressed, either short or long term. We've categorized the issues as performance, robustness, and feature / platform support. The numbers are the current priorities on our list, and items within a category are sorted by priority.
PERFORMANCE:

5) 50% scale factor in process startup
Start-up of non-MPI jobs shows a strange bend in the timing curve when the number of processes being started is greater than or equal to 50% of the current allocation. For example, starting a 16-process (1 ppn) job appears to take longer when there are 32 nodes in the allocation than when there are 64.
Assigned to: Galen

6) MPI_INIT startup timings
In addition to apparently suffering from the same 50% effect as the previous item, there are a number of places in MPI_INIT where we spend a considerable amount of time at scale, leading to startup times much worse than LA-MPI or MPIEXEC/MVAPICH.
Assigned to: Galen

ROBUSTNESS:

1) MPI process aborting issue
This is the orted spin, MPI-processes-don't-die, etc. issue that occurs when a process dies unexpectedly. Ralph has already sent a detailed e-mail to devel about this issue.
Assigned to: Ralph

1.5) MPI_ABORT rework
The MPI process aborting issue is going to require a rework of MPI_ABORT so that it uses the error manager instead of calling terminate_proc/terminate_job.
Assigned to: Brian

2) ORTE hangs when start-up fails
If an orted fails to start or fails to connect back to the HNP, the system hangs waiting for the callback. If an orted process fails to start entirely, we sometimes catch this, but we need a better mechanism for handling the general failure case (a rough sketch of the timeout shape we have in mind follows this list).
Assigned to: Ralph

3) Hardened cleanup of session directory
While #1 should greatly help in ensuring that the session directory is cleaned up every time, there are still a number of race conditions that need to be sorted out. The goal is a plan that ensures (a) files that need to be removed are removed automatically a high percentage of the time, (b) a tool like orte_clean can clean up everything it should, and (c) files that should not be automatically removed aren't.
Assigned to: Brian

3.5) Process not found hangs
See https://svn.open-mpi.org/trac/ompi/ticket/245
Assigned to: Ralph

7) Node death failures / hangs
With the exception of BProc, if a node fails, we don't detect the failure. Even if we did detect it, we have no general mechanism for dealing with it. The bulk of this project is going to be adding a general SOH/SMR component that uses the OOB for timeout pings (a sketch follows this list).
Assigned to: Brian

15) More friendly error messages
There are situations where the message we emit when an error is found is something south of useful. We should play nicer with users.
Assigned to:

16) Consistent error checking
We've had a number of recent instances of errors occurring but not being propagated / returned to the user, simply because no one ever checked the return code. We need to audit most of ORTE to always check return codes (the idiom is sketched after this list).
Assigned to:
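For item 2, the mechanism we have in mind is roughly: arm a timer when the launch starts, disarm it when the orted calls back, and fail the job if the timer fires first. A minimal sketch using the libevent 1.x timer calls (which ORTE already wraps); launch_failed() and the 30-second value are placeholders, and the surrounding event loop is assumed to already be running:

    /* Hedged sketch for item 2: time out the wait for the orted
     * callback instead of hanging forever.  Uses the libevent 1.x
     * timer calls; assumes event_init()/event_dispatch() are handled
     * by the surrounding event loop. */
    #include <sys/time.h>
    #include <event.h>

    #define LAUNCH_TIMEOUT_SEC 30            /* arbitrary example value */

    static struct event launch_timer;

    static void launch_failed(int fd, short what, void *arg)
    {
        (void)fd; (void)what; (void)arg;
        /* The timer fired before the orted called back: declare the
         * launch failed and tear the job down rather than waiting
         * forever. */
    }

    /* Arm the timer just before launching the orteds. */
    static void arm_launch_timer(void)
    {
        struct timeval tv = { LAUNCH_TIMEOUT_SEC, 0 };
        evtimer_set(&launch_timer, launch_failed, NULL);
        evtimer_add(&launch_timer, &tv);
    }

    /* Call from the callback handler when the orted checks in. */
    static void orted_called_back(void)
    {
        evtimer_del(&launch_timer);          /* disarm: launch succeeded */
    }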
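For item 7, here is the rough shape of the SOH/SMR timeout-ping logic. All of the helper names (soh_ping, soh_mark_node_down) are hypothetical; the real version would live in a component and use the non-blocking OOB interfaces:

    /* Hedged sketch for item 7: OOB-based heartbeat.  soh_ping() and
     * soh_mark_node_down() are hypothetical hooks. */
    #include <stdbool.h>

    #define SOH_MAX_MISSED 3        /* missed pings before declaring death */

    struct node_state {
        int  missed;                /* consecutive unanswered pings */
        bool alive;
    };

    extern bool soh_ping(int node_id);            /* hypothetical: OOB ping  */
    extern void soh_mark_node_down(int node_id);  /* hypothetical: notify    */
                                                  /* the failure machinery   */

    /* Run periodically (e.g. from an event-loop timer) for each node. */
    static void soh_check_node(struct node_state *node, int node_id)
    {
        if (soh_ping(node_id)) {
            node->missed = 0;       /* node answered; reset the counter */
        } else if (++node->missed >= SOH_MAX_MISSED && node->alive) {
            node->alive = false;
            soh_mark_node_down(node_id);  /* hand off to failure handling */
        }
    }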
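For item 16, the idiom we want applied consistently across ORTE is the usual check/log/propagate pattern. ORTE_SUCCESS and ORTE_ERROR_LOG() are the existing constant and macro in the tree (in-tree headers assumed); orte_example_call() stands in for any ORTE function that returns a status code:

    /* The idiom for item 16: never drop a return code on the floor. */
    extern int orte_example_call(void);      /* hypothetical callee */

    static int do_something(void)
    {
        int rc;

        if (ORTE_SUCCESS != (rc = orte_example_call())) {
            ORTE_ERROR_LOG(rc);   /* record where the error first surfaced */
            return rc;            /* propagate it instead of swallowing it */
        }
        return ORTE_SUCCESS;
    }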
FEATURE / PLATFORM SUPPORT:

4) TM error handling
TM, while used on a number of large systems LANL needs to support, is not exactly friendly to usage at scale. It seems to like to go away and cry to mamma for a couple of seconds, returning system error messages, only to come back and be fine a second later. This means that every TM call needs to be handled as if it's going to fail, and we need to be prepared to re-initialize the system (if possible) when failures occur. In testing on t-bird, launching was usually pretty stable, but the calls to get the node allocations tended to trigger the strange behavior. These should definitely be treated as restartable errors (a sketch of the retry wrapper appears at the end of this message).
Assigned to: Brian

8) Heterogeneous issues
Assigned to:

9) External connections
This covers issues like those the Eclipse team is experiencing. If, for example, a TCP connection to the seed is severed, Open RTE calls abort, which means Eclipse just aborted. That's not so good. There are other naming / status issues that also need to be handled here.
Assigned to:

9.5) Fix/complete orte-ps and friends
orte-ps / orte-clean / etc. all depend on being able to make a connection to the orte universe without bad things happening. We should finish these tools, for obvious reasons.
Assigned to:

10) Remote connections
This is similar to #9, but includes the ability to start a remote HNP process.
Assigned to:

11) Dynamic MPI-2 support
ORTE's support for the MPI-2 dynamics has some well-known issues. In addition, we need to change some behaviors at the Open MPI level to behave better.
Assigned to:

12) XCPU support
The XCPU system is a distributed process management system implemented using the Plan 9 filesystem. A RAS (possibly) and a PLS are needed to support launching on XCPU systems.
Assigned to:

13) Multi-cell support
Assigned to:

14) Memory usage / null components
This is related to an e-mail Ralph or Jeff sent yesterday regarding support for NULL components. The idea is to not load all the components into memory if "null" is specified as the preferred component name (a sketch of the short-circuit appears at the end of this message).
Assigned to:

15) RAS multi-component issues
If you are in an allocation (say, TM or BProc) and specify --hostfile on the orterun command line, the hostfile option will be ignored and you'll use the previous allocation. There are some other similar cases, all of which can result in rather unexpected behavior from the user's point of view.
Assigned to:
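For item 4, every tm_*() call would get wrapped in something like the following. tm_init(), tm_finalize(), TM_SUCCESS, and struct tm_roots are the real PBS TM interfaces; the retry count, the delay, and the function-pointer wrapper are just illustrative:

    /* Hedged sketch for item 4: treat TM system errors as transient,
     * re-initialize, and retry.  Retry count and delay are arbitrary. */
    #include <unistd.h>
    #include <tm.h>

    #define TM_RETRIES    3
    #define TM_RETRY_WAIT 2                  /* seconds */

    /* Drop the (possibly dead) connection and re-attach to the MOM. */
    static int tm_reconnect(void)
    {
        struct tm_roots roots;

        tm_finalize();
        return tm_init(NULL, &roots);
    }

    /* A retrying wrapper around one TM call; the same pattern would
     * apply to every tm_*() call we make (real calls take arguments,
     * so this function-pointer form is purely illustrative). */
    static int tm_call_with_retry(int (*call)(void))
    {
        int rc = TM_SUCCESS, i;

        for (i = 0; i < TM_RETRIES; ++i) {
            if (TM_SUCCESS == (rc = call())) {
                return TM_SUCCESS;
            }
            sleep(TM_RETRY_WAIT);            /* TM often recovers on its own */
            if (TM_SUCCESS != tm_reconnect()) {
                break;                       /* couldn't re-attach; give up */
            }
        }
        return rc;
    }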
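For item 14, the idea boils down to a short-circuit in the framework open logic, something like the sketch below. Both framework_* functions are hypothetical stand-ins for the MCA base machinery:

    /* Hedged sketch for item 14: skip loading components entirely when
     * the user asked for "null". */
    #include <stdlib.h>
    #include <string.h>

    extern char *framework_get_param_string(const char *framework);
    extern int   framework_open_components(const char *framework);

    static int framework_open(const char *framework)
    {
        char *wanted = framework_get_param_string(framework);

        if (NULL != wanted && 0 == strcmp(wanted, "null")) {
            free(wanted);
            return 0;   /* don't dlopen anything; leave the framework empty */
        }
        free(wanted);
        return framework_open_components(framework);    /* normal path */
    }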