Hi all -

LANL had an internal meeting yesterday trying to classify a number of
issues we're having with the run-time environment for Open MPI and how
to best prioritize team resources.  We thought it would be good to
both share the list (with priorities) with the group and to ask the
group if there were other issues that need to be addressed (either
short or long term).  We've categorized the issues as performance
related, robustness, and feature / platform support.  The numbers are
the current priority on our list, and items within a category are
sorted by priority.


PERFORMANCE:

5) 50% scale factor in process startup

   Start-up of non-MPI jobs has a strange bend in the timing curve
   when the number of processes we are trying to start is greater than
   or equal to 50% of the current allocation.  It appears that
   starting a 16-process (1 ppn) job takes longer if there are 32
   nodes in the allocation than if there are 64 nodes in the
   allocation.

   Assigned to: Galen

6) MPI_INIT startup timings

   In addition to apparently suffering from the same 50% issue as the
   previous item, there also appear to be a number of places in
   MPI_INIT where we spend a considerable amount of time at scale,
   leading to startup times much worse than LA-MPI or
   MPIEXEC/MVAPICH.

   Assigned to: Galen


ROBUSTNESS:

1) MPI process aborting issue

   This is the orted spin, MPI processes don't die, etc. issue that
   occurs when some process dies unexpectedly.  Ralph has already sent
   a detailed e-mail to devel about this issue.

   Assigned to: Ralph

1.5) MPI_ABORT rework

   The MPI process aborting issue is going to require a rework of
   MPI_ABORT so that it uses the error manager instead of calling
   terminate_proc/terminate_job.

   Assigned to: Brian

2) ORTE hangs when start-up fails

   If an orted fails to start or fails to connect back to the HNP, the
   system hangs waiting for the callback.  If an orted process fails to
   start entirely, we sometimes catch this, but we need a better
   mechanism for handling the general failure case; a sketch of a
   bounded wait follows below.

   Assigned to: Ralph
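
   As a strawman for that better mechanism, here's a minimal sketch of
   a bounded wait for the callback using plain POSIX condition
   variables.  None of the names below are real ORTE symbols; the
   point is just that the HNP should wake up and declare the launch
   failed rather than block forever.

   #include <pthread.h>
   #include <time.h>
   #include <errno.h>

   static pthread_mutex_t cb_lock = PTHREAD_MUTEX_INITIALIZER;
   static pthread_cond_t  cb_cond = PTHREAD_COND_INITIALIZER;
   static int callback_arrived = 0;   /* set by the OOB receive handler */

   /* Hypothetical hook: called when the orted phones home. */
   void orted_callback_received(void)
   {
       pthread_mutex_lock(&cb_lock);
       callback_arrived = 1;
       pthread_cond_signal(&cb_cond);
       pthread_mutex_unlock(&cb_lock);
   }

   /* Wait up to timeout_sec for the callback; 0 on success, -1 if we
    * timed out and should declare the launch failed. */
   int wait_for_orted_callback(int timeout_sec)
   {
       struct timespec deadline;
       int rc = 0;

       clock_gettime(CLOCK_REALTIME, &deadline);
       deadline.tv_sec += timeout_sec;

       pthread_mutex_lock(&cb_lock);
       while (!callback_arrived && 0 == rc) {
           rc = pthread_cond_timedwait(&cb_cond, &cb_lock, &deadline);
       }
       pthread_mutex_unlock(&cb_lock);

       return (ETIMEDOUT == rc) ? -1 : 0;
   }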

3) Hardened cleanup of session directory

   While #1 should greatly help in ensuring that the session directory
   is cleaned up every time, there are still a number of race
   conditions that need to be sorted out.  The goal is to develop a
   plan that ensures files needing removal are removed automatically a
   high percentage of the time, that a tool like orte_clean can clean
   up everything it should, and that files which should not be
   automatically removed never are.

   Assigned to: Brian

3.5) Process not found hangs

   See https://svn.open-mpi.org/trac/ompi/ticket/245

   Assigned to: Ralph

7) Node death failures / hangs

   With the exception of BProc, if a node fails, we don't detect the
   failure.  Even if we did detect the failure, we have no general
   mechanism for dealing with it.  The bulk of this project is going
   to be adding a general SOH/SMR component that uses the OOB for
   timeout pings; see the sketch below.

   Assigned to: Brian
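
   For illustration, the sort of bookkeeping such a component might
   keep.  Everything here (the types, the names, the 30 second
   threshold) is invented for the sketch, not existing ORTE code: the
   OOB ping handler stamps last_heard, and a periodic sweep flags
   nodes that have gone silent.

   #include <time.h>
   #include <stdio.h>

   #define SOH_TIMEOUT_SEC 30        /* illustrative threshold */

   typedef struct {
       int    node_id;
       time_t last_heard;            /* stamped when an OOB ping answers */
       int    declared_dead;
   } node_state_t;

   /* Called from the (hypothetical) OOB handler on a ping response. */
   void soh_ping_ack(node_state_t *node)
   {
       node->last_heard = time(NULL);
   }

   /* Periodic sweep: declare nodes dead once they miss the timeout. */
   void soh_sweep(node_state_t *nodes, int nnodes)
   {
       time_t now = time(NULL);
       int i;

       for (i = 0; i < nnodes; ++i) {
           if (!nodes[i].declared_dead &&
               now - nodes[i].last_heard > SOH_TIMEOUT_SEC) {
               nodes[i].declared_dead = 1;
               fprintf(stderr, "SOH: node %d missed ping timeout, "
                       "marking it down\n", nodes[i].node_id);
               /* ...then notify the error manager to clean up... */
           }
       }
   }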

15) More friendly error messages

   There are situations where we emit something south of a useful
   error message when an error occurs.  We should play nicer with
   users; a sketch using the existing show_help facility follows.

   Assigned to:
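
   We already have the show_help facility for this kind of thing; a
   sketch of putting it to work (the help file name, topic, and
   function below are invented for the example, and header paths are
   approximate):

   #include "opal/util/show_help.h"

   void report_bad_hostfile(const char *path)
   {
       /* Prints a full, pre-written explanation from the help text
        * file instead of a terse errno string. */
       opal_show_help("help-ras-hostfile.txt", "hostfile:not-found",
                      true, path);
   }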

16) Consistent error checking

   We've had a number of recent instances of errors occurring but not
   being propagated / returned to the user, simply because no one ever
   checked the return code.  We need to audit most of ORTE to always
   check return codes; the target idiom is sketched below.

   Assigned to:
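
   The idiom the audit should enforce looks something like the
   following; do_next_step() is a made-up stand-in for whatever call
   comes next, while ORTE_SUCCESS and ORTE_ERROR_LOG are the existing
   constant / macro for exactly this purpose (header paths are
   approximate).

   #include "orte/orte_constants.h"        /* ORTE_SUCCESS */
   #include "orte/mca/errmgr/errmgr.h"     /* ORTE_ERROR_LOG */

   extern int do_next_step(void);          /* hypothetical callee */

   int audited_function(void)
   {
       int rc;

       /* Before: do_next_step();  -- error silently dropped. */

       /* After: check, log where the error first surfaced, and
        * propagate the return code to the caller. */
       if (ORTE_SUCCESS != (rc = do_next_step())) {
           ORTE_ERROR_LOG(rc);
           return rc;
       }
       return ORTE_SUCCESS;
   }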


FEATURE / PLATFORM SUPPORT:

4) TM error handling

   TM, while used on a number of large systems LANL needs to support,
   is not exactly friendly to usage at scale.  It seems that it likes
   to go away and cry to mamma for a couple of seconds, returning
   system error messages, only to come back and be ok a second later.
   This means that every TM call needs to be handled as if it's going
   to fail, and we need to be prepared to re-initialize the system (if
   possible) when failures occur.  In testing on t-bird, launching was
   usually pretty stable, but the calls to get the node allocations
   tended to trigger the strange behavior.  These should definitely be
   treated as re-startable errors.

   Assigned to: Brian
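
   Here's a minimal sketch of that pattern, assuming the standard
   PBS/Torque tm.h interface (tm_init / tm_nodeinfo / tm_finalize and
   the TM_SUCCESS / TM_ESYSTEM codes); the retry count and sleep are
   illustrative guesses, not tuned values.

   #include <unistd.h>
   #include <tm.h>

   #define TM_RETRIES 5

   /* Fetch the node list, retrying when TM temporarily goes away. */
   int get_tm_nodes(tm_node_id **nodes, int *nnodes)
   {
       struct tm_roots roots;
       int rc = TM_ESYSTEM, attempt;

       for (attempt = 0; attempt < TM_RETRIES; ++attempt) {
           rc = tm_init(NULL, &roots);
           if (TM_SUCCESS != rc) {
               sleep(1);        /* TM crying to mamma; give it a second */
               continue;
           }

           rc = tm_nodeinfo(nodes, nnodes);
           tm_finalize();       /* the node list is malloc'd to us, so
                                 * finalizing here should be safe */
           if (TM_SUCCESS == rc) {
               return TM_SUCCESS;
           }
           if (TM_ESYSTEM != rc) {
               break;           /* a real error; don't keep banging on it */
           }
           sleep(1);            /* transient system error; retry */
       }
       return rc;
   }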

8) Heterogeneous Issues

   Assigned to:

9) External connections

   This covers issues like those the Eclipse team is experiencing.
   If, for example, a TCP connection to the seed is severed, it causes
   Open RTE to call abort, which means Eclipse just aborted.  That's
   not so good.  There are other naming / status issues that also need
   to be handled here; see the sketch below for the desired behavior.

   Assigned to:
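
   A hypothetical sketch of the behavior we want instead (none of
   these names are real OOB symbols):

   #include <stdio.h>

   typedef struct {
       int peer_id;
       int connected;
   } oob_peer_t;

   /* Called when the peer's socket reports EOF or an error. */
   void oob_peer_exception(oob_peer_t *peer)
   {
       peer->connected = 0;

       /* Today: abort();  -- fatal for an embedded caller like
        * Eclipse. */

       /* Instead: mark the peer lost and let interested parties
        * decide what to do about it. */
       fprintf(stderr, "OOB: lost connection to peer %d; notifying "
               "subscribers, not aborting\n", peer->peer_id);
       /* notify_peer_lost(peer->peer_id);  -- hypothetical hook */
   }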

9.5) Fix/Complete orte-ps and friends

    orte-ps / orte-clean / etc. all depend on being able to make a
    connection to the orte universe that doesn't result in bad things
    happening.  We should finish these things for obvious reasons.

    Assigned to:

10) Remote connections

   This is similar to #9, but includes the ability to start a remote
   HNP process.

   Assigned to:

11) Dynamic MPI-2 support

   ORTE's support for the MPI-2 dynamics has some well-known issues.
   In addition, we need to change some behaviors at the Open MPI level
   so that it behaves better.

   Assigned to:

12) XCPU support

   The XCPU system is a distributed process management system
   implemented using the Plan 9 filesystem.  An RAS (possibly) and a
   PLS are needed to support launching on XCPU systems.

   Assigned to:

13) Multi-cell support

   Assigned to:

14) Memory usage / null components

   This is related to an e-mail Ralph or Jeff sent yesterday regarding
   support for NULL components.  The idea is to not load all the
   components into memory if null is specified as the preferred
   component name; an illustrative sketch follows.

   Assigned to:
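
   An illustrative sketch only -- the function names here are
   hypothetical, not the real MCA base API:

   #include <string.h>
   #include <stddef.h>

   extern int open_and_select_components(const char *requested);

   int framework_open(const char *requested /* framework's MCA param */)
   {
       /* If the user explicitly asked for the null component, don't
        * pull every DSO for this framework into memory just to end up
        * selecting none of them. */
       if (NULL != requested && 0 == strcmp(requested, "null")) {
           return 0;    /* success: framework intentionally left empty */
       }

       /* ...otherwise the normal open-everything / selection path... */
       return open_and_select_components(requested);
   }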

17) RAS multi-component issues

   If you are in an allocation (say, TM or BProc) and try to specify
   --hostfile on the orterun command line, the hostfile option will be
   ignored and you'll use the previous allocation.  There are some
   other similar cases, all of which can result in rather unexpected
   behavior from the user's point of view.

   Assigned to:


