Nadia Derbey wrote:

What: Warn the administrator when unusual events are occurring too
frequently.

Why: Such unusual events might be the symptom of some problem that can
easily be fixed (by a better tuning, for example)

Before Sun HPC ClusterTools adopted the Open MPI code base (that is, in CT6 and earlier), it included performance analysis support called MPProf. See http://docs.sun.com/source/819-4134-10/profile.html#pgfId-999249 . The key characteristic was supposed to be ease of use: set an environment variable before running, run a report generator afterwards, and get a self-explanatory report, with data volumes small enough to be easy to manage.

One part in particular seemed germane to your RFC: advice on implementation-specific environment variables. See http://docs.sun.com/source/819-4134-10/profile.html#pgfId-1000209 . Sun MPI had instrumentation embedded in it that looked for various "performance conditions". Then, in post processing, the report generator would translate that information into user-actionable feedback. At least, that was the concept. The idea would be that all user feedback should include:

*) a brief explanation of what happened ("you ran out of postboxes... see Appendix A.1.b.23 of user guide if you really dare to understand what this means")
*) an estimate of how important this is ("we think this cost you 10% performance")
*) a concise description of what to do to improve performance and discussion of ramifications ("set the environment variable MPI_NUMPOSTBOX to 256 and rerun, this will cost about 50 Mbyte more memory per process")
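To make the three-part feedback concrete, here is a minimal sketch in Python of how a report generator might represent and print one such advisory. It is not MPProf's actual implementation: the `Advisory` class, the `postbox_advisory` helper, and its parameters (stall time, suggested postbox count, extra memory) are all hypothetical names chosen for illustration, and the importance estimate is assumed to be simply the fraction of run time spent stalled.

```python
from dataclasses import dataclass


@dataclass
class Advisory:
    """One piece of user feedback (hypothetical structure, not MPProf's)."""
    explanation: str           # brief explanation of what happened
    estimated_cost_pct: float  # estimate of how important this is
    recommendation: str        # what to do to improve performance
    ramification: str          # side effects of following the advice

    def render(self) -> str:
        return (
            f"What happened: {self.explanation}\n"
            f"Estimated cost: about {self.estimated_cost_pct:g}% of run time\n"
            f"Suggested action: {self.recommendation}\n"
            f"Trade-off: {self.ramification}"
        )


def postbox_advisory(stall_seconds: float, total_seconds: float,
                     suggested_postboxes: int,
                     extra_mbyte_per_process: int) -> Advisory:
    """Build an advisory for a 'ran out of postboxes' condition.

    Assumption: the cost is estimated as stalled time over total run time.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    cost_pct = 100.0 * stall_seconds / total_seconds
    return Advisory(
        explanation="you ran out of postboxes",
        estimated_cost_pct=cost_pct,
        recommendation=(
            "set the environment variable MPI_NUMPOSTBOX to "
            f"{suggested_postboxes} and rerun"
        ),
        ramification=(
            f"this will cost about {extra_mbyte_per_process} Mbyte "
            "more memory per process"
        ),
    )
```

Calling `postbox_advisory(10.0, 100.0, 256, 50).render()` would produce a four-line message covering the explanation, the estimate, the action, and the trade-off. Keeping the fields separate makes it easy to sort advisories by estimated cost so the most important one is shown first.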

The feedback need not be limited to environment variables or implementation-specific conditions. E.g., perhaps one could detect when MPI_Ssend is used in place of MPI_Send and how much performance (unneeded synchronization) that cost.
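As a rough illustration of the MPI_Ssend case, the sketch below post-processes per-call timing records in Python. The record format, a list of `(call_name, seconds)` pairs, and the function names are assumptions made for this example; they do not correspond to any real MPProf or Open MPI output. The time spent inside MPI_Ssend is only an upper bound on what switching to MPI_Send could save, not a measurement of the synchronization cost itself.

```python
from collections import defaultdict


def summarize_calls(events):
    """Total call count and seconds per MPI call name.

    `events` is an iterable of (call_name, seconds) pairs; this record
    format is hypothetical.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for name, seconds in events:
        totals[name][0] += 1
        totals[name][1] += seconds
    return {name: (count, secs) for name, (count, secs) in totals.items()}


def ssend_note(events, total_seconds):
    """Return a feedback string if MPI_Ssend was used, else None."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    count, secs = summarize_calls(events).get("MPI_Ssend", (0, 0.0))
    if count == 0:
        return None
    pct = 100.0 * secs / total_seconds
    return (
        f"MPI_Ssend was called {count} times and accounted for "
        f"{pct:.1f}% of run time; if synchronous semantics are not "
        "required, MPI_Send may avoid some of that wait."
    )
```

In a real tool the records would come from instrumentation inside the library or from an interposition layer; here they are simply passed in as a list.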
