What : when nothing has been received for a very long time (e.g. 5
minutes), stop busy-polling in opal_progress() and switch to a
usleep-based polling loop.
Why : when we have long waits, and especially when an application is
deadlocked, the situation is hard to detect and a lot of power is wasted
until the end of the time slice (if there is one).
Where : an example of how it could be implemented is available at
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/
Principle
=========
opal_progress() ensures the progression of MPI communication. The current
algorithm is a loop calling progress on all registered components. If the
program is blocked, the loop will busy-poll indefinitely.
Going to sleep after a certain amount of time with nothing received is
useful for two reasons :
- An administrator can easily detect whether a job is deadlocked : all the
processes are in sleep(). Currently, all processors are at 100% CPU and
it is very hard to know whether progress is still being made.
- When there is nothing to receive, power usage is greatly reduced.
However, it could hurt performance in some cases, typically if we go to
sleep just before the message arrives. The impact depends heavily on the
parameters given to the sleep mechanism.
As a first approximation : if the sleep takes T usec, then starting to
sleep only after 10000 x T of unsuccessful polling should slow down
receives by less than 0.01%.
However, other processes may suffer from this process being late, and be
delayed by T usec (which may represent more than 0.01% for them).
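
The estimate above is simply the ratio of one sleep to the polling time
that precedes it. A minimal sketch of that arithmetic follows; the function
name is hypothetical and is not part of Open MPI or of the proposed patch.

```c
/* Hypothetical helper, for illustration only (not part of Open MPI).
 * Worst-case relative slowdown of a receive on the sleeping process:
 * a wait that already lasted trigger_us is extended by at most one
 * sleep of duration_us. Both arguments must use the same unit. */
static double sleep_overhead_ratio(double trigger_us, double duration_us)
{
    return duration_us / trigger_us;
}
```

With trigger = 10000 x T the ratio is 1e-4, i.e. 0.01%. With the defaults
proposed in this RFC (600 s trigger, 1 ms sleep) it is about 1.7e-6. This
bound only covers the sleeping process itself, not the peers waiting on it.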
So, the goal of this mechanism is mainly to detect far-too-long waits, and
it should almost never be triggered in normal MPI jobs. It could also emit
a warning message when it starts sleeping, or at least a trace in the
notifier.
Details of Implementation
=========================
Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress()
calls before we start the timer (to prevent latency impact). It defaults
to -1, which completely deactivates the sleep (and is therefore equivalent
to the former code). A value of 1000 can be thought of as a starting point
to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to
low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further
unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
The duration is long enough to make the process show 0% CPU in top, but
short enough to preserve a good trigger/duration ratio.
The trigger is deliberately high, for the same reason. Indeed, to prevent
delays from causing chain reactions, the trigger should be higher than
duration * numprocs (both expressed in the same unit).
Possible Improvements & Pitfalls
================================
* Trigger could be set automatically to max(trigger, duration * numprocs *
2).
* poll_start and poll_count could be fields of the opal_condition_t
struct.
* The sleep section could be factored out into a #define and replicated in
all the progress paths (I'm not sure my patch is correct for progress
threads, for example).
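
For the automatic trigger suggested in the first item, note that the
trigger is expressed in seconds and the duration in microseconds, so a
conversion is needed before taking the max. A possible sketch, with a
hypothetical function name and rounding up to whole seconds as an
assumption :

```c
/* Hypothetical helper, for illustration only (not part of Open MPI).
 * Returns max(trigger, duration * numprocs * 2) in seconds, with
 * trigger_s in seconds and duration_us in microseconds. */
static long auto_trigger_s(long trigger_s, long duration_us, long numprocs)
{
    /* Chain-reaction bound in microseconds; 64-bit to avoid overflow. */
    long long chain_us = (long long)duration_us * numprocs * 2;
    /* Convert to seconds, rounding up so the bound is never undershot. */
    long long chain_s = (chain_us + 999999) / 1000000;
    return chain_s > trigger_s ? (long)chain_s : trigger_s;
}
```

With the default values (600 s, 1000 us), the computed bound only exceeds
the configured trigger above 300000 processes.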