Re: [OMPI devel] [RFC] Low pressure OPAL progress

Sylvain Jeaugey Tue, 9 Jun 2009 03:55:59 -0400

Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not performed progress for, say, 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%.

I do not call sleep() but usleep(). The result is much the same, but it hurts performance less in case of an (unexpected) restart.

However, the goal of my RFC was also to find out whether there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying.

Don't worry, I was quite expecting the configure-in requirement. However, I don't think my patch is good for inclusion; it is only an example to describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day.
I'm assuming you don't mean that you actually call "sleep()" as this would be very bad - I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems.
Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code.
HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
What: when nothing has been received for a very long time (e.g. 5 minutes), stop busy polling in opal_progress and switch to a usleep-based loop.
Why: when we have long waits, and especially when an application is deadlocked, detecting it is not easy and a lot of power is wasted until the end of the time slice (if there is one).
Where: an example of how it could be implemented is available at http://bitbucket.org/jeaugeys/low-pressure-opal-progress/
Principle
=========
opal_progress() ensures the progression of MPI communication. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinitely.
Going to sleep after a certain amount of time with nothing received is useful for two reasons:
- Administrators can easily detect whether a job is deadlocked: all the processes are in sleep(). Currently, all processors are using 100% CPU and it is very hard to know whether progress is still being made.
- When there is nothing to receive, power usage is greatly reduced.
However, it could hurt performance in some cases, typically if we go to sleep just before the message arrives. This will depend heavily on the parameters you give to the sleep mechanism.
At first, we can start with the following assumption: if the sleep takes T usec, then starting to sleep only after 10000 x T usec of idle time should slow down receives by less than 0.01%.
However, other processes may suffer from you being late, and be delayed by T usec (which may represent more than 0.01% for them).
So, the goal of this mechanism is mainly to detect far-too-long waits, and it should almost never be triggered in normal MPI jobs. It could also trigger a warning message when starting to sleep, or at least a trace in the notifier.
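The 0.01% figure above can be read as the ratio of one sleep to the idle time that must elapse before sleeping starts. This is a worked reading of that estimate, assuming at most one extra sleep of length T is added to a wait that has already lasted at least 10000 x T; the second line plugs in the default duration (1 ms) and trigger (600 s) listed under "Details of Implementation".

```latex
\[
\frac{T}{10000\,T} = 10^{-4} = 0.01\,\%
\]
\[
\frac{1\ \text{ms}}{600\ \text{s}}
  = \frac{10^{-3}\ \text{s}}{600\ \text{s}}
  \approx 1.7 \times 10^{-6}
  \approx 0.00017\,\%
\]
```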
Details of Implementation
=========================

Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress() calls before we start the timer (to prevent latency impact). It defaults to -1, which completely deactivates the sleep (and is therefore equivalent to the former code). A value of 1000 can be thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
The duration is big enough to make the process show 0% CPU in top, but low enough to preserve a good trigger/duration ratio.
The trigger is deliberately high to keep a good trigger/duration ratio. Indeed, to prevent delays from causing chain reactions, trigger should be higher than duration * numprocs.
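To make the interplay of the three parameters concrete, here is a minimal sketch in C. It is not the code from the patch: the type and function names (lp_params_t, lp_state_t, lp_after_progress) are hypothetical, and the current time is passed in explicitly so that the decision logic can be exercised without a real clock. Only poll_count and poll_start echo names used in this RFC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the three parameters described above. */
typedef struct {
    int64_t sleep_count;       /* idle calls before the timer starts; -1 disables */
    int64_t sleep_trigger_s;   /* idle seconds before low-pressure mode */
    int64_t sleep_duration_us; /* sleep per idle call once triggered */
} lp_params_t;

/* Hypothetical per-process polling state. */
typedef struct {
    int64_t poll_count;    /* consecutive progress calls with no events */
    int64_t poll_start_s;  /* time at which the idle timer was started */
    bool    timer_running; /* true once poll_count reached sleep_count */
} lp_state_t;

/*
 * Decide what to do after one progress call.
 * events: number of events the call completed (0 = unsuccessful call).
 * now_s:  current time in seconds, supplied by the caller.
 * Returns the number of microseconds to sleep, or 0 to keep busy polling.
 */
static int64_t lp_after_progress(const lp_params_t *p, lp_state_t *s,
                                 int events, int64_t now_s)
{
    if (p->sleep_count < 0) {
        return 0;                      /* mechanism disabled */
    }
    if (events > 0) {
        s->poll_count = 0;             /* progress was made: reset everything */
        s->timer_running = false;
        return 0;
    }
    if (s->poll_count < p->sleep_count) {
        s->poll_count++;               /* still in the latency-sensitive phase */
        return 0;
    }
    if (!s->timer_running) {
        s->timer_running = true;       /* enough idle calls: start the timer */
        s->poll_start_s = now_s;
        return 0;
    }
    if (now_s - s->poll_start_s >= p->sleep_trigger_s) {
        return p->sleep_duration_us;   /* low-pressure mode: sleep this long */
    }
    return 0;                          /* timer running, trigger not reached */
}
```

In a real progress loop the caller would pass a nonzero return value to usleep(). Keeping the clock lookup behind the sleep_count check is one way to avoid adding a time query to every call on the critical path.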
Possible Improvements & Pitfalls
================================
* Trigger could be set automatically at max(trigger, duration * numprocs * 2).
* poll_start and poll_count could be fields of the opal_condition_t struct.
* The sleep section may be moved into a #define and reused in all the progress paths (I'm not sure my patch is good for progress threads, for example).
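The automatic trigger suggested in the first item could be sketched as follows. The function name lp_auto_trigger_s is hypothetical, and the unit handling is an assumption: since the trigger is given in seconds and the duration in microseconds, the product duration * numprocs * 2 is converted to seconds (rounded up) before the comparison.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return max(trigger, duration * numprocs * 2),
 * expressed in seconds. trigger_s is in seconds, duration_us in
 * microseconds; the lower bound is rounded up to a whole second.
 */
static int64_t lp_auto_trigger_s(int64_t trigger_s, int64_t duration_us,
                                 int64_t numprocs)
{
    int64_t floor_us = duration_us * numprocs * 2;
    int64_t floor_s = (floor_us + 999999) / 1000000; /* ceiling division */
    return trigger_s > floor_s ? trigger_s : floor_s;
}
```

With the default duration of 1000 us, the configured trigger of 600 s only gets raised once the job exceeds 300000 processes, so for most jobs this adjustment would leave the default untouched.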
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel