Re: [OMPI devel] [RFC] Low pressure OPAL progress

Ralph Castain Tue, 9 Jun 2009 07:54:41 -0400

My concern with any form of sleep is with the impact on the proc - since opal_progress might not be running in a separate thread, won't the sleep apply to the process as a whole? In that case, the process isn't free to continue computing.

I can envision applications that might call down into the MPI library and have opal_progress not find anything, but there is nothing wrong. The application could continue computations just fine. I would hate to see us put the process to sleep just because the MPI library wasn't busy enough.

Hence my suggestion to just change the tick rate. It would definitely cause a higher latency for the first message that arrived while in this state, which is bothersome, but would meet the stated objective without interfering with the process itself.

LANL has also been looking at this problem of stalled jobs, but from a different approach. We monitor (using a separate job) progress in terms of output files changing in size plus other factors as specified by the user. If we don't see any progress in those terms over some time, then we kill the job. We chose that path because of the concerns expressed above - e.g., on our RR machine, intense computations can be underway on the Cell blades while the Opteron MPI processes wait for us to reach a communication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible.
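
For illustration only, the file-size check described above could be sketched as follows. This is not LANL's actual tool: the names stall_watch_t, stall_watch_init and stall_watch_check are hypothetical, and only the "output file stopped growing" criterion is covered, not the other user-specified factors.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical watchdog state for one output file. */
typedef struct {
    off_t  last_size;    /* size seen at the last change, -1 if never seen */
    time_t last_change;  /* time at which the size last changed */
} stall_watch_t;

static void stall_watch_init(stall_watch_t *w, time_t now)
{
    w->last_size = -1;
    w->last_change = now;
}

/* Returns true when the file size has not changed for more than
 * limit_sec seconds. A file that cannot be stat()ed counts as
 * "no progress". */
static bool stall_watch_check(stall_watch_t *w, const char *path,
                              time_t now, time_t limit_sec)
{
    struct stat st;

    assert(limit_sec >= 0);
    if (stat(path, &st) == 0 && st.st_size != w->last_size) {
        /* The file changed size: record it and restart the clock. */
        w->last_size = st.st_size;
        w->last_change = now;
        return false;
    }
    return (now - w->last_change) > limit_sec;
}
```

A separate monitoring job would call stall_watch_check() periodically with time(NULL) and decide to kill the MPI job once it returns true. The MPI processes themselves keep spinning untouched.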


Just some thoughts...
Ralph


On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:

Sylvain Jeaugey wrote:
Hi Ralph,
I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not performed progress for, say, 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%.
I do not call sleep() but usleep(). The result is much the same, but it hurts performance less in case of an (unexpected) restart.
However, the goal of my RFC was also to know if there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying.
One way around this is to make all blocked communications (even SM) use poll to block for incoming messages. Jeff and I have discussed this and had many false starts on it. The biggest issue is coming up with a way to have blocks on the SM btl converted to the system poll call without requiring a socket write for every packet.
The usleep solution works but is kind of ugly IMO. I think when I looked at doing that the overhead increased significantly for certain communications. Maybe not for toy benchmarks, but for less synchronized processes I saw the usleep adding overhead where I didn't want it to.
--td
Don't worry, I was quite expecting the configure-in requirement. However, I don't think my patch is good for inclusion, it is only an example to describe what I want to achieve.
Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:
I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day.
I'm assuming you don't mean that you actually call "sleep()" as this would be very bad - I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems.
Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code.
HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
What : when nothing has been received for a very long time - e.g. 5 minutes, stop busy polling in opal_progress and switch to a usleep-based one.
Why : when we have long waits, and especially when an application is deadlock'ed, detecting it is not easy and a lot of power is wasted until the end of the time slice (if there is one).
Where : an example of how it could be implemented is available at 
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

Principle
=========
opal_progress() ensures the progression of MPI communication. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinitely.
Going to sleep after a certain amount of time with nothing received is interesting for two reasons :
- An administrator can easily detect whether a job is deadlocked : all the processes are in sleep(). Currently, all processors are using 100% cpu and it is very hard to know if progression is still happening or not.
- When there is nothing to receive, power usage is highly reduced.
However, it could hurt performance in some cases, typically if we go to sleep just before the message arrives. This will highly depend on the parameters you give to the sleep mechanism.
At first, we can start with the following assumption : if the sleep takes T usec, then sleeping only after waiting 10000xT should slow down receives by less than 0.01 %.
However, other processes may suffer from you being late, and be delayed by T usec (which may represent more than 0.01% for them).
So, the goal of this mechanism is mainly to detect far-too-long waits, and it should almost never kick in during normal MPI jobs. It could also trigger a warning message when starting to sleep, or at least a trace in the notifier.
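
The 0.01 % figure above can be checked with a small helper. This is only a sketch of the arithmetic: sleep_overhead_percent is a hypothetical name, and it assumes that a message arriving right after the process goes to sleep is delayed by at most about one sleep duration, i.e. that usleep() does not oversleep noticeably.

```c
#include <assert.h>

/* Worst-case relative slowdown, in percent, of a receive that already
 * waited trigger_us microseconds and is then delayed by one extra
 * sleep of duration_us microseconds. */
static double sleep_overhead_percent(double trigger_us, double duration_us)
{
    assert(trigger_us > 0.0 && duration_us >= 0.0);
    return 100.0 * duration_us / trigger_us;
}
```

With T = 1000 us and a wait of 10000 x T = 10 s before sleeping, the result is 100 * 1000 / 10000000 = 0.01 %. The bound only covers the receiver itself. A peer that waited much less than 10000 x T can see the same T usec as a larger fraction of its own time.
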
Details of Implementation
=========================

Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress() calls before we start the timer (to prevent latency impact). It defaults to -1, which completely deactivates the sleep (and is therefore equivalent to the former code). A value of 1000 can be thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
The duration is big enough to make the process show 0% CPU in top, but low enough to preserve a good trigger/duration ratio.
The trigger is deliberately high to keep a good trigger/duration ratio. Indeed, to prevent delays from causing chain reactions, trigger should be higher than duration * numprocs.
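
To make the interaction of the three parameters concrete, here is a minimal sketch of the decision logic. It is not the code from the patch: low_pressure_t, low_pressure_should_sleep and low_pressure_update are hypothetical names, and only poll_start and poll_count are borrowed from the RFC text. It assumes the caller knows how many events the last progress pass completed.

```c
#define _DEFAULT_SOURCE /* for usleep() on glibc with strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

typedef struct {
    long       sleep_count;    /* opal_progress_sleep_count, -1 disables */
    time_t     sleep_trigger;  /* opal_progress_sleep_trigger, in seconds */
    useconds_t sleep_duration; /* opal_progress_sleep_duration, in usec */
    long       poll_count;     /* consecutive unsuccessful calls so far */
    bool       timer_started;  /* has poll_start been set? */
    time_t     poll_start;     /* time at which the timer was started */
} low_pressure_t;

/* Decide, after one progress pass that completed `events` events,
 * whether the caller should sleep instead of busy polling. */
static bool low_pressure_should_sleep(low_pressure_t *lp, int events,
                                      time_t now)
{
    assert(events >= 0);
    if (lp->sleep_count < 0) {
        return false;                  /* mechanism deactivated */
    }
    if (events > 0) {
        lp->poll_count = 0;            /* progress was made: reset */
        lp->timer_started = false;
        return false;
    }
    if (lp->poll_count < lp->sleep_count) {
        lp->poll_count++;              /* still in the cheap counting phase */
        return false;
    }
    if (!lp->timer_started) {
        lp->poll_start = now;          /* start the timer only now */
        lp->timer_started = true;
        return false;
    }
    return (now - lp->poll_start) >= lp->sleep_trigger;
}

/* Example wiring: sleep for the configured duration when required. */
static void low_pressure_update(low_pressure_t *lp, int events)
{
    if (low_pressure_should_sleep(lp, events, time(NULL))) {
        usleep(lp->sleep_duration);
    }
}
```

Counting calls before touching the clock keeps time() out of the fast path, which is the latency concern the sleep_count parameter addresses. Note that POSIX only guarantees usleep() for values below 1000000 us.
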
Possible Improvements & Pitfalls
================================
* Trigger could be set automatically at max(trigger, duration * numprocs * 2).
* poll_start and poll_count could be fields of the opal_condition_t struct.
* The sleep section may be factored out into a #define and replicated in all the progress paths (I'm not sure my patch is good for progress threads for example)
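
The first improvement in the list above mixes units, since the trigger is given in seconds and the duration in microseconds. A sketch of the automatic computation, with the conversion made explicit, could look like this. effective_trigger_sec is a hypothetical helper, not part of the patch, and rounding the floor up to a whole second is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* max(trigger, duration * numprocs * 2), with trigger in seconds and
 * duration in microseconds. The floor is rounded up to a full second. */
static uint64_t effective_trigger_sec(uint64_t trigger_sec,
                                      uint64_t duration_us,
                                      uint64_t numprocs)
{
    assert(numprocs > 0);
    uint64_t floor_us = duration_us * numprocs * 2;
    uint64_t floor_sec = (floor_us + 999999) / 1000000;
    return trigger_sec > floor_sec ? trigger_sec : floor_sec;
}
```

With the defaults (600 s, 1000 us), the floor only takes over beyond 300000 processes, so the configured trigger is what applies on most machines.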
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel