The problem is that the two mpiruns don't know about each other, and therefore the second mpirun doesn't know that another mpirun has already used socket 0.

We hope to change that at some point in the future.
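A quick way to see this in practice is to have every rank report the affinity mask it actually ended up with. The following is only a minimal, Linux-specific sketch (it assumes glibc's sched_getaffinity() and the CPU_* macros; the file name is made up): run it under two separate mpiruns on the same node and compare the CPU lists.

/* affinity_report.c - minimal sketch: each rank prints the logical CPUs
 * it is allowed to run on, so two concurrent jobs can be compared. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, off = 0;
    cpu_set_t mask;
    char buf[1024];

    buf[0] = '\0';
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (i = 0; i < CPU_SETSIZE && off < (int)sizeof(buf) - 8; i++) {
            if (CPU_ISSET(i, &mask))                 /* CPU i is in our mask */
                off += snprintf(buf + off, sizeof(buf) - off, "%d ", i);
        }
    }
    printf("rank %d: allowed cpus: %s\n", rank, buf);

    MPI_Finalize();
    return 0;
}

If both jobs print the same CPU list, they are sharing (and oversubscribing) the same socket.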

Ralph


On Aug 17, 2009, at 4:02 AM, Lenny Verkhovsky wrote:

In a multi-job environment, can't we just start binding processes on the first available and unused socket?
I mean, the first job/user will start binding itself from socket 0,
the next job/user will start binding itself from socket 2, for instance.
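A very rough sketch of what that node-local bookkeeping might look like (purely hypothetical, assuming a made-up claims file under /tmp that each launcher updates under an exclusive lock; nothing like this exists in the code today):

/* claim_socket.c - hypothetical sketch of the "first free socket" idea:
 * each launcher records which sockets are taken in a node-local file
 * and claims the lowest one that is still free.  Not real Open MPI code. */
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Returns the index of the first unclaimed socket, or -1 if none is free. */
int claim_first_free_socket(int num_sockets)
{
    const char *path = "/tmp/socket_claims";    /* made-up file name */
    unsigned int claimed = 0;
    int fd, i, s = -1;

    fd = open(path, O_RDWR | O_CREAT, 0666);
    if (fd < 0)
        return -1;
    flock(fd, LOCK_EX);                         /* serialize concurrent launchers */

    if (read(fd, &claimed, sizeof(claimed)) != (ssize_t)sizeof(claimed))
        claimed = 0;                            /* new or empty file */

    for (i = 0; i < num_sockets && i < 32; i++) {
        if (!(claimed & (1u << i))) {           /* nobody claimed socket i yet */
            claimed |= (1u << i);
            s = i;
            break;
        }
    }

    lseek(fd, 0, SEEK_SET);
    write(fd, &claimed, sizeof(claimed));       /* record our claim */
    flock(fd, LOCK_UN);
    close(fd);
    return s;
}

A real implementation would of course also have to release the claim when the job exits and cope with stale files, which is part of why this isn't trivial.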
Lenny.

On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain <r...@open-mpi.org> wrote:

On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote:

Chris Samuel wrote:

----- "Eugene Loh" <eugene....@sun.com> wrote:

This is an important discussion.

Indeed! My big fear is that people won't pick up the significance
of the change and will complain about performance regressions
in the middle of an OMPI stable release cycle.
2) The proposed OMPI bind-to-socket default is less severe. In the
general case, it would allow multiple jobs to bind in the same way
without oversubscribing any core or socket. (This comment added to
the trac ticket.)

That's a nice clarification, thanks. I suspect though that the
same issue we have with MVAPICH would occur if two 4 core jobs
both bound themselves to the first socket.

Okay, so let me point out a second distinction from MVAPICH: the default policy would be to spread out over sockets.

Let's say you have two sockets, with four cores each. Let's say you submit two four-core jobs. The first job would put two processes on the first socket and two processes on the second. The second job would do the same. The loading would be even.
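Put differently, the proposed mapping just round-robins ranks across sockets. Here is a toy sketch of that arithmetic (the socket and core counts are hard-coded to match the example above; nothing is queried from the hardware):

/* bysocket_map.c - toy sketch of the "spread over sockets" placement for
 * the example above: 2 sockets x 4 cores and two 4-process jobs. */
#include <stdio.h>

int main(void)
{
    const int num_sockets = 2;      /* hard-coded example values */
    const int procs_per_job = 4;
    int job, rank;

    for (job = 0; job < 2; job++) {
        for (rank = 0; rank < procs_per_job; rank++) {
            /* round-robin ranks across sockets; each process is bound to
             * the whole socket rather than to one particular core */
            int socket = rank % num_sockets;
            printf("job %d rank %d -> socket %d\n", job, rank, socket);
        }
    }
    return 0;
}

Each job ends up with ranks 0 and 2 on socket 0 and ranks 1 and 3 on socket 1, so the two jobs together put four processes on each four-core socket and nothing is oversubscribed.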

I'm not saying there couldn't be problems. It's just that MVAPICH2 (at least what I looked at) has multiple shortfalls: it fills up one socket after another (which decreases memory bandwidth per process and increases the chances of collisions with other jobs), and it binds to core (increasing the chances of oversubscribing cores). The proposed OMPI behavior distributes over sockets (improving memory bandwidth per process and reducing collisions with other jobs) and binds to sockets (reducing the chances of oversubscribing cores, whether due to other MPI jobs or due to multithreaded processes). So, the proposed OMPI behavior mitigates those problems.

It would be even better to have binding selections adapt to other bindings on the system.

In any case, regardless of what the best behavior is, I appreciate the point about changing behavior in the middle of a stable release. Arguably, leaving significant performance on the table in typical situations is a bug that warrants fixing even in the middle of a release, but I won't try to settle that debate here.

I think the problem here, Eugene, is that performance benchmarks are far from the typical application. We have seen this repeatedly: optimizing for benchmarks frequently makes applications run less efficiently. So I concur with Chris on this one - let's not go -too- benchmark-happy and hurt the regular users.

Here at LANL, binding to-socket instead of to-core hurts performance by ~5-10%, depending on the specific application. Of course, either binding method is superior to no binding at all...

UNLESS you have a threaded application, in which case -any- binding can be highly detrimental to performance.
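That effect is easy to observe with a toy hybrid code. The sketch below is Linux-specific (it assumes sched_getcpu() and pthreads, and the threads make no MPI calls); each rank's threads simply report which CPU they happen to be running on. Under bind-to-core they all report the rank's single core, while under bind-to-socket they can spread across that socket.

/* thread_report.c - minimal sketch: each rank spawns a few threads and
 * each thread reports the CPU it is currently running on. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <pthread.h>
#include <mpi.h>

#define NTHREADS 4

static int g_rank;                      /* this process's MPI rank */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* sched_getcpu() is only a momentary sample, but it shows which
     * CPUs the rank's threads are confined to */
    printf("rank %d thread %ld on cpu %d\n", g_rank, id, sched_getcpu());
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t th[NTHREADS];
    int provided;
    long i;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    MPI_Finalize();
    return 0;
}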

So going slow on this makes sense. If we provide the capability, but leave it off by default, then people can test it against real applications and see the impact. Then we can better assess the right default settings.

Ralph


3) Defaults (if I understand correctly) can be set differently
on each cluster.

Yes, but the defaults should be sensible for the majority of
clusters.  If the majority do indeed share nodes between jobs
then I would suggest that the default should be off and the
minority who don't share nodes should have to enable it.

In debates on this subject, I've heard people argue that:

*) Though nodes are getting fatter, most are still thin.

*) Resource managers tend to space share the cluster.