Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-21 Thread Jeff Squyres

On Aug 17, 2009, at 7:59 PM, Chris Samuel wrote:


Ah, I think I've misunderstood the website then. :-(

It calls 1.3 stable and 1.2 old and I presumed old
meant deprecated. :-(




To clarify...

1.3 *is* stable, meaning "ok for production use."  We test all 1.3  
releases before they go out; they undergo regression testing, etc.


We have two different version series:

1. Odd minor numbers (e.g., 1.3.x): Feature release series.  They're  
stable and usable, but features may come and go during successive  
releases.  To be clear: feature release series does not mean "beta" or  
"we haven't tested this much".


2. Even minor numbers (e.g., 1.4.x): Super stable series.  Only bug  
fixes will be applied; no feature additions or subtractions will occur.


Both series have the same level of testing before they are released.   
The difference is mainly a classification indicating whether new  
features can be added / subtracted or not.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-21 Thread Chris Samuel

- "Eugene Loh"  wrote:

> Actually, the current proposed defaults for 1.3.4 are
> not to change the defaults at all.

Thanks, I hadn't picked up on the latest update to the
trac ticket 3 days ago that says that the defaults will
stay the same. Sounds good to me!

All the best and have a good weekend all!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-21 Thread Eugene Loh




Chris Samuel wrote:

  - "Chris Samuel"  wrote:
  
  
$ mpiexec --mca opal_paffinity_alone 1 -bysocket -bind-to-socket -mca
odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4

  
  To clarify - does that command line accurately reflect the
proposed defaults for OMPI 1.3.4 ?
  

Not the verbose/reporting options.

Actually, the current proposed defaults for 1.3.4 are not to change the
defaults at all.




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-21 Thread Chris Samuel

- "Chris Samuel"  wrote:

> $ mpiexec --mca opal_paffinity_alone 1 -bysocket -bind-to-socket -mca
> odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4

To clarify - does that command line accurately reflect the
proposed defaults for OMPI 1.3.4 ?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-18 Thread Chris Samuel

- "Chris Samuel"  wrote:

> This is most likely because it's getting an error from the
> kernel when trying to bind to a socket it's not permitted
> to access.

This is what strace reports:

18561 sched_setaffinity(18561, 8,  { f0 } 
18561 <... sched_setaffinity resumed> ) = -1 EINVAL (Invalid argument)

so that would appear to be it.
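
For reference, a minimal standalone C sketch (an illustration only, not
Open MPI code) of the same failure mode: asking sched_setaffinity() for
cores outside the calling task's cpuset -- here cores 4-7, matching the
f0 mask in the strace above -- is rejected by the kernel with EINVAL.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    /* Request cores 4-7 (mask 0xf0, i.e. the second socket on that node).
       If the job's cpuset only contains cores 0-3, the kernel refuses. */
    for (int cpu = 4; cpu < 8; ++cpu)
        CPU_SET(cpu, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        printf("sched_setaffinity: %s\n", strerror(errno));  /* EINVAL when outside the cpuset */
    else
        printf("bound to cores 4-7\n");
    return 0;
}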

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-18 Thread Chris Samuel

- "Eugene Loh"  wrote:

> Ah, you're missing the third secret safety switch that prevents
> hapless mortals from using this stuff accidentally!  :^)

Sounds good to me. :-)

> I think you need to add
> 
> --mca opal_paffinity_alone 1


Yup, looks like that's it; it fails to launch with that...


$ mpiexec --mca opal_paffinity_alone 1 -bysocket -bind-to-socket -mca 
odls_base_report_bindings 99 -mca odls_base_verbose 7 ./cpi-1.4
[tango095.vpac.org:18548] mca:base:select:( odls) Querying component [default]
[tango095.vpac.org:18548] mca:base:select:( odls) Query of component [default] 
set priority to 1
[tango095.vpac.org:18548] mca:base:select:( odls) Selected component [default]
[tango095.vpac.org:18548] [[33990,0],0] odls:launch: spawning child 
[[33990,1],0]
[tango095.vpac.org:18548] [[33990,0],0] odls:launch: spawning child 
[[33990,1],1]
[tango095.vpac.org:18548] [[33990,0],0] odls:default:fork binding child 
[[33990,1],0] to socket 0 cpus 000f
[tango095.vpac.org:18548] [[33990,0],0] odls:default:fork binding child 
[[33990,1],1] to socket 1 cpus 00f0
--
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI developers.
--
--
mpiexec was unable to start the specified application as it encountered an error
on node tango095.vpac.org. More information may be available above.
--
4 total processes failed to start


This is most likely because it's getting an error from the
kernel when trying to bind to a socket it's not permitted
to access.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-18 Thread Eugene Loh

Chris Samuel wrote:


OK, grabbed that (1.4a1r21825). Configured with:

./configure --prefix=$FOO --with-openib --with-tm=/usr/local/torque/latest --enable-static --enable-shared

It built & installed OK, but when running a trivial example
with it I don't see evidence for that code getting called.
Perhaps I'm not passing the correct options ?

$ mpiexec -bysocket -bind-to-socket -mca odls_base_report_bindings 99 -mca 
odls_base_verbose 7 ./cpi-1.4
 

Ah, you're missing the third secret safety switch that prevents hapless 
mortals from using this stuff accidentally!  :^)


I think you need to add

   --mca opal_paffinity_alone 1

a name that not even Ralph himself likes!


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-18 Thread Chris Samuel

- "Ralph Castain"  wrote:

> Hi Chris

Hiya,

> The devel trunk has all of this in it - you can get that tarball from 
> the OMPI web site (take the nightly snapshot).

OK, grabbed that (1.4a1r21825). Configured with:

./configure --prefix=$FOO --with-openib --with-tm=/usr/local/torque/latest --enable-static --enable-shared

It built & installed OK, but when running a trivial example
with it I don't see evidence for that code getting called.
Perhaps I'm not passing the correct options ?

$ mpiexec -bysocket -bind-to-socket -mca odls_base_report_bindings 99 -mca 
odls_base_verbose 7 ./cpi-1.4
[tango095.vpac.org:16976] mca:base:select:( odls) Querying component [default]
[tango095.vpac.org:16976] mca:base:select:( odls) Query of component [default] 
set priority to 1
[tango095.vpac.org:16976] mca:base:select:( odls) Selected component [default]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child 
[[36578,1],0]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child 
[[36578,1],1]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child 
[[36578,1],2]
[tango095.vpac.org:16976] [[36578,0],0] odls:launch: spawning child 
[[36578,1],3]
Process 0 on tango095.vpac.org
Process 1 on tango095.vpac.org
Process 2 on tango095.vpac.org
Process 3 on tango095.vpac.org
^Cmpiexec: killing job...

Increasing odls_base_verbose only seems to add the environment being
passed to the child processes. :-(

I'm pretty sure I've got the right code as ompi_info -a
reports the debug setting from the patch:

MCA odls: parameter "odls_base_report_bindings" (current value: <0>, data 
source: default value)

> I plan to work on cpuset support beginning Tues morning.

Great - if there's anything I can help with then please let me know;
I'll be back from leave by then.

All the best,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Ralph Castain

Hi Chris

The devel trunk has all of this in it - you can get that tarball from  
the OMPI web site (take the nightly snapshot).


I plan to work on cpuset support beginning Tues morning.

Ralph

On Aug 17, 2009, at 7:18 PM, Chris Samuel wrote:



- "Eugene Loh"  wrote:

Hi Eugene,

[...]

It would be even better to have binding selections adapt to other
bindings on the system.


Indeed!

This touches on the earlier thread about making OMPI aware
of its cpuset/cgroup allocation on the node (for those sites
that are using it); it might solve this issue quite nicely, as
OMPI would know precisely what cores & sockets were allocated
for its use without having to worry about other HPC processes.

No idea how to figure that out for processes outside of cpusets. :-(


In any case, regardless of what the best behavior is, I appreciate
the point about changing behavior in the middle of a stable release.


Not a problem, and I take Jeff's point about 1.3 not being a
super stable release and thus not being a blocker to changes
such as this.


Arguably, leaving significant performance on the table in typical
situations is a bug that warrants fixing even in the middle of a
release, but I won't try to settle that debate here.


I agree for those cases where there's no downside, and thinking
further on your point of balancing between sockets I can see why
that would limit the impact.

Most of the cases I can think of that would be most adversely
affected come down to other jobs binding to cores naively, and if
that's happening outside of cpusets then the cluster sysadmin
has more to worry about from mixing those applications than
from mixing with OMPI ones which are just binding to sockets. :-)

So I'll happily withdraw my objection on those grounds.

*But* I would like to test this code out on a cluster with
cpuset support enabled to see whether it will behave itself.

Basically if I run a 4 core MPI job on a dual socket system
which has been allocated only the cores on socket 0 what will
happen when it tries to bind to socket 1 which is outside its
cpuset ?

Is there a 1.3 branch or tarball with these patches applied
that I could test out ?

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Chris Samuel

- "Eugene Loh"  wrote:

Hi Eugene,

[...]
> It would be even better to have binding selections adapt to other
> bindings on the system.

Indeed!

This touches on the earlier thread about making OMPI aware
of its cpuset/cgroup allocation on the node (for those sites
that are using it); it might solve this issue quite nicely, as
OMPI would know precisely what cores & sockets were allocated
for its use without having to worry about other HPC processes.

No idea how to figure that out for processes outside of cpusets. :-(
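
A minimal sketch of one way a process could discover that allocation
on Linux -- an illustration only, not OMPI code: sched_getaffinity()
returns the mask the cpuset/cgroup actually allows, so a launcher could
restrict its binding choices to exactly those cores.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t allowed;
    /* Ask the kernel which CPUs this process may use; inside a cpuset
       this reflects the job's allocation. */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed))
            printf("allowed core: %d\n", cpu);
    return 0;
}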

> In any case, regardless of what the best behavior is, I appreciate
> the point about changing behavior in the middle of a stable release.

Not a problem, and I take Jeff's point about 1.3 not being a
super stable release and thus not being a blocker to changes
such as this.

> Arguably, leaving significant performance on the table in typical
> situations is a bug that warrants fixing even in the middle of a
> release, but I won't try to settle that debate here.

I agree for those cases where there's no downside, and thinking
further on your point of balancing between sockets I can see why
that would limit the impact.

Most of the cases I can think of that would be most adversely
affected come down to other jobs binding to cores naively, and if
that's happening outside of cpusets then the cluster sysadmin
has more to worry about from mixing those applications than
from mixing with OMPI ones which are just binding to sockets. :-)

So I'll happily withdraw my objection on those grounds.

*But* I would like to test this code out on a cluster with
cpuset support enabled to see whether it will behave itself.

Basically if I run a 4 core MPI job on a dual socket system
which has been allocated only the cores on socket 0 what will
happen when it tries to bind to socket 1 which is outside its
cpuset ?

Is there a 1.3 branch or tarball with these patches applied
that I could test out ?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Ralph Castain


On Aug 17, 2009, at 5:59 PM, Chris Samuel wrote:



- "Jeff Squyres"  wrote:


An important point to raise here: the 1.3 series is *not* the super
stable series.  It is the *feature* series.  Specifically: it is not
out of scope to introduce or change features within the 1.3 series.


Ah, I think I've misunderstood the website then. :-(

It calls 1.3 stable and 1.2 old and I presumed old
meant deprecated. :-(


Old = I wouldn't use it, given the choice :-)




--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Chris Samuel

- "Jeff Squyres"  wrote:

> An important point to raise here: the 1.3 series is *not* the super  
> stable series.  It is the *feature* series.  Specifically: it is not 
> out of scope to introduce or change features within the 1.3 series.

Ah, I think I've misunderstood the website then. :-(

It calls 1.3 stable and 1.2 old and I presumed old
meant deprecated. :-(

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread N.M. Maclaren

On Aug 17 2009, Paul H. Hargrove wrote:


+ I wonder if one can do any "introspection" with the dynamic linker to 
detect hybrid OpenMP (no "I") apps and avoid pinning them by default 
(examining OMP_NUM_THREADS in the environment is no good, since that 
variable may have a site default value other than 1 or empty).  To me 
this is the most obvious class of application that will suffer from 
imposing pinning by default.


This is a bit off-thread, but my experience with tuning 'threading'
(mainly OpenMP) is that it makes tuning processes (e.g. MPI) look
trivial.  You need affinity even more than you do for processes,
but few operating systems provide a way of binding threads to cores.
You can try tweaking the POSIX scheduling parameters, but I failed
to find a system on which they were connected to anything.  All right,
this is all a little out of date now, but I'll bet it hasn't changed
much.

That being so, a reasonable test would be to check for ANY secondary
thread in the process and/or threading call, and to throw in the towel
at that point.  I don't know ELF, but the latter can be done in most
reasonably advanced linkers (by using weak externals).
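
A toy sketch of the dynamic-linker idea on ELF/Linux -- an assumption
about one possible approach, not a proposal for OMPI's actual
implementation: if the application pulled in an OpenMP runtime, a
well-known runtime symbol is already resolvable in the process image
(link with -ldl on older glibc).

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* omp_get_max_threads is provided by every OpenMP runtime; if it is
       visible in the running image, the app is (probably) hybrid. */
    if (dlsym(RTLD_DEFAULT, "omp_get_max_threads") != NULL)
        printf("OpenMP runtime present - probably skip binding by default\n");
    else
        printf("no OpenMP runtime visible\n");
    return 0;
}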

Despite their uncleanliness, some heuristics of this nature are probably
the only viable solution, for the reasons that Jeff described.  I stand
by my term "gratuitous hack"!


Regards,
Nick Maclaren.







Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Patrick Geoffray

Jeff,



Jeff Squyres wrote:
Marketing departments from other organizations / companies willfully 
ignored it whenever presenting competitive data.  The 1,000,000th time I 
saw this, I gave up arguing that our competitors were not being fair and 
simply changed our defaults to always leave memory pinned for 
OpenFabrics-based networks.


Instead, you should have told them that caching memory registration is 
unsafe and asked them why they don't care if their customers don't get the 
right answer. And then you could have followed up by asking if they actually 
have a way to check that there is no data corruption. It's not really 
FUD; it's tit for tat :-)


2. Even if you tag someone in public for not being fair, they always say 
the same thing, "Oh sorry, my mistake" (regardless of whether they 
actually forgot or did it intentionally).  I told several competitors 
*many times* that they had to use leave_pinned, but in all public 
comparison numbers, they never did.  Hence, they always looked better.


Looked better on what, micro-benchmarks ? The same micro-benchmarks that 
have already been manipulated to death, like OSU using a stream-based 
bandwidth test to hide the start-up overhead ? If the option improves 
real applications at large, then it should be on by default and there is 
no debate (users should never have to know about knobs). If it is only 
for micro-benchmarks, stand your ground and do the right thing. It does 
not do the community any good if MPI implementations are tuned for a 
broken micro-benchmarks penis contest. If you want to play that game, at 
least make your own micro-benchmarks.


Believe me, I know what it is to hear technical atrocities from these 
marketing idiots. There is nothing you can do; they are paid to talk 
and you are not. In the end, HPC gets what HPC deserves; people should 
do their homework.


For applications at large, performance gains due to core-binding are 
suspect. Memory-binding may have more spine, but the OS should already 
be able to do a good job with NUMA allocation and page migration.


- The Linux scheduler does not / cannot optimize well for many HPC apps; 
binding definitely helps in many scenarios (not just benchmarks).


Then fix the Linux scheduler. Only the OS scheduler can do a meaningful 
resource allocation, because it sees everything and you don't.




Patrick


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Paul H. Hargrove
Some more thoughts in this thread that I've not seen expressed yet 
(perhaps I missed them):


+ Some argue that this change in the middle of a stable series may, to 
some users, appear to be a performance regression when they update.  
However, I would argue that if the alternative is to delay this feature 
until the next stable release, it will STILL appear to those same users 
to be a performance regression when they upgrade.  If the choice is 
between sooner or later I would vote for sooner.


+ I wonder if one can do any "introspection" with the dynamic linker to 
detect hybrid OpenMP (no "I") apps and avoid pinning them by default 
(examining OMP_NUM_THREADS in the environment is no good, since that 
variable may have a site default value other than 1 or empty).  To me 
this is the most obvious class of application that will suffer from 
imposing pinning by default.


+ The question of round-robin-by-core vs round-robin-by-socket is not 
fundamentally any different from the question of how to map one's tasks 
to flat-SMP nodes (cyclic, block or block-cyclic; XYZT vs TXYZ, etc.)  
There is NO universal right answer, and for better or worse the end-user 
who wants to maximize performance is going to need to either understand 
how their comms interact with task layout, or they are going to try 
different options until they are happy.


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory 



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Ashley Pittman

Some very good points in this thread all round.

On Mon, 2009-08-17 at 09:00 -0400, Jeff Squyres wrote:
> 
> This is probably not too surprising (i.e., allowing the OS to move  
> jobs around between cores on a socket can probably involve a little  
> cache thrashing, resulting in that 5-10% loss).  I'm hand-waving
> here,  
> and I have not tried this myself, but it's not too surprising of a  
> result to me.  It's also not too surprising that others don't see
> this  
> effect at all (e.g., Sun didn't see any difference in bind-to-core
> vs.  
> bind-to-socket) when they ran their tests.  YMMV.
> 
> I'd actually be in favor of a by-core binding (not by-socket), but  
> spreading the processes out round robin by socket, not by core.  All  
> of this would be the *default* behavior, of course -- command line  
> params/MCA params will be provided to change to whatever pattern is  
> desired.

I'm in favour of by-core binding; if it's done correctly I've seen
results that tie in with Ralph's 5-10% figure.  If it's done incorrectly,
however, it can be atrocious; the kernel scheduler may not be perfect but
at least it's never bad.

One (small) point nobody has mentioned yet is that when using
round-robin core binding, some applications prefer you to round-robin
by socket and some prefer you to round-robin by core.  This will depend
on their level of comms and any cache-sharing benefits.

Perhaps this is the reason Ralph saw improvements but Sun didn't?

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Jeff Squyres

On Aug 17, 2009, at 3:23 PM, N.M. Maclaren wrote:


>Yes, BUT...  We had a similar option to this for a long, long time.

Sorry, perhaps I should have spelled out what I meant by "mandatory".
The system would not build (or run, depending on where it was set)
without such a value being specified.  There would be no default.




Gotcha.  I have another "but", though.  :-)

OMPI already has about a billion configurable parameters.  If we  
*force* people to do something more than "mpirun -np x  
my_favorite_benchmark", then they'll say stuff like "we couldn't even  
get Open MPI to run" (I've seen people say that about other MPI's --  
fortunately, I haven't heard that about Open MPI except where either  
OMPI legitimately had a bug or the user had something wrong in their  
setup).


We work in a very nasty, competitive community.  :-(

--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread N.M. Maclaren

On Aug 17 2009, Jeff Squyres wrote:


Yes, BUT...  We had a similar option to this for a long, long time.   


Sorry, perhaps I should have spelled out what I meant by "mandatory".
The system would not build (or run, depending on where it was set)
without such a value being specified.  There would be no default.


Regards,
Nick Maclaren.



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Jeff Squyres

On Aug 17, 2009, at 12:11 PM, N.M. Maclaren wrote:

1) To have a mandatory configuration option setting the default, which
would have a name like 'performance' for the binding option.  YOU could
then beat up anyone who benchmarkets without it for being biassed.  This
is a better solution, but the "I shouldn't need to have to think just
because I am doing something complicated" brigade would object.



Yes, BUT...  We had a similar option to this for a long, long time.   
Marketing departments from other organizations / companies willfully  
ignored it whenever presenting competitive data.  The 1,000,000th time  
I saw this, I gave up arguing that our competitors were not being fair  
and simply changed our defaults to always leave memory pinned for  
OpenFabrics-based networks.


To be clear: the option was "--mca mpi_leave_pinned 1" -- granted, the  
name wasn't as obvious as "--performance", but this option was widely  
publicized and it was easy to know that you should use it for benchmarks  
(with a name like --performance, the natural question would be "why don't  
you enable [--]performance by default?  This means that OMPI has  
--no-performance by default...?").  I would tell person/marketer X at a  
conference, "Hey, you didn't run with leave_pinned; our numbers are  
much better than that."  "Oh, sorry," they would inevitably say; "I'll  
fix it next time I make new slides."


There are several problems that arise from this scenario:

1. The competitors aren't interested in being fair.  Spin is  
everything.  HPC is highly competitive.


2. Even if you tag someone in public for not being fair, they always  
say the same thing, "Oh sorry, my mistake" (regardless of whether they  
actually forgot or did it intentionally).  I told several competitors  
*many times* that they had to use leave_pinned, but in all public  
comparison numbers, they never did.  Hence, they always looked better.


(/me takes a moment to calm down after venturing down memory lane of  
all the unfair comparisons made against OMPI... :-) )


3. To some degree, "out of the box performance" *is* a compelling  
reason.  Sure, I would hope that marketers and competitors would be  
ethical (they aren't, but you can hope anyway), but the naive / new  
user shouldn't need to know a million switches to get good performance.


Having good / simple switches to optimize for different workloads is a  
good thing (e.g., Platform MPI has some nice options for this kind of  
stuff).  But the bottom line is that you can't rely on someone running  
anything other than "mpirun -np x my_favorite_benchmark".


-

Also, as an aside to many of the other posts, yes, this is a complex  
issue.  But:


- We're only talking about defaults, not absolute behavior.  If you  
want or need to disable/change this behavior, you certainly can.


- It's been stated a few times, but I feel that this is important:  
most other MPI's bind by default.  They're deriving performance  
benefits from this.  We're not.  Open MPI has to be competitive (or my  
management will ask me, "Why are you working on that crappy MPI?").


- The Linux scheduler does not / cannot optimize well for many HPC apps;  
binding definitely helps in many scenarios (not just benchmarks).


- Of course you can construct scenarios where things break / perform  
badly.  Particularly if you do Wrong Things.  If you do Wrong Things,  
you should be punished (e.g., via bad performance).  It's not the  
software's fault if you choose to bind 10 threads to 1 core.  It's not  
the software's fault if you're on a large SMP and you choose to  
dedicate all of the processors to HPC apps and don't leave any for the  
OS (particularly if you have a lot of OS activity).  And so on.  Of  
course, we should do a good job of trying to do reasonable things by  
default (e.g., not binding 10 threads to one core by default), and we  
should provide options (sometimes automatic) for disabling those  
reasonable things if we can't do them well.  But sometimes we *do*  
have to rely on the user telling us things.


- I took Ralph's previous remarks as a general statement about  
threading being problematic to any form of binding.  I talked to him  
on the phone -- he actually had a specific case in mind (what I would  
consider Wrong Behavior: binding N threads to 1 core).


-

Ralph and I chatted earlier; I would be ok to wait for the other 2  
pieces of functionality to come in before we make binding occur by  
default:


1. coordinate between multiple OMPI jobs on the same node to ensure  
not to bind to the same cores (or at least print a warning)


2. follow the binding directives of resource managers (SLURM, Torque,  
etc.)


Sun is free to make binding the default in the ClusterTools  
distribution if/whenever they want, of course.  I fully understand  
their reasoning for doing so.  They're also in a better position to  
coach their users when to use which options, etc. because they have  
direct contact 

Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread N.M. Maclaren

On Aug 17 2009, Ralph Castain wrote:


At issue for us is that other MPIs -do- bind by default, thus creating an
apparent performance advantage for themselves compared to us on standard
benchmarks run "out-of-the-box". We repeatedly get beat-up in papers and
elsewhere over our performance, when many times the major difference is in
the default binding. If we bind the same way they do, then the performance
gap disappears or is minimal.


The two obvious gratuitous hacks that I can see to tackle this are:

   1) To have a mandatory configuration option setting the default, which
would have a name like 'performance' for the binding option.  YOU could then
beat up anyone who benchmarkets without it for being biassed.  This is a
better solution, but the "I shouldn't need to have to think just because I
am doing something complicated" brigade would object.

   2) To use a heuristic to choose which algorithm to select, based on
the core count, number of users, load averages, number of active non-root
processes and similar unreliable indicators of the purpose for which the
system is being used.  It should be chosen so that it doesn't behave TOO
badly when it gets it wrong, as it will, but that it gets the case of a
dedicated benchmarketing system right most of the time.
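
A toy C sketch of heuristic 2) -- purely illustrative, with arbitrary
thresholds and no claim that OMPI should do exactly this: use the core
count and the 1-minute load average to guess whether the node looks
like a dedicated box.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    double load;
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    /* One sample of the 1-minute load average. */
    if (getloadavg(&load, 1) < 1) {
        fprintf(stderr, "getloadavg failed\n");
        return 1;
    }
    /* Arbitrary threshold: a mostly idle node "looks" dedicated. */
    int looks_dedicated = load < 0.25 * (double)ncores;
    printf("%ld cores, load %.2f -> %s binding by default\n",
           ncores, load, looks_dedicated ? "enable" : "leave off");
    return 0;
}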


Regards,
Nick Maclaren.



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Ralph Castain
I don't disagree with your statements. However, I was addressing the
specific question of two OpenMPI programs conflicting on process placement,
not the overall question you are raising.

The issue of when/if to bind has been debated for a long time. I agree that
having more options (bind-to-socket, bind-to-core, etc) makes sense and that
the choice of a default is difficult, for all the reasons that have been
raised in this thread.

At issue for us is that other MPIs -do- bind by default, thus creating an
apparent performance advantage for themselves compared to us on standard
benchmarks run "out-of-the-box". We repeatedly get beat-up in papers and
elsewhere over our performance, when many times the major difference is in
the default binding. If we bind the same way they do, then the performance
gap disappears or is minimal.

So this is why we are wrestling with this issue. I'm not sure of the best
compromise here, but I think people have raised good points on all sides.
Unfortunately, there just isn't a perfect answer... :-/

Certainly, I have no clue what it would be! Not that smart :-)
Ralph


On Mon, Aug 17, 2009 at 9:12 AM, N.M. Maclaren  wrote:

> On Aug 17 2009, Ralph Castain wrote:
>
>  The problem is that the two mpiruns don't know about each other, and
>>  therefore the second mpirun doesn't know that another mpirun has  already
>> used socket 0.
>>
>> We hope to change that at some point in the future.
>>
>
> It won't help.  The problem is less likely to be that two jobs are running
> OpenMPI programs (that have been recently linked!), but that the other
> tasks
> are not OpenMPI at all.  I have mentioned daemons, kernel threads and so
> on,
> but think of shared-memory parallel programs (OpenMP etc.) and so on; a LOT
> of applications nowadays include some sort of threading.
>
> For the ordinary multi-user system, you don't want any form of binding. The
> scheduler is rickety enough as it is, without confusing it further. That
> may change as the consequences of serious levels of multiple cores force
> that area to be improved, but don't hold your breath. And I haven't a clue
> which of the many directions scheduler design will go!
>
> I agree that having an option, and having it easy to experiment with, is
> the
> right way to go.  What the default should be is very much less clear.
>
> Regards,
> Nick Maclaren.
>
>
>


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Kenneth Lloyd

In some of the experiments I've run and studied on exclusive binding to
specific cores, the performance metrics (which have yielded both excellent
gains as well as phases of reduced performance) have depended upon the
nature of the experiment being run (a task partitioning problem) and how the
experimental data was organized (a data partitioning problem).

This is especially true when one considers the context in which the
experiment was run - meaning what other experiments were scheduled either
concurrently or serially, the priorities of those experiments, and the
configuration of the cluster / MPI network at any given point in time.

The approach we used was Bayesian. In other words, performance prediction
was conditioned on patterns of structure and context from both forward and
inverse Bayesian cycles.

Ken Lloyd

> -Original Message-
> From: devel-boun...@open-mpi.org 
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
> Sent: Monday, August 17, 2009 7:01 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Heads up on new feature to 1.3.4
> 
> On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote:
> 
> > I think the problem here, Eugene, is that performance 
> benchmarks are 
> > far from the typical application. We have repeatedly seen this - 
> > optimizing for benchmarks frequently makes applications run less 
> > efficiently. So I concur with Chris on this one - let's not 
> go -too- 
> > benchmark happy and hurt the regular users.
> 
> FWIW, I've seen processor binding help real user codes, too.  
> Indeed, on a system where an MPI job has exclusive use of the 
> node, how does binding hurt you?
> 
> On nodes where multiple MPI jobs are running, if a resource 
> manager is being used, then we should be obeying what they 
> have specified for each job to use.  We need a bit more work 
> in that direction to make that work, but it's very do-able.
> 
> When resource managers are not used and multiple MPI jobs 
> share the same node, then OMPI has to coordinate amongst its 
> jobs to not oversubscribe cores (when possible).  As Ralph 
> indicated in a later mail, we still need some work in this area, too.
> 
> > Here at LANL, binding to-socket instead of to-core hurts 
> performance 
> > by ~5-10%, depending on the specific application. Of course, either 
> > binding method is superior to no binding at all...
> 
> This is probably not too surprising (i.e., allowing the OS to 
> move jobs around between cores on a socket can probably 
> involve a little cache thrashing, resulting in that 5-10% 
> loss).  I'm hand-waving here, and I have not tried this 
> myself, but it's not too surprising of a result to me.  It's 
> also not too surprising that others don't see this effect at 
> all (e.g., Sun didn't see any difference in bind-to-core vs.  
> bind-to-socket) when they ran their tests.  YMMV.
> 
> I'd actually be in favor of a by-core binding (not 
> by-socket), but spreading the processes out round robin by 
> socket, not by core.  All of this would be the *default* 
> behavior, of course -- command line params/MCA params will be 
> provided to change to whatever pattern is desired.
> 
> > UNLESS you have a threaded application, in which case -any- binding 
> > can be highly detrimental to performance.
> 
> I'm not quite sure I understand this statement.  Binding is 
> not inherently contrary to multi-threaded applications.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread N.M. Maclaren

On Aug 17 2009, Ralph Castain wrote:

The problem is that the two mpiruns don't know about each other, and  
therefore the second mpirun doesn't know that another mpirun has  
already used socket 0.


We hope to change that at some point in the future.


It won't help.  The problem is less likely to be that two jobs are running
OpenMPI programs (that have been recently linked!), but that the other tasks
are not OpenMPI at all.  I have mentioned daemons, kernel threads and so on,
but think of shared-memory parallel programs (OpenMP etc.) and so on; a LOT
of applications nowadays include some sort of threading.

For the ordinary multi-user system, you don't want any form of binding. The 
scheduler is rickety enough as it is, without confusing it further. That 
may change as the consequences of serious levels of multiple cores force 
that area to be improved, but don't hold your breath. And I haven't a clue 
which of the many directions scheduler design will go!


I agree that having an option, and having it easy to experiment with, is the
right way to go.  What the default should be is very much less clear.

Regards,
Nick Maclaren.




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Eugene Loh

Jeff Squyres wrote:


On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote:

UNLESS you have a threaded application, in which case -any- binding  
can be highly detrimental to performance.


I'm not quite sure I understand this statement.  Binding is not  
inherently contrary to multi-threaded applications.


I think the concern is that if a process binds to a particular core and 
all threads inherit the same binding, then all threads will bind to the 
same core, inhibiting multithreading speedups (at best).


If you bind to sockets rather than specific cores, even if multiple 
threads inherit the same binding, the contention will be less.
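
To illustrate the distinction, a small Linux sketch -- assumptions:
glibc, cores 0-3 make up socket 0, compile with -pthread; this is not
OMPI code.  Threads inherit the process affinity mask, so a socket-wide
mask leaves them four cores to spread over, whereas a single-core mask
would stack them all on one core.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    cpu_set_t mask;
    (void)arg;
    /* pid 0 = the calling thread; it inherited the process's mask. */
    sched_getaffinity(0, sizeof(mask), &mask);
    printf("thread may run on %d core(s)\n", CPU_COUNT(&mask));
    return NULL;
}

int main(void)
{
    cpu_set_t socket0;
    CPU_ZERO(&socket0);
    for (int c = 0; c < 4; ++c)        /* assume socket 0 = cores 0-3 */
        CPU_SET(c, &socket0);
    if (sched_setaffinity(0, sizeof(socket0), &socket0) != 0)
        perror("sched_setaffinity");

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}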


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread N.M. Maclaren

On Aug 17 2009, Jeff Squyres wrote:

On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote:

I think the problem here, Eugene, is that performance benchmarks are  
far from the typical application. We have repeatedly seen this -  
optimizing for benchmarks frequently makes applications run less  
efficiently. So I concur with Chris on this one - let's not go -too-  
benchmark happy and hurt the regular users.


FWIW, I've seen processor binding help real user codes, too.  Indeed,  
on a system where an MPI job has exclusive use of the node, how does  
binding hurt you?


Here is how, and I can assure you that it's not nice, not at all; it can
kill an application dead.  I have some experience with running large SMP
systems (Origin, SunFire F15K and POWER3/4 racks) and this area was a
nightmare.

Process A is bound, and is waiting briefly for a receive.  All of the
other cores are busy with the processes bound to them.  There is then some
action from another process, a daemon or a kernel thread that needs service
from the kernel.  So it starts a thread on process A's core.  Unfortunately,
this is a long-running thread (e.g. NFS) so, when the other processes
finish and A is the bottleneck, the whole job hangs until that kernel
thread finishes.

You can get a similar effect if process A is bound to a CPU which has an 
I/O device bound to it. When something else entirely starts hammering that 
device, even if it doesn't tie it up for long each time, bye-bye 
performance. This is typically a problem on multi-socket systems, of 
course, but could show up even on quite small ones.


For this reason, many schedulers ignore binding hints when they 'think' they
know better - and, no matter what the documentation says, hints are generally
all they are.  You can then get processes rotating round the processors,
exercising the inter-cache buses nicely.  In my experience, binding can
sometimes make that more likely rather than less, and the best solutions are
usually different.

Yes, I used binding, but it was hell to set up, and many people give up,
saying that it degrades performance.  I advise ordinary users to avoid it
like the plague, and use more reliable tuning techniques.

UNLESS you have a threaded application, in which case -any- binding  
can be highly detrimental to performance.


I'm not quite sure I understand this statement.  Binding is not  
inherently contrary to multi-threaded applications.


That is true.  But see above.

Another circumstance where that is true is when your application is an MPI
one but calls SMP-enabled libraries; this is getting increasingly
common.  Binding can stop those libraries using spare cores or otherwise
confuse them; God help you if they start to use a 4-core algorithm on one core!


Regards,
Nick Maclaren.





Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Jeff Squyres

On Aug 16, 2009, at 8:56 PM, George Bosilca wrote:


I tend to agree with Chris. Changing the behavior of 1.3 in the
middle of the stable release cycle will be very confusing for our
users.



An important point to raise here: the 1.3 series is *not* the super  
stable series.  It is the *feature* series.  Specifically: it is not  
out of scope to introduce or change features within the 1.3 series.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Jeff Squyres

On Aug 16, 2009, at 11:02 PM, Ralph Castain wrote:

I think the problem here, Eugene, is that performance benchmarks are  
far from the typical application. We have repeatedly seen this -  
optimizing for benchmarks frequently makes applications run less  
efficiently. So I concur with Chris on this one - let's not go -too-  
benchmark happy and hurt the regular users.


FWIW, I've seen processor binding help real user codes, too.  Indeed,  
on a system where an MPI job has exclusive use of the node, how does  
binding hurt you?


On nodes where multiple MPI jobs are running, if a resource manager is  
being used, then we should be obeying what they have specified for  
each job to use.  We need a bit more work in that direction to make  
that work, but it's very do-able.


When resource managers are not used and multiple MPI jobs share the  
same node, then OMPI has to coordinate amongst its jobs to not  
oversubscribe cores (when possible).  As Ralph indicated in a later  
mail, we still need some work in this area, too.


Here at LANL, binding to-socket instead of to-core hurts performance  
by ~5-10%, depending on the specific application. Of course, either  
binding method is superior to no binding at all...


This is probably not too surprising (i.e., allowing the OS to move  
jobs around between cores on a socket can probably involve a little  
cache thrashing, resulting in that 5-10% loss).  I'm hand-waving here,  
and I have not tried this myself, but it's not too surprising of a  
result to me.  It's also not too surprising that others don't see this  
effect at all (e.g., Sun didn't see any difference in bind-to-core vs.  
bind-to-socket) when they ran their tests.  YMMV.


I'd actually be in favor of a by-core binding (not by-socket), but  
spreading the processes out round robin by socket, not by core.  All  
of this would be the *default* behavior, of course -- command line  
params/MCA params will be provided to change to whatever pattern is  
desired.


UNLESS you have a threaded application, in which case -any- binding  
can be highly detrimental to performance.


I'm not quite sure I understand this statement.  Binding is not  
inherently contrary to multi-threaded applications.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Ralph Castain
The problem is that the two mpiruns don't know about each other, and  
therefore the second mpirun doesn't know that another mpirun has  
already used socket 0.


We hope to change that at some point in the future.

Ralph


On Aug 17, 2009, at 4:02 AM, Lenny Verkhovsky wrote:

In the multi-job environment, can't we just start binding processes  
on the first available and unused socket?

I mean the first job/user will start binding itself from socket 0,
the next job/user will start binding itself from socket 2, for  
instance.

Lenny.

On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain   
wrote:


On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote:


Chris Samuel wrote:


- "Eugene Loh"  wrote:


This is an important discussion.


Indeed! My big fear is that people won't pick up the significance
of the change and will complain about performance regressions
in the middle of an OMPI stable release cycle.

2) The proposed OMPI bind-to-socket default is less severe. In the
general case, it would allow multiple jobs to bind in the same way
without oversubscribing any core or socket. (This comment added to
the trac ticket.)


That's a nice clarification, thanks. I suspect though that the
same issue we have with MVAPICH would occur if two 4 core jobs
both bound themselves to the first socket.

Okay, so let me point out a second distinction from MVAPICH:  the  
default policy would be to spread out over sockets.


Let's say you have two sockets, with four cores each.  Let's say  
you submit two four-core jobs.  The first job would put two  
processes on the first socket and two processes on the second.  The  
second job would do the same.  The loading would be even.


I'm not saying there couldn't be problems.  It's just that MVAPICH2  
(at least what I looked at) has multiple shortfalls.  The binding  
is to fill up one socket after another (which decreases memory  
bandwidth per process and increases chances of collisions with  
other jobs) and binding is to core (increasing chances of  
oversubscribing cores).  The proposed OMPI behavior distributes  
over sockets (improving memory bandwidth per process and reducing  
collisions with other jobs) and binding is to sockets (reducing  
chances of oversubscribing cores, whether due to other MPI jobs or  
due to multithreaded processes).  So, the proposed OMPI behavior  
mitigates the problems.


It would be even better to have binding selections adapt to other  
bindings on the system.


In any case, regardless of what the best behavior is, I appreciate  
the point about changing behavior in the middle of a stable  
release.  Arguably, leaving significant performance on the table in  
typical situations is a bug that warrants fixing even in the middle  
of a release, but I won't try to settle that debate here.


I think the problem here, Eugene, is that performance benchmarks are  
far from the typical application. We have repeatedly seen this -  
optimizing for benchmarks frequently makes applications run less  
efficiently. So I concur with Chris on this one - let's not go -too-  
benchmark happy and hurt the regular users.


Here at LANL, binding to-socket instead of to-core hurts performance  
by ~5-10%, depending on the specific application. Of course, either  
binding method is superior to no binding at all...


UNLESS you have a threaded application, in which case -any- binding  
can be highly detrimental to performance.


So going slow on this makes sense. If we provide the capability, but  
leave it off by default, then people can test it against real  
applications and see the impact. Then we can better assess the right  
default settings.


Ralph



3) Defaults (if I understand correctly) can be set differently
on each cluster.


Yes, but the defaults should be sensible for the majority of
clusters.  If the majority do indeed share nodes between jobs
then I would suggest that the default should be off and the
minority who don't share nodes should have to enable it.


In debates on this subject, I've heard people argue that:

*) Though nodes are getting fatter, most are still thin.

*) Resource managers tend to space share the cluster.




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Lenny Verkhovsky
In the multi-job environment, can't we just start binding processes on the
first available and unused socket?
I mean the first job/user will start binding itself from socket 0,
the next job/user will start binding itself from socket 2, for instance.
Lenny.

On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain  wrote:

>
> On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote:
>
>  Chris Samuel wrote:
>
> - "Eugene Loh"   wrote:
>
>
>  This is an important discussion.
>
>
>  Indeed! My big fear is that people won't pick up the significance
> of the change and will complain about performance regressions
> in the middle of an OMPI stable release cycle.
>
>  2) The proposed OMPI bind-to-socket default is less severe. In the
> general case, it would allow multiple jobs to bind in the same way
> without oversubscribing any core or socket. (This comment added to
> the trac ticket.)
>
>
>  That's a nice clarification, thanks. I suspect though that the
> same issue we have with MVAPICH would occur if two 4 core jobs
> both bound themselves to the first socket.
>
>
>  Okay, so let me point out a second distinction from MVAPICH:  the default
> policy would be to spread out over sockets.
>
> Let's say you have two sockets, with four cores each.  Let's say you submit
> two four-core jobs.  The first job would put two processes on the first
> socket and two processes on the second.  The second job would do the same.
> The loading would be even.
>
> I'm not saying there couldn't be problems.  It's just that MVAPICH2 (at
> least what I looked at) has multiple shortfalls.  The binding is to fill up
> one socket after another (which decreases memory bandwidth per process and
> increases chances of collisions with other jobs) and binding is to core
> (increasing chances of oversubscribing cores).  The proposed OMPI behavior
> distributes over sockets (improving memory bandwidth per process and
> reducing collisions with other jobs) and binding is to sockets (reducing
> chances of oversubscribing cores, whether due to other MPI jobs or due to
> multithreaded processes).  So, the proposed OMPI behavior mitigates the
> problems.
>
> It would be even better to have binding selections adapt to other bindings
> on the system.
>
> In any case, regardless of what the best behavior is, I appreciate the
> point about changing behavior in the middle of a stable release.  Arguably,
> leaving significant performance on the table in typical situations is a bug
> that warrants fixing even in the middle of a release, but I won't try to
> settle that debate here.
>
>
> I think the problem here, Eugene, is that performance benchmarks are far
> from the typical application. We have repeatedly seen this - optimizing for
> benchmarks frequently makes applications run less efficiently. So I concur
> with Chris on this one - let's not go -too- benchmark happy and hurt the
> regular users.
>
> Here at LANL, binding to-socket instead of to-core hurts performance by
> ~5-10%, depending on the specific application. Of course, either binding
> method is superior to no binding at all...
>
> UNLESS you have a threaded application, in which case -any- binding can be
> highly detrimental to performance.
>
> So going slow on this makes sense. If we provide the capability, but leave
> it off by default, then people can test it against real applications and see
> the impact. Then we can better assess the right default settings.
>
> Ralph
>
>
>  3) Defaults (if I understand correctly) can be set differently
> on each cluster.
>
>
>  Yes, but the defaults should be sensible for the majority of
> clusters.  If the majority do indeed share nodes between jobs
> then I would suggest that the default should be off and the
> minority who don't share nodes should have to enable it.
>
>
>  In debates on this subject, I've heard people argue that:
>
> *) Though nodes are getting fatter, most are still thin.
>
> *) Resource managers tend to space share the cluster.


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Ralph Castain


On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote:


Chris Samuel wrote:


- "Eugene Loh"  wrote:


This is an important discussion.


Indeed! My big fear is that people won't pick up the significance
of the change and will complain about performance regressions
in the middle of an OMPI stable release cycle.

2) The proposed OMPI bind-to-socket default is less severe. In the
general case, it would allow multiple jobs to bind in the same way
without oversubscribing any core or socket. (This comment added to
the trac ticket.)


That's a nice clarification, thanks. I suspect though that the
same issue we have with MVAPICH would occur if two 4 core jobs
both bound themselves to the first socket.

Okay, so let me point out a second distinction from MVAPICH:  the  
default policy would be to spread out over sockets.


Let's say you have two sockets, with four cores each.  Let's say you  
submit two four-core jobs.  The first job would put two processes on  
the first socket and two processes on the second.  The second job  
would do the same.  The loading would be even.


I'm not saying there couldn't be problems.  It's just that MVAPICH2  
(at least what I looked at) has multiple shortfalls.  The binding is  
to fill up one socket after another (which decreases memory  
bandwidth per process and increases chances of collisions with other  
jobs) and binding is to core (increasing chances of oversubscribing  
cores).  The proposed OMPI behavior distributes over sockets  
(improving memory bandwidth per process and reducing collisions with  
other jobs) and binding is to sockets (reducing chances of  
oversubscribing cores, whether due to other MPI jobs or due to  
multithreaded processes).  So, the proposed OMPI behavior mitigates  
the problems.


It would be even better to have binding selections adapt to other  
bindings on the system.


In any case, regardless of what the best behavior is, I appreciate  
the point about changing behavior in the middle of a stable  
release.  Arguably, leaving significant performance on the table in  
typical situations is a bug that warrants fixing even in the middle  
of a release, but I won't try to settle that debate here.


I think the problem here, Eugene, is that performance benchmarks are  
far from the typical application. We have repeatedly seen this -  
optimizing for benchmarks frequently makes applications run less  
efficiently. So I concur with Chris on this one - let's not go -too-  
benchmark happy and hurt the regular users.


Here at LANL, binding to-socket instead of to-core hurts performance  
by ~5-10%, depending on the specific application. Of course, either  
binding method is superior to no binding at all...


UNLESS you have a threaded application, in which case -any- binding  
can be highly detrimental to performance.


So going slow on this makes sense. If we provide the capability, but  
leave it off by default, then people can test it against real  
applications and see the impact. Then we can better assess the right  
default settings.


Ralph



3) Defaults (if I understand correctly) can be set differently
on each cluster.


Yes, but the defaults should be sensible for the majority of
clusters.  If the majority do indeed share nodes between jobs
then I would suggest that the default should be off and the
minority who don't share nodes should have to enable it.


In debates on this subject, I've heard people argue that:

*) Though nodes are getting fatter, most are still thin.

*) Resource managers tend to space share the cluster.




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-16 Thread Eugene Loh




Chris Samuel wrote:

  - "Eugene Loh"  wrote:
  
  
This is an important discussion.

  
  Indeed! My big fear is that people won't pick up the significance
of the change and will complain about performance regressions
in the middle of an OMPI stable release cycle.
  
2) The proposed OMPI bind-to-socket default is less severe. In the
general case, it would allow multiple jobs to bind in the same way
without oversubscribing any core or socket. (This comment added to
the trac ticket.)

  
  That's a nice clarification, thanks. I suspect though that the
same issue we have with MVAPICH would occur if two 4 core jobs
both bound themselves to the first socket.
  

Okay, so let me point out a second distinction from MVAPICH:  the
default policy would be to spread out over sockets.

Let's say you have two sockets, with four cores each.  Let's say you
submit two four-core jobs.  The first job would put two processes on
the first socket and two processes on the second.  The second job would
do the same.  The loading would be even.
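To make that concrete, here is a throwaway sketch of the arithmetic (my
own simplification: 2 sockets x 4 cores, with cores 0-3 on socket 0 and
cores 4-7 on socket 1, ranks dealt out round-robin over sockets; this is
just an illustration of the layout, not OMPI code):

/* Illustration only: where the ranks of two 4-process jobs land under a
 * round-robin "bysocket" policy with socket-level binding, assuming
 * 2 sockets x 4 cores (cores 0-3 on socket 0, cores 4-7 on socket 1). */
#include <stdio.h>

#define NSOCKETS          2
#define CORES_PER_SOCKET  4

int main(void)
{
    for (int job = 0; job < 2; job++) {
        for (int rank = 0; rank < 4; rank++) {
            int socket = rank % NSOCKETS;   /* spread over sockets */
            unsigned mask = ((1u << CORES_PER_SOCKET) - 1)
                            << (socket * CORES_PER_SOCKET);
            printf("job %d rank %d -> socket %d, core mask 0x%02x\n",
                   job, rank, socket, mask);
        }
    }
    return 0;
}

Each job ends up with two ranks on mask 0x0f and two on 0xf0, so the two
jobs overlay each other evenly rather than piling onto socket 0.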

I'm not saying there couldn't be problems.  It's just that MVAPICH2 (at
least what I looked at) has multiple shortfalls.  The binding is to
fill up one socket after another (which decreases memory bandwidth per
process and increases chances of collisions with other jobs) and
binding is to core (increasing chances of oversubscribing cores).  The
proposed OMPI behavior distributes over sockets (improving memory
bandwidth per process and reducing collisions with other jobs) and
binding is to sockets (reducing chances of oversubscribing cores,
whether due to other MPI jobs or due to multithreaded processes).  So,
the proposed OMPI behavior mitigates the problems.

It would be even better to have binding selections adapt to other
bindings on the system.

In any case, regardless of what the best behavior is, I appreciate the
point about changing behavior in the middle of a stable release. 
Arguably, leaving significant performance on the table in typical
situations is a bug that warrants fixing even in the middle of a
release, but I won't try to settle that debate here.

  
3) Defaults (if I understand correctly) can be set differently
on each cluster.

  
  Yes, but the defaults should be sensible for the majority of
clusters.  If the majority do indeed share nodes between jobs
then I would suggest that the default should be off and the
minority who don't share nodes should have to enable it.
  

In debates on this subject, I've heard people argue that:

*) Though nodes are getting fatter, most are still thin.

*) Resource managers tend to space share the cluster.




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-16 Thread Chris Samuel

- "Eugene Loh"  wrote:

> This is an important discussion.

Indeed! My big fear is that people won't pick up the significance
of the change and will complain about performance regressions
in the middle of an OMPI stable release cycle.

> Do note:
> 
> 1) Bind-to-core is actually the default behavior of many MPIs today.

We had this issue with MVAPICH before we dumped it to go to OpenMPI:
if we had (for example) two 4 core jobs running on the same node
they'd both go at half speed whilst the node itself was 50% idle.

Turned out they'd both bound to cores 0-3 leaving cores 4-7 unused. :-(
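For anyone wanting to check for that kind of overlap themselves,
something as small as this (plain glibc, nothing MVAPICH-specific; a
sketch, not necessarily how we spotted it) prints what each process is
actually allowed to run on:

/* Print the cores this process is currently allowed to run on; run it
 * under each job to see whether two jobs are pinned to the same cores. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("pid %d allowed on cores:", (int)getpid());
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &mask))
            printf(" %d", c);
    printf("\n");
    return 0;
}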

Fortunately there was an undocumented environment variable
that let us turn it off for all jobs, but getting rid of that
misbehaviour was a major reason for switching to OpenMPI.

> 2) The proposed OMPI bind-to-socket default is less severe. In the
> general case, it would allow multiple jobs to bind in the same way
> without oversubscribing any core or socket. (This comment added to
> the trac ticket.)

That's a nice clarification, thanks. I suspect though that the
same issue we have with MVAPICH would occur if two 4 core jobs
both bound themselves to the first socket.

Thinking further, it would be interesting to find out how this
code would behave on a system where cpusets are in use and so OMPI
has to submit to the will of the scheduler regarding cores/sockets.
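One way a binding layer could cope with that (a sketch under my own
assumptions, not a description of what OMPI does or plans to do) is to
intersect the wanted mask with whatever the cpuset already allows and
only bind within that:

/* Sketch: bind to 'wanted', but never outside the set the scheduler's
 * cpuset already allows.  Illustrative only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int bind_within_cpuset(const cpu_set_t *wanted)
{
    cpu_set_t allowed, target;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    CPU_ZERO(&target);
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, wanted) && CPU_ISSET(c, &allowed))
            CPU_SET(c, &target);

    if (CPU_COUNT(&target) == 0)   /* nothing usable: leave binding alone */
        return 0;

    return sched_setaffinity(0, sizeof(target), &target);
}

int main(void)
{
    cpu_set_t wanted;

    CPU_ZERO(&wanted);
    CPU_SET(4, &wanted);           /* e.g. ask for cores 4-7 (socket 1) */
    CPU_SET(5, &wanted);
    CPU_SET(6, &wanted);
    CPU_SET(7, &wanted);

    if (bind_within_cpuset(&wanted) != 0)
        perror("binding failed");
    return 0;
}

If the intersection comes up empty, quietly leaving the existing binding
alone seems friendlier than failing the job.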

> 3) Defaults (if I understand correctly) can be set differently
> on each cluster.

Yes, but the defaults should be sensible for the majority of
clusters.  If the majority do indeed share nodes between jobs
then I would suggest that the default should be off and the
minority who don't share nodes should have to enable it.

There's also the issue of those users who (for whatever reason)
like to build their own MPI stack and who are even less likely
to understand the impact that they may have on others.. :-(

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-16 Thread George Bosilca
I tend to agree with Chris. Changing the behavior of the 1.3 series in
the middle of the stable release cycle will be very confusing for our
users. Moreover, as Ralph pointed out, everything in Open MPI is
configurable, so if we advertise this feature in the Changelog, the
institutions where the nodes are not shared can easily amend their
configuration files to take advantage of it. In particular, for Sun,
if we push this feature in the 1.3.4 release, they can ship their
version (derived from 1.3.4) with the MCA parameter set to
bind-to-whatever.


We can bring this topic in the spotlight for the next cycle (1.4/1.5).

  george.

On Aug 16, 2009, at 20:42 , Chris Samuel wrote:



- "Ralph Castain"  wrote:


Hi Chris


Hiya Ralph,


There would be a "-do-not-bind" option that will prevent us from
binding processes to anything which should cover that situation.


Gotcha.


My point was only that we would be changing the out-of-the-box
behavior to the opposite of today's, so all those such as yourself
would now have to add the -do-not-bind MCA param to your default MCA
param file.

Doable - but it -is- a significant change in our out-of-the-box
behavior.


I think this is too big a change in the default behaviour
for a stable release; it'll cause a lot of people pain for
no readily apparent reason.

I also believe that if those sites with multiple MPI jobs
on nodes are indeed in the majority then it makes more sense
to keep the default behaviour and have those who need this
functionality enable it on their installs.

Thoughts ?

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency




Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-16 Thread Chris Samuel

- "Ralph Castain"  wrote:

> Hi Chris

Hiya Ralph,

> There would be a "-do-not-bind" option that will prevent us from
> binding processes to anything which should cover that situation.

Gotcha.

> My point was only that we would be changing the out-of-the-box
> behavior to the opposite of today's, so all those such as yourself
> would now have to add the -do-not-bind MCA param to your default MCA
> param file.
>
> Doable - but it -is- a significant change in our out-of-the-box
> behavior.

I think this is too big a change in the default behaviour
for a stable release; it'll cause a lot of people pain for
no readily apparent reason.

I also believe that if those sites with multiple MPI jobs
on nodes are indeed in the majority then it makes more sense
to keep the default behaviour and have those who need this
functionality enable it on their installs.

Thoughts ?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-16 Thread Eugene Loh




This is an important discussion.  Do note:

1) Bind-to-core is actually the default behavior of many MPIs today.

2) The proposed OMPI bind-to-socket default is less severe.  In the
general case, it would allow multiple jobs to bind in the same way
without oversubscribing any core or socket.  (This comment added to the
trac ticket.)

3) Defaults (if I understand correctly) can be set differently on each
cluster.

Ralph Castain wrote:
There would be a "-do-not-bind" option that will prevent
us from binding processes to anything which should cover that situation.
  
My point was only that we would be changing the out-of-the-box behavior
to the opposite of today's, so all those such as yourself would now
have to add the -do-not-bind MCA param to your default MCA param file.
  
Doable - but it -is- a significant change in our out-of-the-box
behavior.
  
On Sun, Aug 16, 2009 at 2:14 AM, Chris Samuel wrote:
- "Terry Dontje"  wrote:

> I just wanted to give everyone a heads up if they do not get bugs
> email.  I just submitted a CMR to move over some new paffinity options
> from the trunk to the v1.3 branch.
https://svn.open-mpi.org/trac/ompi/ticket/1997

Ralph's comments imply that for those sites that share nodes
between jobs (such as ourselves, and most other sites that
I'm aware of in Australia) these changes will severely impact
performance.

I think that would be a Very Bad Thing(tm).

Can it be something that defaults to being configured out
for at least 1.3 please ?  That way those few sites that
can take advantage can enable it whilst the rest of us
aren't impacted.
  





Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-16 Thread Ralph Castain
Hi Chris

There would be a "-do-not-bind" option that will prevent us from binding
processes to anything which should cover that situation.

My point was only that we would be changing the out-of-the-box behavior to
the opposite of today's, so all those such as yourself would now have to add
the -do-not-bind MCA param to your default MCA param file.

Doable - but it -is- a significant change in our out-of-the-box behavior.
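For sites that do want today's behavior back, the change itself should
be a one-liner, e.g. on the command line (assuming the option lands
under the name above; the exact spelling may still change before 1.3.4,
and -np 8 / ./a.out are just placeholders):

$ mpiexec -do-not-bind -np 8 ./a.out

or the equivalent MCA parameter set once in the system-wide
openmpi-mca-params.conf (or a user's ~/.openmpi/mca-params.conf) so
nobody has to remember it per job.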

On Sun, Aug 16, 2009 at 2:14 AM, Chris Samuel  wrote:

>
> - "Terry Dontje"  wrote:
>
> > I just wanted to give everyone a heads up if they do not get bugs
> > email.  I just submitted a CMR to move over some new paffinity options
> > from the trunk to the v1.3 branch.
>
> Ralph's comments imply that for those sites that share nodes
> between jobs (such as ourselves, and most other sites that
> I'm aware of in Australia) these changes will severely impact
> performance.
>
> I think that would be a Very Bad Thing(tm).
>
> Can it be something that defaults to being configured out
> for at least 1.3 please ?  That way those few sites that
> can take advantage can enable it whilst the rest of us
> aren't impacted.
>
> cheers,
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
>  The Victorian Partnership for Advanced Computing
>  P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-16 Thread Chris Samuel

- "Terry Dontje"  wrote:

> I just wanted to give everyone a heads up if they do not get bugs 
> email.  I just submitted a CMR to move over some new paffinity options
> from the trunk to the v1.3 branch.

Ralph's comments imply that for those sites that share nodes
between jobs (such as ourselves, and most other sites that
I'm aware of in Australia) these changes will severely impact
performance.

I think that would be a Very Bad Thing(tm).

Can it be something that defaults to being configured out
for at least 1.3 please ?  That way those few sites that
can take advantage can enable it whilst the rest of us
aren't impacted.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency