On Aug 17, 2009, at 12:11 PM, N.M. Maclaren wrote:
1) To have a mandatory configuration option setting the default, which would
have a name like 'performance' for the binding option. You could then beat up
anyone who benchmarkets without it for being biassed. This is a better
solution, but the "I shouldn't need to have to think just because I am doing
something complicated" brigade would object.
Yes, BUT... We had a similar option to this for a long, long time.
Marketing departments from other organizations / companies willfully
ignored it whenever presenting competitive data. The 1,000,000th time
I saw this, I gave up arguing that our competitors were not being fair
and simply changed our defaults to always leave memory pinned for
OpenFabrics-based networks.
To be clear: the option was "--mca mpi_leave_pinned 1" -- granted, the name
wasn't as obvious as "--performance", but this option was widely publicized
and it was easy to know that you should use it for benchmarks (with a name
like --performance, the natural question would be "why don't you enable
--performance by default? Does this mean that OMPI has --no-performance by
default...?"). I would tell person/marketer X at a conference, "Hey, you
didn't run with leave_pinned; our numbers are much better than that."
"Oh, sorry," they would inevitably say; "I'll fix it next time I make new
slides."
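For reference, enabling it was just a matter of adding that MCA parameter to
the mpirun command line -- or, if memory serves, setting the equivalent
OMPI_MCA_ environment variable (the benchmark name below is just a
placeholder):

    # enable leave-pinned behavior (e.g., for OpenFabrics-based networks)
    mpirun --mca mpi_leave_pinned 1 -np 4 ./my_favorite_benchmark

    # same thing, via the environment
    export OMPI_MCA_mpi_leave_pinned=1
    mpirun -np 4 ./my_favorite_benchmark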
There are several problems that arise from this scenario:
1. The competitors aren't interested in being fair. Spin is
everything. HPC is highly competitive.
2. Even if you tag someone in public for not being fair, they always
say the same thing, "Oh sorry, my mistake" (regardless of whether they
actually forgot or did it intentionally). I told several competitors
*many times* that they had to use leave_pinned, but in all public
comparison numbers, they never did. Hence, they always looked better.
(/me takes a moment to calm down after venturing down memory lane of
all the unfair comparisons made against OMPI... :-) )
3. To some degree, "out of the box performance" *is* a compelling reason.
Sure, I would hope that marketers and competitors would be ethical (they
aren't, but you can hope anyway), but the naive / new user shouldn't need to
know a million switches to get good performance. Having good / simple
switches to optimize for different workloads is a good thing (e.g., Platform
MPI has some nice options for this kind of stuff). But the bottom line is
that you can't rely on someone running anything other than
"mpirun -np x my_favorite_benchmark".
-----
Also, as an aside to many of the other posts, yes, this is a complex
issue. But:
- We're only talking about defaults, not absolute behavior. If you
want or need to disable/change this behavior, you certainly can.
- It's been stated a few times, but I feel that this is important: most
other MPIs bind by default. They're deriving performance benefits from
this. We're not. Open MPI has to be competitive (or my management will ask
me, "Why are you working on that crappy MPI?").
- The Linux scheduler does not / cannot optimize well for many HPC apps;
binding definitely helps in many scenarios, not just benchmarks (see the
example after this list).
- Of course you can construct scenarios where things break / perform
badly. Particularly if you do Wrong Things. If you do Wrong Things,
you should be punished (e.g., via bad performance). It's not the
software's fault if you choose to bind 10 threads to 1 core. It's not
the software's fault if you're on a large SMP and you choose to
dedicate all of the processors to HPC apps and don't leave any for the
OS (particularly if you have a lot of OS activity). And so on. Of
course, we should do a good job of trying to do reasonable things by
default (e.g., not binding 10 threads to one core by default), and we
should provide options (sometimes automatic) for disabling those
reasonable things if we can't do them well. But sometimes we *do*
have to rely on the user telling us things.
- I took Ralph's previous remarks as a general statement about
threading being problematic to any form of binding. I talked to him
on the phone -- he actually had a specific case in mind (what I would
consider Wrong Behavior: binding N threads to 1 core).
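To illustrate the explicit binding I mentioned in the Linux scheduler bullet
above (assuming the MCA parameter name I remember is still current; the
application name is just a placeholder):

    # mpi_paffinity_alone=1 tells Open MPI that this job has its node(s) to
    # itself, so it binds each MPI process to its own processor
    mpirun --mca mpi_paffinity_alone 1 -np 8 ./my_hpc_app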
-----
Ralph and I chatted earlier; I would be OK with waiting for the other two
pieces of functionality to come in before we make binding occur by default:
1. Coordinate between multiple OMPI jobs on the same node to ensure that
they don't bind to the same cores (or at least print a warning).
2. Follow the binding directives of resource managers (SLURM, Torque, etc.).
Sun is free to enable binding-by-default in their ClusterTools distribution
if/whenever they want, of course. I fully understand their reasoning for
doing so. They're also in a better position to coach their users on when to
use which options, etc., because they have direct contact with their users
(vs. the community Open MPI, where hundreds of people download Open MPI a
day and we never hear from them). I *believe* that this approach is also OK
with Sun (I'm pretty sure Terry told me this last week), but I don't want to
speak for them.
My $0.02.
--
Jeff Squyres
jsquy...@cisco.com