On Aug 17, 2009, at 12:11 PM, N.M. Maclaren wrote:

1) To have a mandatory configuration option setting the default, which would have a name like 'performance' for the binding option. YOU could then beat up anyone who benchmarks without it for being biased. This is a better solution, but the "I shouldn't have to think just because I am doing something complicated" brigade would object.


Yes, BUT... We had a similar option to this for a long, long time. Marketing departments from other organizations / companies willfully ignored it whenever presenting competitive data. The 1,000,000th time I saw this, I gave up arguing that our competitors were not being fair and simply changed our defaults to always leave memory pinned for OpenFabrics-based networks.

To be clear: the option was "--mca mpi_leave_pinned 1" -- granted, the name wasn't as obvious as "--performance", but the option was widely publicized and it was well known that you should use it for benchmarks (with a name like --performance, the natural question becomes "why don't you enable --performance by default? Does that mean OMPI runs with --no-performance by default...?"). I would tell person/marketer X at a conference, "Hey, you didn't run with leave_pinned; our numbers are much better than that." "Oh, sorry," they would inevitably say; "I'll fix it next time I make new slides."
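
For concreteness, the difference between the "out of the box" run and the run we kept asking benchmarkers to use was a single MCA parameter. A rough sketch (the hostnames and the osu_latency benchmark are purely illustrative):

  # what most people ran (and what ended up on competitors' slides)
  mpirun -np 2 --host node1,node2 ./osu_latency

  # what we asked people to run on OpenFabrics-based networks
  mpirun -np 2 --host node1,node2 --mca mpi_leave_pinned 1 ./osu_latency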

There are several problems that arise from this scenario:

1. The competitors aren't interested in being fair. Spin is everything. HPC is highly competitive.

2. Even if you tag someone in public for not being fair, they always say the same thing, "Oh sorry, my mistake" (regardless of whether they actually forgot or did it intentionally). I told several competitors *many times* that they had to use leave_pinned, but in all public comparison numbers, they never did. Hence, they always looked better.

(/me takes a moment to calm down after venturing down memory lane of all the unfair comparisons made against OMPI... :-) )

3. To some degree, "out of the box performance" *is* a compelling reason. Sure, I would hope that marketers and competitors would be ethical (they aren't, but you can hope anyway), but the naive / new user shouldn't need to know a million switches to get good performance.

Having good / simple switches to optimize for different workloads is a good thing (e.g., Platform MPI has some nice options for this kind of stuff). But the bottom line is that you can't rely on someone running anything other than "mpirun -np x my_favorite_benchmark".

-----

Also, as an aside to many of the other posts, yes, this is a complex issue. But:

- We're only talking about defaults, not absolute behavior. If you want or need to disable or change this behavior, you certainly can (a quick sketch of what that looks like is after this list).

- It's been stated a few times, but I feel that this is important: most other MPI implementations bind by default. They're deriving performance benefits from this. We're not. Open MPI has to be competitive (or my management will ask me, "Why are you working on that crappy MPI?").

- The Linux scheduler does not / cannot optimize well for many HPC apps; binding definitely helps in many scenarios (not just benchmarks).

- Of course you can construct scenarios where things break / perform badly. Particularly if you do Wrong Things. If you do Wrong Things, you should be punished (e.g., via bad performance). It's not the software's fault if you choose to bind 10 threads to 1 core. It's not the software's fault if you're on a large SMP and you choose to dedicate all of the processors to HPC apps and don't leave any for the OS (particularly if you have a lot of OS activity). And so on. Of course, we should do a good job of trying to do reasonable things by default (e.g., not binding 10 threads to one core by default), and we should provide options (sometimes automatic) for disabling those reasonable things if we can't do them well. But sometimes we *do* have to rely on the user telling us things.

- I took Ralph's previous remarks as a general statement about threading being problematic for any form of binding. I talked to him on the phone -- he actually had a specific case in mind (what I would consider Wrong Behavior: binding N threads to 1 core).
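
To make the "defaults, not absolute behavior" point concrete, here is a rough sketch. Whatever the eventual option is named, toggling a binding default should look just like today's processor-affinity knob, the mpi_paffinity_alone MCA parameter ("-np 4" and "./my_app" are placeholders):

  # ask OMPI to bind each process to a processor
  mpirun --mca mpi_paffinity_alone 1 -np 4 ./my_app

  # leave scheduling entirely to the OS
  mpirun --mca mpi_paffinity_alone 0 -np 4 ./my_app

A default only changes which of these you get when you don't say anything on the command line.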

-----

Ralph and I chatted earlier; I would be OK with waiting for the other two pieces of functionality to come in before we make binding occur by default:

1. coordinate between multiple OMPI jobs on the same node to ensure that they don't bind to the same cores (or at least print a warning)

2. follow the binding directives of resource managers (SLURM, Torque, etc.)
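
For item 2, the idea is roughly this (a sketch only; the exact option spellings vary by resource manager and version): if the scheduler has already applied a CPU binding to the job, OMPI should follow it rather than impose its own. For example, under SLURM:

  # SLURM applies its own core binding to the tasks it launches
  srun --ntasks=4 --cpu_bind=cores ./my_app

  # an OMPI job started under that allocation should respect the cores
  # SLURM handed it instead of picking cores itself

("./my_app" is again just a placeholder.)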

Sun is free to make binding the default in the ClusterTools distribution if/whenever they want, of course. I fully understand their reasoning for doing so. They're also in a better position to coach their users on when to use which options, etc., because they have direct contact with their users (vs. the community Open MPI, where hundreds of people download Open MPI a day and we never hear from them). I *believe* that this plan is also OK with Sun (I'm pretty sure Terry told me this last week), but I don't want to speak for them.

My $0.02.

--
Jeff Squyres
jsquy...@cisco.com
