Roman Shaposhnik wrote:
One question that I still have, though, is what makes you think that
once you're done with porting gcc (big task) and porting HPC apps to
gcc/Plan9 (even bigger one!) they will *execute* faster than they do
on Linux?

Excellent question.
It's all about parallel performance: making sure your 1000 nodes run
1000 times as fast as 1 node, or, if they don't, that it's Somebody
Else's Problem. The reason the OS can impact parallel performance
boils down to the housekeeping tasks that go on in OSes: they can run
at awkward times, interfere with the parallel application, and degrade
its performance. (For another approach, see Cray's synchronised
scheduler work: make all nodes schedule the app at the same time.)
Imagine you have one of these lovely apps on a 1000-node cluster with a
5-microsecond-latency network. Let us further imagine (this stuff
exists; see Quadrics) that you can do a broadcast/global-sum op in 5
microseconds. After 1 millisecond of computing, all the nodes need to
talk to each other, and cannot proceed until they're all agreed on
(say) the value of a computed number -- e.g. some sort of global sum of
a variable held by each of 1000 procs. The generic term for this type
of thing is 'global reduction' -- you reduce a vector to a scalar of
some sort.
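To make 'global reduction' concrete, here's a minimal MPI sketch of
the pattern (generic MPI C, not code from any of our apps): every
rank contributes a value, and nobody proceeds until the sum comes back.

#include <mpi.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    int rank;
    double local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (double)rank;   /* each proc's piece of the vector */

    /* the reduction: blocks until every rank has contributed;
       one late rank delays all of them */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum: %g\n", sum);

    MPI_Finalize();
    return 0;
}
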
The math is pretty easy to do, and it boils down to this: OS activity
can interfere with just one task and kill the parallel performance of
the whole app, making your 1000-node app run like a 750-node app -- or
worse. Every task has to wait at the reduction for the slowest one, so
a delay on any single node delays all 1000. Pretend one node is delayed
one microsecond per interval; do the math; it's depressing. A
one-millisecond compute interval is a really extreme case, chosen for
ease of illustration, but ...
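To spell the math out, here's a back-of-the-envelope model. The
interruption length and per-interval hit probability below are
illustrative assumptions, not measurements: if each node independently
eats an OS interruption with some small probability per interval, the
whole machine stalls whenever *any* node is hit, and at 1000 nodes
that's essentially every interval.

#include <stdio.h>
#include <math.h>

int
main(void)
{
    const double compute = 1000.0; /* us of computing per interval */
    const double reduce  = 5.0;    /* us per global reduction */
    const double hit     = 350.0;  /* us per OS interruption (assumed) */
    const double p       = 0.01;   /* per-node chance of a hit per interval (assumed) */

    for (int n = 1; n <= 1000; n *= 10) {
        /* probability that at least one of n nodes is hit this
           interval; everyone waits for the slowest, so one hit
           stalls them all */
        double p_any = 1.0 - pow(1.0 - p, n);
        double ideal = compute + reduce;
        double real  = ideal + p_any * hit;
        printf("%5d nodes: P(someone delayed) = %.3f, efficiency = %.0f%%\n",
               n, p_any, 100.0 * ideal / real);
    }
    return 0;
}

With those made-up numbers, the 1000-node run comes out around 74%
efficient -- i.e. it runs like a ~740-node machine, which is the
"750-node app" effect above.
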
In the clustering world, what a lot of people do is run real heavy nodes
in clusters -- they have stuff like cron running, if you can believe it!
They pretty much do a full desktop install, then turn off a few daemons,
and away they go. Some really famous companies actually run clusters
this way -- you'd be surprised at who. So do some famous gov't labs.
If they're lucky, interference never hits them. If they're not, they
get less-than-ideal app performance. Then, from the OS interference
that comes with such a bad configuration, they draw a conjecture: you
can't run a cluster node with anything but a custom OS that has no
clock interrupts and, for that matter, no ability to run more than one
process at a time. See the compute node kernel on BG/L for one example,
or the Catamount kernel on Red Storm. Those kernels are really
constrained; running just one proc at a time is only part of the story.
Here at LANL, we run pretty light cluster nodes.
Here is a cluster node running xcpu (under busybox, as you can see):
1 ? S 0:00 /bin/ash /linuxrc
2 ? S 0:00 [migration/0]
3 ? SN 0:00 [ksoftirqd/0]
4 ? S 0:00 [watchdog/0]
5 ? S 0:00 [migration/1]
6 ? SN 0:00 [ksoftirqd/1]
7 ? S 0:00 [watchdog/1]
8 ? S 0:00 [migration/2]
9 ? SN 0:00 [ksoftirqd/2]
10 ? S 0:00 [watchdog/2]
11 ? S 0:00 [migration/3]
12 ? SN 0:00 [ksoftirqd/3]
13 ? S 0:00 [watchdog/3]
14 ? S< 0:00 [events/0]
15 ? S< 0:00 [events/1]
16 ? S< 0:00 [events/2]
17 ? S< 0:00 [events/3]
18 ? S< 0:00 [khelper]
19 ? S< 0:00 [kthread]
26 ? S< 0:00 [kblockd/0]
27 ? S< 0:00 [kblockd/1]
28 ? S< 0:00 [kblockd/2]
29 ? S< 0:00 [kblockd/3]
105 ? S 0:00 [pdflush]
106 ? S 0:00 [pdflush]
107 ? S 0:00 [kswapd1]
109 ? S< 0:00 [aio/0]
108 ? S 0:00 [kswapd0]
110 ? S< 0:00 [aio/1]
111 ? S< 0:00 [aio/2]
112 ? S< 0:00 [aio/3]
697 ? S< 0:00 [kseriod]
855 ? S 0:00 xsrv -D 0 tcp!*!20001
857 ? S 0:00 9pserve -u tcp!*!20001
864 ? S 0:00 u9fs -a none -u root -m 65560 -p 564
865 ? S 0:00 /bin/ash
See how little we have running? Oh, but wait, what's all that stuff in
[]? It's the stuff we can't turn off. Note there is per-cpu stuff, and
other junk. Note that this node has been up for five hours, and this
stuff is pretty quiet (0 run time); our nodes are the quietest (in the
OS interference sense) Linux nodes I have yet seen. But, that said, all
this can hit you.
And, in Linux, people are finding there's a lot of stuff you can't turn
off. Lots of timers down there, lots of magic that goes on, and you
just can't turn it off, or adjust it, try as you might.
Plan 9, our conjecture goes, is a small, tight kernel with lots of
stuff moved to user mode (file systems, for instance); and we believe
the Plan 9 architecture is a good match for future HPC (High
Performance Computing) systems, as typified by Red Storm and BG/L:
small, fixed-configuration nodes with memory, network, CPU, and nothing
else. The ability to have no file system on the node at all is a big
plus. The ability to transparently make the file system remote or local
puts the application in the driver's seat as to how the node is
configured and what tradeoffs are made; the system as a whole is
incredibly flexible.
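As a sketch of what "transparently remote" means in Plan 9 terms (a
minimal example of the standard dial/mount idiom; the server address
is a placeholder I made up): a program dials a 9P file server, mounts
it anywhere in its private namespace, and everything after that is
just file I/O.

#include <u.h>
#include <libc.h>

void
main(void)
{
    int fd;

    /* dial a file server somewhere on the network
       (tcp!fs.example!564 is a placeholder address) */
    fd = dial("tcp!fs.example!564", nil, nil, nil);
    if(fd < 0)
        sysfatal("dial: %r");

    /* graft the remote tree into this process's namespace;
       after this, opens under /n/remote are remote file I/O */
    if(mount(fd, -1, "/n/remote", MREPL, "") < 0)
        sysfatal("mount: %r");

    /* the app, not the node configuration, decided what's
       local and what's remote */
    exits(nil);
}
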
Our measurements, so far, do show that Plan 9 is "quieter" than Linux. A
full Plan 9 desktop has less OS noise than a Linux box at the login
prompt. This matters.
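For what it's worth, the style of measurement is simple. Here is a
generic Linux-side sketch of the technique, not our actual benchmark:
time a fixed quantum of work over and over; on a perfectly quiet
machine every iteration takes the same time, and any outlier is the
OS stealing cycles.

#include <stdio.h>
#include <time.h>

static long long
now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int
main(void)
{
    enum { ITERS = 100000, WORK = 10000 };
    volatile long sink = 0;
    long long best = -1, worst = 0;

    for (int i = 0; i < ITERS; i++) {
        long long t0 = now_ns();
        for (int j = 0; j < WORK; j++)  /* fixed quantum of work */
            sink += j;
        long long dt = now_ns() - t0;
        if (best < 0 || dt < best)
            best = dt;
        if (dt > worst)
            worst = dt;
    }
    /* worst much greater than best means something interrupted us */
    printf("best %lld ns, worst %lld ns, ratio %.1f\n",
           best, worst, (double)worst / best);
    return 0;
}
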
But it only matters if people can run their apps. Hence our concern
about getting gcc-based cra-- er, applications code, running.
I'm not really trying to make Plan 9 look like Linux. I just want to run
MPQC for a friend of mine :-)
thanks
ron