2%? Come on.

How do you plan to lose 'just 2%' if you make heavy use of MPI?

Let's be realistic: with respect to matrix calculations HPC can be relatively efficient. But as soon as we discuss algorithms that tend to be sequential, they are rather hard to parallelize on an HPC box. Even very good scientists usually lose a factor of 50 or so there, algorithmically.
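
To make that loss concrete, here is a rough Amdahl's-law sketch; the serial fractions and processor count below are illustrative assumptions, not measurements of any particular code:

    /* Rough Amdahl's-law sketch: speedup = 1 / (s + (1 - s) / p),
     * where s is the serial fraction and p the number of processors.
     * The serial fractions below are illustrative assumptions only. */
    #include <stdio.h>

    static double amdahl(double serial_fraction, int procs)
    {
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / procs);
    }

    int main(void)
    {
        int procs = 1024;
        double fractions[] = { 0.001, 0.01, 0.05 };
        for (int i = 0; i < 3; i++)
            printf("serial %.3f on %d procs -> speedup %.1f\n",
                   fractions[i], procs, amdahl(fractions[i], procs));
        return 0;
    }

Even a 5% serial fraction caps a 1024-processor run at roughly a 20x speedup, which is where factors like the 50 above come from.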

It is questionable whether software that is embarrassingly parallel should be run on megamillion-dollar machines that are easily a factor of 5 less efficient in power, provided it can run reasonably well on normal PC/CUDA/Brook type hardware
(meaning that some scientists love RAM just a tad too much; I'd argue there is almost always an algorithm possible, though sometimes very complex, that gets a lot of performance with a tad less RAM,
after which you can move back to cheaper hardware).

I'd argue there is a very BIG market for a shared-memory NUMA approach, one that however has a better solution for I/O and timing (so not using some sort of central clock and central I/O processors
like SGI did on the Origin boxes).

The few shared-memory machines that historically were faster than a PC were so much more expensive than a PC, just to increase speed by a factor of 2, that it is interesting to see what will happen here.

The step from writing multithreaded/multiprocessing software that works on NUMA hardware to
an MPI-type model is really big.
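
As a minimal illustration of that step (the array size and the computation are made up for the example): what is a plain loop over a shared buffer in the NUMA world becomes explicit decomposition plus explicit message traffic under MPI.

    /* Minimal sketch (illustrative only): summing an array.
     * Shared memory: every thread simply reads the same buffer.
     * MPI: each rank owns only its slice, and the partial results
     * must be explicitly reduced across processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1 << 20;          /* total elements, chosen arbitrarily */
        int chunk = N / size;           /* each rank works on its own slice   */
        double local = 0.0, global = 0.0;

        for (int i = 0; i < chunk; i++) /* local computation only */
            local += 1.0;               /* stand-in for the real work */

        /* the explicit communication step shared memory never needs */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %.0f\n", global);
        MPI_Finalize();
        return 0;
    }

Every data structure that was simply "there" in shared memory has to be partitioned and communicated like this, which is exactly why the porting step is so big.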

What happens as a result is that those MPI-type codes are usually not very well optimized programs: the "one-eyed software in the land of the blind", so to speak.

Sometimes that has very egoistic reasons. I've seen cases where doing more calculations gives bigger round-off errors, which after a few months propagate back into the root big time, sometimes letting the scientist draw the result he wanted to draw, instead of objectively having to explain why the 'commercial' model that computes quickly (such a model sometimes exists, which is how we know this) doesn't show those weird 'random' results, so no new theory can be concluded.

I would be really amazed if more than 50% of the people on this HPC list get an efficiency of over 2% on their typical workloads.

We simply shouldn't praise ourselves as being better than we are. Having lots of processors also makes most scientists very lazy. That isn't bad at all; the reason the majority of scientists use HPC is that you can take a look into the future and see what happens,
giving an advantage over a PC.

That said, there are a few fields where the efficiency IS really, really high.

But other than some guys who are busy with encryption, I wouldn't be able to name a single one to you. Yet you could also argue that those guys in fact waste the most resources of anyone, as there are special co-processors (for embedded use, for example) and special dedicated processors (using a LOT of watts) that are thousands of times faster than what you can do on a generic CPU, in which case the 2% rule is still valid.

In HPC there is, however, one thing I really miss. I'm convinced it could exist: a kind of GPU-type CPU, with a lot of memory controllers attached, doing its calculations in double precision. A small team of 5 people could build it, and the clock would be, oh, 300-350 MHz or so?

So the investment in itself isn't big. Getting to 1 teraflop of double precision per chip shouldn't be a big problem.
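
A back-of-the-envelope check on what that would take; the clock values come from the range above, and the assumption that one FMA counts as 2 flops is mine, purely for illustration:

    /* Back-of-the-envelope sketch: how many double-precision FMA units
     * a hypothetical 300-350 MHz chip would need to reach 1 TFLOP.
     * The 2-flops-per-FMA convention is an illustrative assumption,
     * not a real design. */
    #include <stdio.h>

    int main(void)
    {
        double target_flops = 1e12;           /* 1 teraflop, double precision */
        double clocks[] = { 300e6, 350e6 };   /* the clock range mentioned above */

        for (int i = 0; i < 2; i++) {
            double flops_per_cycle = target_flops / clocks[i];
            double fma_units = flops_per_cycle / 2.0;  /* one FMA = 2 flops */
            printf("%.0f MHz: %.0f flops/cycle, ~%.0f FMA units\n",
                   clocks[i] / 1e6, flops_per_cycle, fma_units);
        }
        return 0;
    }

So roughly 1400-1700 wide double-precision FMA lanes at those clocks: a lot of silicon, but conceptually simple.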

Where is that CPU?

Did no one care to design it as they can't make billions of dollars with it?

Vincent

On Sep 25, 2008, at 12:20 AM, Mark Hahn wrote:

that, perhaps serendipitously, these service level delays due to nodes
not being completely optimized for cluster use don't result in a
significant reduction of computation speed until the size of the
cluster is about at the point where one would want a full-time admin
just to run the cluster.

no, not really. the issue is more like "how close to the edge are you?" it's the edge-closeness (relative to cluster capabilities) that matters.

that is, if your program has very frequent global synchronization,
you're going to want low jitter. yes, exponentially more so as the size of the job grows, but the importance of the issue also grows as your CPU increases in speed, as your interconnect improves, etc.
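
[A minimal sketch of the kind of "very frequent global synchronization" loop described here; the iteration count and the computation are illustrative only:

    /* Every iteration ends in a collective, so the slowest (most
     * jittered) rank sets the pace for all ranks on every step. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double local = 1.0, global = 0.0;

        for (int step = 0; step < 100000; step++) {
            local *= 1.000001;              /* stand-in for a short compute phase */
            /* global collective every step: OS noise on any one rank
             * delays all ranks right here */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }
]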

similarly, if you have an app which is finely cache-tuned,
it'll hurt, possibly a lot, when monitoring/etc takes a bite out.

don't worry about these service details too much, just do your work
knowing that you're maybe losing 2% speed (this number is a total
guesstimate).

2% might be reasonable if you're doing very non-edge stuff - for instance, a lot of embarrassingly parallel or serial-farm workloads that don't use a lot of memory. it's not that those workloads are less worthy, just that they tolerate a lot more sloppiness.

again, it's the nature of the workload, not just size of the cluster.
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

