Toon Knapen wrote:
> Mark Hahn wrote:
>> unless most of your IPC is this kind of async, unsync, passive data
>> reference, I wouldn't think twice: go MPI.  the current media frenzy
>> about multicore systems (nothing new!) doesn't change the picture much.
>
> Because everybody is going multi-core, everybody is pushing multi-threading to exploit these architectures (e.g. the gaming world and many more). IIUC you're saying that MPI might better exploit these architectures? Interesting POV!

Multicore has some interesting upsides. The downside, oversubscription of the memory pipes coming out of the sockets, reminds me of the days of the larger big-bus SMP boxes in the early/mid 90s.

First, shared memory is nice and simple as a programming model, and multicore suggests that it should be very easy to exploit. But you still have to worry about contention, affinity, and everything else we used to have to worry about a decade ago on the big machines. The precious resource whose utilization you need to optimize is no longer CPU cycles, but memory bandwidth.
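To put the affinity and bandwidth point in concrete terms, here is a minimal OpenMP sketch (my illustration, not part of the original argument; the array size and the kernel are assumptions): first-touch page placement on a NUMA box, followed by a memory-bound loop that stops scaling once the socket's memory pipes are saturated.

/* A minimal sketch, not from the original post: first-touch placement
 * plus a bandwidth-bound kernel.  Build with something like:
 *   gcc -O2 -fopenmp firsttouch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* ~64M doubles per array, far larger than cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* First touch: each thread initializes the pages it will later use,
     * so on a NUMA box those pages land near that thread's socket.
     * Initializing serially would put every page behind one memory
     * controller and serialize the bandwidth. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* The "work": roughly one flop per 16 bytes moved, so it is memory
     * bound.  Cores beyond what the memory pipes can feed add nothing. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("sum = %g (threads = %d)\n", sum, omp_get_max_threads());
    free(a);
    free(b);
    return 0;
}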

Second, MPI is a more complex model. It forces you to reconsider how the algorithm is mapped to the hardware, and it makes no assumptions about the hardware, at least in the API. The implementation, on the other hand, might be taught about multi-core, optimizing communication within boxes via shared memory and between boxes by other methods. I think a few of the MPI toolkits do this today (Scali, Intel, Open MPI, ...).
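For contrast, a bare-bones MPI sketch (again my own illustration, not anything taken from those toolkits): the ring exchange below uses the same MPI_Sendrecv call whether the neighbouring rank sits on the same socket or across the fabric; routing it over shared memory or the interconnect is entirely the implementation's business.

/* A hedged sketch of the explicit mapping MPI asks for.
 * Build with:  mpicc -O2 ring.c  and run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Pass a token around the ring.  The API says nothing about
     * multi-core; whether this hop crosses a bus, a socket, or a
     * switch is the implementation's problem. */
    int token = rank, received = -1;
    MPI_Sendrecv(&token, 1, MPI_INT, right, 0,
                 &received, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d got token %d from rank %d\n", rank, received, left);

    MPI_Finalize();
    return 0;
}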

Neither of these models takes into account the fact that memory bandwidth out of a socket is finite. Technically this is an implementation issue, but as core counts grow, some codes, well, larger and larger fractions of the parallel code base, are likely to run into this resource contention.

We were seeing contention for the fabric interconnect (e.g. bus contention) in LAMMPS runs for a customer last year, simply going from single to dual core. It was significant enough that the customer opted for single core. This contention is not going to get better as you increase the number of cores. Since MPI depends, in part, upon a contended-for resource (the interconnect), it is not at all clear to me that MPI will be the *best* choice for programming all the cores, though it certainly would be a simple choice.

Greg is right when he notes that the hybrid model is a challenge. Unfortunately we appear to be facing a regime with multiple layers of hierarchy, so this will need resolution. You can create a globally "optimal" code via MPI that may not be as efficient locally as you would like, and will likely grow less so with more cores, or a locally optimal code via shared memory that never gets out of the box.
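As a rough picture of that hybrid layering, here is a hedged MPI-plus-OpenMP sketch (my own illustration, assuming MPI_THREAD_FUNNELED is sufficient, i.e. only the master thread talks to MPI): threads share memory inside each rank, and ranks pass messages between boxes.

/* A rough sketch of the hybrid layering.
 * Build with:  mpicc -O2 -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Local, shared-memory part of the work. */
    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        local += omp_get_thread_num() + 1;   /* stand-in for real work */
    }

    /* Global, message-passing part: combine the per-rank results. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global result: %g\n", global);

    MPI_Finalize();
    return 0;
}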

Shared memory scales nicely on NUMA machines, assuming 1-2 cores per memory controller. It won't/doesn't scale with 8 cores on one memory bus. How well does STREAM run on Clovertown? The NAS Parallel Benchmarks?
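For anyone who wants to eyeball it, here is a stripped-down triad in the spirit of STREAM (my sketch, not the benchmark itself; the size and the bytes-per-iteration estimate are rough assumptions). Run it with OMP_NUM_THREADS=1,2,4,8 and watch where the GB/s curve flattens.

/* A minimal triad kernel for eyeballing bandwidth scaling; use the
 * real STREAM benchmark for honest numbers.
 * Build with:  gcc -O2 -fopenmp triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* well past any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);

    /* Parallel (first-touch) initialization. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    /* Triad: read b and c, write a -- about 24 bytes moved per
     * iteration (more with write-allocate), one multiply-add of compute. */
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    printf("%d threads: ~%.1f GB/s\n", omp_get_max_threads(),
           (double)N * 24.0 / (t1 - t0) / 1e9);

    free(a); free(b); free(c);
    return 0;
}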

The issue, at the end of the day, is the contended-for resources.

Joe





--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615