Dear developers and users of Open MPI,

Because the links to the following implementation will disappear at the
end of March 2007, I thought it would be worth making this announcement.

During the work on my diploma thesis last year, I developed a new MPI
broadcast algorithm that utilizes a hardware feature called "Multicast".
In most application cases, this algorithm scales independently of the
number of involved MPI processes (even for very small message sizes!).
So you won't see any performance differences between a broadcast to
10, 100 or even 1000 processes. We have shown this behaviour on our
Beowulf-style cluster at Chemnitz University of Technology with 528
compute nodes and a single large Fast Ethernet switch. It usually
outperforms all existing broadcast implementations when more than 8
cluster nodes are used.
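
To put the scaling claim in perspective, here is a rough back-of-the-envelope
sketch in C. It is not taken from the ipmc sources. It assumes that the
point-to-point baseline is a binomial-tree broadcast (a common choice, but an
assumption here) and that multicast is ideal and loss-free, so a single send
reaches all receivers. The function names are made up for illustration.

```c
#include <assert.h>

/* Communication rounds of a binomial-tree broadcast: the number of
 * processes holding the data doubles every round, so the result is
 * ceil(log2(nprocs)). This baseline is an assumption for illustration. */
int binomial_tree_rounds(int nprocs)
{
    assert(nprocs >= 1);
    int rounds = 0;
    long informed = 1;            /* initially only the root has the data */
    while (informed < nprocs) {
        informed *= 2;            /* every informed process sends once */
        rounds++;
    }
    return rounds;
}

/* Idealized multicast (assumption: no packet loss, no retransmission):
 * one send from the root reaches every other process at once. */
int multicast_rounds(int nprocs)
{
    assert(nprocs >= 1);
    return nprocs > 1 ? 1 : 0;
}
```

Under these assumptions the tree needs 4, 7 and 10 rounds for 10, 100 and
1000 processes, while the idealized multicast stays at a single round.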


source code of the original implementation, called "ipmc":
http://www-user.tu-chemnitz.de/~chsi/ipmc_component.tar.gz

latest update patch (necessary because of internal changes in Open MPI!):
http://www-user.tu-chemnitz.de/~chsi/ipmc_update.patch
(successfully tested with the latest SVN revision 14004)

Still not convinced? You can even use it to improve your HPL performance:
http://www-user.tu-chemnitz.de/~chsi/hpl/


A detailed explanation of the algorithm with its performance evaluation
can be found in my diploma thesis "Efficient Broadcast for Multicast-
Capable Interconnection Networks". A more compact description (although
for a different interconnect) can be found in the paper "A practically
constant-time MPI Broadcast Algorithm for large-scale InfiniBand
Clusters with Multicast" (CAC 2007).

advantages:
- alternative MPI_BCAST implementation
- uses hardware multicast and not only point-to-point communication
- scales independently of communicator size
- developed for production use (does not explicitly synchronize,
 makes use of process skew, very balanced, ...)
- supports e.g. Fast/Gigabit Ethernet through IPv4 interface
- implemented as a self-contained Open MPI collective component
- quite simple main algorithm; solves all related problems
- works with any communicator (not only MPI_COMM_WORLD or root=0!)
- no special setup required (works even for multiple jobs per cluster!)
- ensures correct data delivery (CRC + retransmission)
- user-customizable through MCA parameters (for fine tuning)
- especially well-suited for small- and medium-sized messages
- no need for application change or rebuild
 (simply copy the binaries into the Open MPI library directory)
- heavily tested and successfully used with many benchmarks
 (e.g. IMB, HPL) as well as real applications (e.g. Abinit, CPMD)
- < some more that I've forgotten at the time of writing ;-) >
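
The "CRC + retransmission" item can be illustrated with a minimal sketch in
C. This is not the ipmc code: the announcement does not say which CRC variant
or packet layout is used, so the standard CRC-32 (reflected polynomial
0xEDB88320, as in Ethernet and zlib) and the helper payload_is_intact() are
assumptions chosen purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
 * Assumption: the real component may use a different CRC. */
uint32_t crc32_ieee(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones if the lowest bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical receiver-side check: returns 1 if the payload matches
 * the checksum that was sent along with it, 0 if the receiver should
 * ask the sender for a retransmission. */
int payload_is_intact(const unsigned char *payload, size_t len,
                      uint32_t sent_crc)
{
    return crc32_ieee(payload, len) == sent_crc;
}
```

A receiver would call payload_is_intact() on every received fragment and
request the data again whenever it returns 0. Multicast over Ethernet is
typically carried by UDP, which does not guarantee delivery, so some such
check has to happen above the transport.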

disadvantages:
- scales, in the worst case, linearly with the number of processes
- not optimal for small communicators and/or large messages
- only the MPI_BCAST operation is implemented => falls back to the basic
 implementation for all other collective operations
- only tested on a smaller number of clusters (namely with Fast/Gigabit
 Ethernet and IPoIB; IA-32 and x86-64 architectures)
- no automatic parameter adaptation (only default values)
- quite large implementation (due to the many details/solved problems)
- < surely more - so please report them >


Although it is unclear if I can maintain and enhance this project in the
future, it is currently in an (almost) production-ready state. So I am
handing it over to the Open MPI community and hope it will be useful to
someone.

Don't hesitate to contact me [1] if you have any comments. I'm especially
interested in more applications that can benefit from this implementation.


"Now it is possible to use MPI_BCAST for large-scale applications."


Yours sincerely

  Christian Siebert


[1] This e-mail address will become invalid at the end of March 2007 too.

