----- Forwarded message from Håkon Bugge <[EMAIL PROTECTED]> -----
From: Håkon Bugge <[EMAIL PROTECTED]>
Date: Wed, 10 Nov 2004 10:37:56 +0100
To: Chris Sideroff <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: [Beowulf] torus versus (fat) tree topologies
X-Mailer: QUALCOMM Windows Eudora Version 6.2.0.14
Chris,

I have a view on the topic, having delivered professional software for both 2D and 3D SCI torus topologies as well as for GbE, Myrinet, and IB centralized-switch topologies.

First, in such a discussion it is hard to separate _implementation_ from _architecture_. I would state that an implementation of a 2D/3D torus topology can have very short latencies. But why is that? Using SCI from Dolphin you would observe very short latencies, but this stems from the NIC SW/HW architecture, not from the topology per se, in my view. For example, a 64-bit, 66 MHz PCI Dolphin NIC has lower latency than a modern PCI-e NIC from Mellanox, both measured with a direct cable (i.e., a two-node ringlet in the SCI case) using the same SW stack (Scali MPI Connect). [However, the payload does not have to grow very large before the lack of bandwidth becomes a hindrance to latency. A 4x IB PCI-e one-way transfer with 1k payload runs at about 600 MB/s, which, as a side note, is faster than Cray XD1, despite their claim of being 2x faster than 4x IB on short messages ;-)]

However, a primary trend seems to make torus topologies less attractive. Although it is true that the bi-section bandwidth scales (though not linearly) with the size of the system, a decent bi-section bandwidth requires the links which make up the torus to be significantly faster than the I/O of the compute nodes. For example, look at a 1D torus (ring). How much total bandwidth is available to the attached nodes, assuming uniformly distributed traffic? The answer is approximately 1.5 times the bandwidth of the individual link segments. Given the link speed and the effective compute-node I/O bandwidth through the actual NIC, it is simple arithmetic to calculate how many nodes it is sensible to have in each dimension of the torus. However, my observation is that the link speed of today's interconnects and the I/O speed of the nodes seem to get closer and closer.
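[Editor's note: the "simple arithmetic" above can be sketched as follows. This is an illustrative back-of-the-envelope model, not from the original post; all bandwidth figures are assumed example values. A hop-counting bound on a unidirectional ringlet gives a small constant multiple of the link bandwidth (about 2x), the same order as the ~1.5x quoted above, which presumably also accounts for protocol overhead.]

```python
# Illustrative sketch: uniform all-to-all traffic on a 1D torus (ring).
# Each message crosses avg_hops links on average, so total link capacity
# divided by avg_hops bounds the sum of all nodes' injection rates.

def ring_aggregate_bw(n_nodes, link_bw, bidirectional=True):
    """Upper bound on aggregate injection bandwidth under uniform traffic."""
    if bidirectional:
        avg_hops = n_nodes / 4            # shortest path, either direction
        capacity = 2 * n_nodes * link_bw  # n links, both directions usable
    else:
        avg_hops = n_nodes / 2            # unidirectional ringlet (SCI-style)
        capacity = n_nodes * link_bw
    return capacity / avg_hops

link_bw = 500.0   # MB/s per link direction -- assumed example value
node_io = 300.0   # MB/s effective node I/O through the NIC -- assumed example

# Largest ring in which every node can still drive its full I/O rate:
n = 2
while ring_aggregate_bw(n + 1, link_bw, bidirectional=False) >= (n + 1) * node_io:
    n += 1
print(n)  # nodes per dimension before links, not NICs, become the bottleneck
```

Note that the aggregate bound is independent of ring size (2x the link bandwidth for a unidirectional ringlet), which is exactly why torus links must be significantly faster than node I/O for the dimension size to be useful.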
If this is true, I would claim that the applicability of torus topologies for systems with I/O-bus attachment will become less attractive over time. The latter is from a _bandwidth_-centric view.

Other factors in deciding which of the two topologies is the better fit have to a large extent been commented on already. One issue, though, is that on-site spare parts are fewer and less expensive for tori, but this factor is of course most important for smaller systems, measuring cost_of_spare_parts as a fraction of the total interconnect cost.

Fault-tolerance cost can also be lower with a torus. If one random power supply breaks down in a torus, it must be that of a compute node, and the impact of that is 1/Nth of the system (assuming a decent run-time system which dynamically recalculates routes). If the power supply of a centralized switch breaks down, you lose the whole system. Of course this can be alleviated by (multiple) dual power supplies, etc., but the cost would typically be higher than in the torus case.

Also, an argument in favour of a torus topology could be linear incremental growth cost. Slightly exceeding the no_of_ports available in a switch will sometimes significantly increase the average cost per port, if full bi-section bandwidth is to be maintained.

The obvious drawback of torus topologies is cabling, assuming the torus is implemented with two cables per dimension. You get significantly more cables, implying longer deployment times and more complicated node replacement. In larger systems, though, cabling of centralized switches tends to require very _long_ cables, something you do not need with torus topologies.

We have some interesting HPCC results using the same node hardware and the same SW stack for 2D SCI, GbE, Myrinet, and IB. If interested, we can probably disclose these numbers to you.
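[Editor's note: the incremental-growth-cost point above can be illustrated with standard two-level folded-Clos (fat-tree) arithmetic. This sketch is not from the original post; the port count is an assumed example value, and a careful design could trim the spine layer further.]

```python
# Hedged illustration: why exceeding a single switch's port count jumps
# the switch count (and hence cost per port) when full bisection bandwidth
# must be kept. Two-level folded Clos with k-port switches: each leaf uses
# k/2 ports down and k/2 up, supporting up to k*k/2 nodes in total.

def switches_needed(n_nodes, k):
    """Switches for full-bisection connectivity of n_nodes with k-port switches."""
    if n_nodes <= k:
        return 1                              # one switch suffices
    if n_nodes <= k * k // 2:
        leaves = -(-n_nodes // (k // 2))      # ceil division
        spines = -(-(leaves * (k // 2)) // k) # enough spine ports for all uplinks
        return leaves + spines
    raise ValueError("needs a third tier")

k = 24                 # ports per switch -- assumed example value
for n in (24, 25):     # one node past the single-switch limit
    print(n, switches_needed(n, k))
```

Going from 24 to 25 nodes jumps from one switch to five in this model, which is the non-linear cost step the post refers to; a torus, by contrast, grows by one node's worth of links at a time.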
* Hakon (Hakon.Bugge _AT_ scali.com)
_______________________________________________
Beowulf mailing list, [EMAIL PROTECTED]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
----- End forwarded message -----
