On Mon, 8 Jun 2009, NiftyOMPI Tom Mitchell wrote:
> ??? dual rail does double the number of switch ports. If you want to
> address switch failure each rail must connect to a different switch.
> If you do not want to have isolated fabrics you must have some
> additional ports on all switches to connect the two fabrics and enough
> of them to maintain sufficient bandwidth and connectivity when a switch
> fails. Thus, you are doubling the fabric unless I am missing something.
Well, it is pretty much research for now. But yes, we want each port to be
connected to a different switch so that both cable and switch failures can
be survived.
Open MPI currently needs the fabrics to be connected, but maybe that's
something we would like to change in the future, to allow two separate rails.
(Btw Pasha, will your current work enable this?)
> Is your second set of switches so minimally connected that the second
> tree can be installed with a small switch count?
That's the idea, yes. For example, you could have a primary QDR fat-tree
network and a failover non fat-tree DDR one (potentially recycled from a
previous machine).
> What are the odds, when port 1 fails, that port 2 is going to
> be live? Cable/connector errors would be the most likely
> case where port 2 would be live. In general if port 1 fails
> I would expect port 2 to have issues too.
Well, depending on the errors you want to be able to survive, you may have
2 cards, in which case there is no reason why a port 1 failure would cause
port 2 to fail too. But in all cases, switch and cable errors are a
concern to us.
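
To make the failure cases above concrete, here is a minimal sketch that
lists which single-component failures cut a node off entirely. The
component names (hca0, cable1, switchA, ...) and both helper functions are
hypothetical and only for illustration; the model assumes each rail is
usable as long as its card, cable and switch are all alive, and it says
nothing about what Open MPI itself does.

```python
def surviving_rails(rails, failed):
    """Return the rails that use none of the failed components."""
    return [rail for rail in rails if not (set(rail) & failed)]

def fatal_single_failures(rails):
    """Return the components whose failure alone leaves no usable rail."""
    components = sorted({c for rail in rails for c in rail})
    return [c for c in components if not surviving_rails(rails, {c})]

# Each rail is (card, cable, switch); every port goes to a different switch.
one_card = [("hca0", "cable1", "switchA"), ("hca0", "cable2", "switchB")]
two_cards = [("hca0", "cable1", "switchA"), ("hca1", "cable2", "switchB")]

print(fatal_single_failures(one_card))   # ['hca0']
print(fatal_single_failures(two_cards))  # []
```

With one dual-port card, cable and switch failures are survived but the
card itself remains a single point of failure; with two cards, no single
component in this simple model takes down both rails.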
Sylvain