On Mon, 8 Jun 2009, NiftyOMPI Tom Mitchell wrote:

??? dual rail does double the number of switch ports. If you want to address switch failure, each rail must connect to a different switch. If you do not want to have isolated fabrics, you must have some additional ports on all switches to connect the two fabrics, and enough of them to maintain sufficient bandwidth and connectivity when a switch fails. Thus, you are doubling the fabric unless I am missing something.
Well, it is pretty much research for now. But yes, we want each port to be connected to a different switch so that both cable and switch failures can be survived.

Open MPI currently needs the fabrics to be connected, but maybe that's something we would like to change in the future, to allow two separate rails. (Btw Pasha, will your current work enable this?)

Is your second set of switches so minimally connected that the second tree can be installed with a small switch count?
That's the idea, yes. For example, you could have a primary QDR fat-tree network and a failover non fat-tree DDR one (potentially recycled from a previous machine).

What are the odds when port 1 fails that port 2 is going to
be live?  Cable/connector errors would be the most likely
case where port 2 would be live.  In general if port 1 fails
I would expect port 2 to have issues too.
Well, depending on the errors you want to be able to survive, you may have 2 cards, in which case there is no reason why a port 1 failure would cause port 2 to fail too. But in all cases, switch and cable errors are a concern to us.
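
To make the one-card versus two-card argument a bit more concrete, here is a rough sketch of the arithmetic. It assumes component failures are independent, and the function name, parameters and any probabilities fed to it are purely hypothetical (not measurements from any machine):

```python
def both_rails_down(p_card, p_cable, p_switch, shared_card):
    """Probability that a node loses both rails at the same time.

    Assumes independent failures. All inputs are hypothetical
    per-component failure probabilities, not measured values.
    """
    # One rail's private path (its cable and its switch) is down
    # if either of those two components has failed.
    p_path = 1 - (1 - p_cable) * (1 - p_switch)
    if shared_card:
        # Single dual-port card: a card failure kills both ports;
        # otherwise both private paths must fail.
        return p_card + (1 - p_card) * p_path ** 2
    # Two cards: a rail is down if its own card or its own path failed,
    # and both rails must be down independently.
    p_rail = 1 - (1 - p_card) * (1 - p_path)
    return p_rail ** 2
```

With a single dual-port card, the card itself stays a common point of failure, so the result can never drop below p_card. With two cards, every term is squared, which is why a port 1 failure says little about port 2 in that setup.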

Sylvain
