On Thu, Aug 28, 2008 at 08:41:18AM -0400, Prentice Bisbal wrote:
> 
> Since an InfiniBand fabric needs a subnet manager, should the master
> node have an IB HCA and be connected to the IB network in order to run
> the subnet manager?
> 
> My logic behind this is that the master node will be full
> enterprise-level hardware (redundant everything), and should never go
> down or be rebooted during normal use. I expect the nodes to go down
> more frequently (not fully redundant hardware, higher operating loads,
> etc.).
> 
> Exactly what functions does the subnet manager perform, and what happens
> if it disappears from the IB fabric?
> 
> I've been doing research into IB all day yesterday, and I'm continuing
> today, so please no RTFM answers.

How big a fabric?

The subnet manager (SM) manages the fabric.
The most obvious functions are:
        * assign LIDs (local IDs)
        * set up routing (routing is static, BTW)
        * notice changes

i.e. discovery, configuration, and continuous monitoring of the fabric.
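
To make those three jobs concrete, here is a toy Python model of one SM
"sweep". It is only a sketch under loud assumptions: the function name
sweep(), the dict-of-neighbours fabric description and the node names are
all made up for illustration, and a real SM such as OpenSM works on ports
and switch forwarding tables via management datagrams, not on Python dicts.

```python
from collections import deque

def sweep(links, sm_node):
    """Toy SM sweep (hypothetical model): discover, assign LIDs, route."""
    # Discovery: breadth-first walk of the fabric starting at the SM's node.
    order, seen, queue = [], {sm_node}, deque([sm_node])
    while queue:
        node = queue.popleft()
        order.append(node)
        for peer in sorted(links.get(node, ())):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)

    # Configuration: hand out LIDs in discovery order, starting at 1.
    lids = {node: i + 1 for i, node in enumerate(order)}

    # Static routing: for every destination, walk outward from it and
    # record, at each node reached, the neighbour that leads back to it.
    routes = {node: {} for node in order}
    for dest in order:
        visited, queue = {dest}, deque([dest])
        while queue:
            node = queue.popleft()
            for peer in sorted(links.get(node, ())):
                if peer not in visited:
                    visited.add(peer)
                    routes[peer][lids[dest]] = node
                    queue.append(peer)
    return lids, routes

# Assumed example fabric: a head node and two compute nodes on one switch.
links = {
    "head": ["sw"],
    "sw": ["head", "n1", "n2"],
    "n1": ["sw"],
    "n2": ["sw"],
}
lids, routes = sweep(links, "head")
```

"Noticing changes" in this model would just be running sweep() again and
comparing the result with the previous one. The tables from the last sweep
stay usable in the meantime.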

Once a fabric is live and correctly set up, nothing bad happens if the
subnet manager dies unless something changes (for example, a node or link
comes or goes).  The assigned LIDs continue to be valid and the routes
continue to be valid.  You only lose monitoring.

Some vendor switches can manage the fabric themselves with a built-in
subnet management card (extra $).  In many cases this is the best solution...

If the SM is on the head node it might be easier to watch the SM ....

In the subnet management specification there is stuff about failover...
It is possible to have a second subnet manager running on the fabric.  The
second SM should go idle and only become active if the other one goes silent.
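
The standby behaviour described above can be sketched as a tiny state
machine. This is a toy illustration only: the class name StandbySM, the
timeout value and the on_poll() method are hypothetical, the caller is
assumed to tell it whether the master answered, and it deliberately does
not model handing control back when the old master returns.

```python
class StandbySM:
    """Toy standby subnet manager (hypothetical model, not a real SM)."""

    def __init__(self, timeout_s=10.0, now=0.0):
        self.timeout_s = timeout_s   # assumed silence limit, in seconds
        self.last_seen = now         # last time the master answered
        self.state = "standby"

    def on_poll(self, master_answered, now):
        # Simplification: once we have taken over, we stay master.
        if self.state == "master":
            return self.state
        if master_answered:
            self.last_seen = now
        elif now - self.last_seen >= self.timeout_s:
            # The master has been silent too long: take over the fabric.
            self.state = "master"
        return self.state

sm = StandbySM(timeout_s=10.0, now=0.0)
sm.on_poll(True, now=5.0)     # master alive, stay idle
sm.on_poll(False, now=10.0)   # silent for 5 s only, still idle
sm.on_poll(False, now=15.0)   # silent for 10 s, take over
```

Time is passed in explicitly so the takeover can be exercised on a bench
without waiting on a real clock.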

Caution #1 -- failover is hard to test and multiple SMs may introduce
                instability, so test, test, but do not tinker on a
                production fabric.  Do monitor -- gently is fine.

Caution #2 -- do not mix subnet managers.  If you run a second SM, run one that
                is identical!  Do not mix OpenSM and a managed switch without
                vendor approval and testing....  Do not mix versions of any SM...

Caution #3 -- Like so many things, one is good (required in this case), two
                might be nice, but many is just wrong.

This is a good URL to read and bookmark...

   http://infiniband.sourceforge.net/SM/overview.htm

Google for OpenSM; Cisco's pages have some good stuff too.



-- 
        T o m  M i t c h e l l 
        Got a great hat... now what.

