On network failures: no.
On fibre path failures: we configure ldiskfs with errors=panic, so fibre issues or other problems in the storage path will likely cause a panic and trigger failover.
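As a rough illustration of the errors=panic setting, here is a minimal Python sketch that rewrites an ext-style, comma-separated mount option string so that any existing errors= behaviour is replaced by errors=panic while the other options are kept. The helper name with_errors_panic and the sample option string are hypothetical and not taken from the setup described in this thread. How the resulting string is applied to the ldiskfs target (for example through the --mountfsoptions option of mkfs.lustre or tunefs.lustre) is an assumption about the site's procedure and is not shown.

```python
def with_errors_panic(options: str) -> str:
    """Return `options` with any errors=... entry replaced by errors=panic."""
    kept = [
        opt.strip()
        for opt in options.split(",")
        # Drop empty entries and any previous errors= behaviour.
        if opt.strip() and not opt.strip().startswith("errors=")
    ]
    # Append the panic behaviour last so it is easy to spot.
    return ",".join(kept + ["errors=panic"])


if __name__ == "__main__":
    # Hypothetical starting options; a real target's options may differ.
    print(with_errors_panic("errors=remount-ro,user_xattr"))
```

Preserving the existing options matters if the tool that applies the string replaces the whole option set rather than appending to it, so check the current options on the target before changing them.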
We're just getting started with failover, so we elected to keep it simple for now.

Jim

On Mon, Jul 13, 2009 at 02:41:09PM -0600, Lundgren, Andrew wrote:
> Are you doing anything if the network fails to one mds?
>
> How about if your fiber path fails?
>
> > -----Original Message-----
> > From: Jim Garlick [mailto:garl...@llnl.gov]
> > Sent: Monday, July 13, 2009 2:39 PM
> > To: Lundgren, Andrew
> > Cc: Carlos Santana; lustre-discuss@lists.lustre.org
> > Subject: Re: [Lustre-discuss] failover software - heartbeat
> >
> > No.  I originally did have it set up like this (a v1 ha.cf snippet):
> >
> > # One partner losing contact with both lnet routers or MDS triggers failover.
> > #ping_group lnet-router 172.16.10.254 172.16.2.254
> > #ping_group tycho-mds1 172.16.10.200 172.16.2.200
> > #respawn hacluster /usr/lib64/heartbeat/ipfail
> >
> > However, I ran into a problem when rebooting the MDS.  Apparently if
> > one partner re-establishes contact with the MDS before the other one,
> > it immediately triggers failover.  This is with heartbeat-2.1.4.
> >
> > Jim
> >
> > On Mon, Jul 13, 2009 at 02:25:17PM -0600, Lundgren, Andrew wrote:
> > > Were you able to get monitoring working to detect network failures?
> > > (pingd?)
> > >
> > > I have it configured, but haven't been able to get it to trigger a
> > > failover when an MDS cannot ping the network.  (I tried with 1.0 and
> > > 2.0 conf files; I am currently using 2.0.)  I have a ticket open with
> > > the pacemaker project (no ticket system for the HA stuff...) but no
> > > resolution.  I am considering writing a script to down the node when
> > > the ping fails, but don't like the idea.
> > >
> > > I would also like to get the hpingd functioning to detect a fiber
> > > failure, but there was less available on that solution.
> > >
> > > --
> > > Andrew
> > >
> > > > -----Original Message-----
> > > > From: Jim Garlick [mailto:garl...@llnl.gov]
> > > > Sent: Monday, July 13, 2009 2:21 PM
> > > > To: Lundgren, Andrew
> > > > Cc: Carlos Santana; lustre-discuss@lists.lustre.org
> > > > Subject: Re: [Lustre-discuss] failover software - heartbeat
> > > >
> > > > We recently put heartbeat v1 in production and along the way
> > > > developed some admin scripts, including heartbeat resource agent
> > > > compliant lustre init scripts, a script to initiate
> > > > failover/failback and get detailed status, a powerman stonith
> > > > interface, and various safeguards to ensure MMP is on, devices are
> > > > present and usable, etc. before starting lustre.
> > > >
> > > > If this is of general interest I could post it to a bug for review.
> > > >
> > > > Jim
> > > >
> > > > On Mon, Jul 13, 2009 at 01:45:02PM -0600, Lundgren, Andrew wrote:
> > > > > It is very difficult to find relevant documentation for
> > > > > heartbeat 1/2.  I just finished configuring a heartbeat system
> > > > > and would not recommend it because of the documentation.  (They
> > > > > seem to have removed portions of the heartbeat documentation
> > > > > from the site.)
> > > > >
> > > > > Pacemaker is not a simple solution to configure either.  I
> > > > > played briefly with the RH clustering software.  It does not
> > > > > directly support any FS type other than the basic ext2/ext3, and
> > > > > wasn't happy with a lustre type.
> > > > >
> > > > > --
> > > > > Andrew
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: lustre-discuss-boun...@lists.lustre.org
> > > > > > [mailto:lustre-discuss-boun...@lists.lustre.org]
> > > > > > On Behalf Of Carlos Santana
> > > > > > Sent: Monday, July 13, 2009 11:42 AM
> > > > > > To: lustre-discuss@lists.lustre.org
> > > > > > Subject: [Lustre-discuss] failover software - heartbeat
> > > > > >
> > > > > > Howdy,
> > > > > >
> > > > > > The lustre manual recommends heartbeat for handling failover.
> > > > > > Pacemaker is the successor of heartbeat version 2.  So what's
> > > > > > recommended - should we be using pacemaker or stick to
> > > > > > heartbeat?
> > > > > >
> > > > > > -
> > > > > > CS.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss