Re: [Linux-HA] Weird auto_failback behaviour

Dejan Muhamedagic Tue, 27 Mar 2007 07:11:22 -0800

On Tue, Mar 27, 2007 at 04:08:27PM +0200, Max Hofer wrote:
> On Tuesday 27 March 2007 13:11, Dejan Muhamedagic wrote:
> > On Tue, Mar 27, 2007 at 10:17:22AM +0200, Andrew Beekhof wrote:
> > > 
> > > On Mar 27, 2007, at 1:19 AM, Michael Dodd wrote:
> > > 
> > > >Alan Robertson wrote:
> > > >>
> > > >>That would probably be because you've created a split-brain situation,
> > > >>and heartbeat is recovering from it by restarting the services on both
> > > >>machines.
> > > >>
> > > >>http://linux-ha.org/SplitBrain
> > > >>
> > > >>Generally, you want to avoid a split-brain condition.  If you have
> > > >>shared storage you REALLY want to avoid it - since it will trash your
> > > >>data.  http://linux-ha.org/BadThingsWillHappen
> > > >>
> > > >>
> > > >>
> > > >Thanks-I wondered if that's what was happening.
> > > >
> > > >Am I going to need to get STONITH configured for this?  We're not  
> > > >doing any kind of resource sharing on realservers, so I'd like to  
> > > >avoid the complexity there.   We're looking for something similar to  
> > > >what Daniel Bray has mentioned in his recent mail to the list, but  
> > > >ideally I'd like to avoid the added complexity of having to maintain  
> > > >cib.xml.
> > > 
> > > maintain?
> > > > sure it's a bit more complex to set up but what do you mean by maintain?
> > 
> > as a matter of fact, you'll be so much better off with the crm
> > based cluster (v2) when it comes to maintenance. v1 is definitely
> > easier to start with, but once you get the v2 going you'll find it
> > more enjoyable for administration.
> I agree with you from the point of view of a cluster system 
> designer/tester but I disagree from the point of view of a 
> customer (the person who bought the cluster).


hmm. does the customer have skilled personnel? after all, whoever's
going to manage a cluster (any kind of cluster) has to have
certain admin skills. it's definitely not like getting a household
appliance.

> Let's see what operations a normal sysadmin had to do with heartbeat v1
> and compare it to v2:
> 
> heartbeat v1:
> * start/stop heartbeat
> * make a node standby --> forced switchover to the other node

yes, and that's more or less _all_ it can do.

> All those commands are available in v2 BUT the currently used XML
> environment is not ... customer friendly (for someone who does not know
> much about clusters etc.). You have to explain:
> * what resources are and what state they can have
> * how he can retrieve the state
> * how he can see where the resource is running
> * when does a resource change its state
> * what fail-counts are, and what effect they have on the system

as i said, you have to have skilled people to run this kind of
thing, no?

> * the tool set to control the cluster

there's a lack of a friendly tool set, that's true, especially if
one prefers a command line. the gui though should be up to all the
tasks you mentioned here, but i'm afraid that i can't vouch in
that department---i'm not very gui friendly :)

> * certain errors don't even show up (like a stop-restart failure) and you
> need geek-commands like ptest to find them

that's something we should be working on: better logging. ptest,
yes, that's probably not for the average joe.

> The GUI is sometimes not an option because:
> - the cluster runs on linux without installed X

running x11 on a cluster is not a requirement. running x11 on the
sysadmin's workstation is:

    display/x11/keyboard/mouse/user <== haclient.py <== TCP/IP ==> cluster

> - there are no linux clients which could communicate with the cluster

this i don't understand.

> - the client has options which should not be visible to the customer (like
> removing/change resource or resource parameters)

agreed. that's on the todo list, but probably not going to happen
soon.

> Just my 2 cents - i have a hard time explaining all those commands 
> (crm_mon, crm_resource, cibadmin, ptest) to our customer.

right.

i also have some field experience and, though it was definitely
not easy in the beginning, i will now maintain that
maintenance is definitely better with the v2 style config. that
customer of mine got a mgmt script with which they can start/stop
applications (basically a crm_resource wrapper) and do application
upgrades and they never had problems using it. and otherwise they
have no idea about heartbeat; i'm not sure they even know what
is running behind the scenes. the configuration and heartbeat upgrades are,
however, left to people who do understand how heartbeat works.
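
to give an idea of what such a wrapper might look like, here is a
minimal sketch in python. it is not the actual script: the resource
names in ALLOWED_RESOURCES are hypothetical, and it assumes that
start/stop is done by setting the target_role attribute with
crm_resource -p/-v, which may differ between heartbeat versions. by
default it only prints the command it would run.

```python
import shlex
import subprocess

# Assumption: crm_resource from a heartbeat v2 install is on PATH.
CRM_RESOURCE = "crm_resource"

# Hypothetical resource names; the customer may only touch these.
ALLOWED_RESOURCES = {"app_web", "app_db"}


def build_command(action, resource):
    """Return the crm_resource argv for a start/stop/where request."""
    if resource not in ALLOWED_RESOURCES:
        raise ValueError("resource not managed by this script: %s" % resource)
    if action == "where":
        # -W asks where the resource given with -r is currently running.
        return [CRM_RESOURCE, "-W", "-r", resource]
    if action in ("start", "stop"):
        role = "started" if action == "start" else "stopped"
        # Assumption: the resource is started/stopped via target_role.
        return [CRM_RESOURCE, "-r", resource, "-p", "target_role", "-v", role]
    raise ValueError("unknown action: %s" % action)


def run(action, resource, dry_run=True):
    """Print-only by default; pass dry_run=False to really call crm_resource."""
    argv = build_command(action, resource)
    if dry_run:
        return " ".join(shlex.quote(arg) for arg in argv)
    return subprocess.run(argv, check=True)
```

the point of the whitelist is that the customer can start, stop and
locate their applications but cannot remove resources or change
resource parameters.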

this is why i would opt for the v2 cluster:

- software upgrades (incl. heartbeat) are well supported
- resource management is supported on a much finer scale than it
  was before
- overall, there is much better control over everything in the
  cluster
- finer grain configuration

many thanks for the input. i trust that people developing
heartbeat (me included) listen and are trying to make it better.
the field experience is essential there.

> 
> kind regards,
> Max

-- 
Dejan
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
