Re: [Linux-HA] Issues with simple failover setup

Stephen Nelson-Smith Mon, 05 Jan 2009 04:04:33 -0800

Hi,
>> Ok - so taking my cluster, erasing the cib with cibadmin -E, and
>> rebooting both nodes.  I've not got httpd starting by default on
>> either machine, so when they come up, I will start httpd  on one
>> machine.  Interestingly the result of cibadmin -E seems to have been
>> that cibadmin -Q now times out,
>
> Shouldn't happen.


I'll try to reproduce that again.
>
>> so I've hacked around a bit deleting
>> /var/lib/heartbeat/crm/cib.xml and trying to load it, by making the
>> admin_epoch bigger than that which seemed to be there (though from
>> where I know not).
>
> Fiddling with cib.xml is allowed only when heartbeat/CRM is not
> running. Otherwise, and that's preferred, use the CRM tools
> (crm_resource, cibadmin, etc).

Sorry - I should have been clearer.  I only made changes to an
exported cib.xml, and imported them using cibadmin.


>> I shouldn't be able to move the resource back to node2 - it still has
>> a failure count > 0.
>>
>> However, it seems I can - using crm_resource -M -r httpd_2 -H node2
>
> This inserts a -INFINITY location constraint...

Ok...

>> Ok - resetting the failcount to 0.  The cluster should be in the same
>> state it was before - let's try to kill apache.
>>
>> This time, apache seems to have restarted on node 2, and there was no
>> failover.  I don't understand this.  The failcount has gone back up to
>> 1, but the resource hasn't moved.
>
> ... which prevents it from ever starting on this node again.
> crm_resource should have printed a warning about it.

I thought that was only if the -H some.host.name is omitted.

> crm_resource -U removes the -INFINITY constraint, hence now the
> cluster should start to behave as you expect it.

I'll try that.
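
A rough sketch of the move / un-move sequence discussed above, using only the flags quoted in this thread. The run helper and its DRY_RUN switch are hypothetical additions so the commands can be previewed before touching a live cluster. The resource and node names are the ones from this setup.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default) commands are only
# printed; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Ask the CRM to move a resource to a node (flags as quoted above).
move_resource() {
    run crm_resource -M -r "$1" -H "$2"
}

# Remove the location constraint that an earlier -M left behind.
unmove_resource() {
    run crm_resource -U -r "$1"
}
```

For example, `move_resource httpd_2 node2` followed later by `unmove_resource httpd_2` should leave the cluster free to place the resource by its normal rules again.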

>> So - what's going on - what have I got wrong?  Also could someone
>> please tell me the canonical way to reset the cluster, and import a
>> new cib.xml?
>
> cibadmin -R -x cib.xml should do (perhaps cibadmin -E before,
> can't recall anymore). It may happen that, if your old resource
> names don't exist in the new configuration, there will be some
> remnants in the status section of the CIB. Those can be removed
> by crm_resource -C or by restarting heartbeat.

Hrm - that's what I've been doing.
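
For reference, the online replacement described above might be scripted roughly as follows. This is a sketch, not a tested procedure: the run helper and DRY_RUN switch are hypothetical, the file name is a placeholder, and it assumes crm_verify accepts -x to check a file on disk and that crm_resource -C is given a resource (-r) and a host (-H).

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default) commands are only
# printed; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Check the new CIB file first, then replace the running CIB with it.
replace_cib_online() {
    new_cib=$1
    run crm_verify -x "$new_cib" || return 1
    run cibadmin -R -x "$new_cib"
}

# Clear leftover status entries for a resource on one node.
cleanup_resource() {
    run crm_resource -C -r "$1" -H "$2"
}
```

Something like `replace_cib_online cib.xml` and then `cleanup_resource httpd_2 node2` for each stale resource/node pair would cover the steps mentioned in the quoted advice.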

> Or stop the cluster, remove cib.xml and cib.xml.sig on all nodes
> (from /var/lib/heartbeat/crm), copy new cib.xml to all nodes,
> start cluster. Use crm_verify to make sure that your cib.xml is
> not broken.

Will try this approach.
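
The offline approach could be sketched per node like this. The /var/lib/heartbeat/crm path comes from the advice above. The /etc/init.d/heartbeat init script path, the run helper and the DRY_RUN switch are assumptions for illustration, as is the use of crm_verify -x to check the file.

```shell
#!/bin/sh
CRM_DIR=/var/lib/heartbeat/crm

# Hypothetical helper: with DRY_RUN=1 (the default) commands are only
# printed; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Phase 1: stop heartbeat (assumed init script path).
stop_cluster() {
    run /etc/init.d/heartbeat stop
}

# Phase 2: verify the new file, remove the old CIB and its signature,
# then put the new CIB in place. Afterwards, check that ownership and
# permissions match what heartbeat expects in that directory.
install_cib() {
    new_cib=$1
    run crm_verify -x "$new_cib" || return 1
    run rm -f "$CRM_DIR/cib.xml" "$CRM_DIR/cib.xml.sig"
    run cp "$new_cib" "$CRM_DIR/cib.xml"
}

# Phase 3: start heartbeat again.
start_cluster() {
    run /etc/init.d/heartbeat start
}
```

The phases are kept separate on purpose: stop_cluster would need to finish on every node before install_cib runs anywhere, and start_cluster comes only after all nodes have the new file, so that no node still running with the old CIB can push it back.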

S.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
