Dear Andrew,

Thanks for your response.

I see two options/conclusions on which I would like your feedback:

- Enable stonith, so that the attempt to start the resources on the third node is 'naturally' disabled and the resources are therefore moved back to the first two nodes by the cluster software.

- Install Xen (and drbd) on the third node, so the cluster software gets a chance to initialise some commands and get a proper answer, and can see that the resources don't belong here.


Kind regards,

Gerard.

Andrew Beekhof wrote:
On Thu, Jul 24, 2008 at 16:29, Gerard Petersen <[EMAIL PROTECTED]> wrote:
Hi all,

I'm trying to add a third node to a working two-node cluster with resources
in the form of mirrored Xen (and underlying drbd) virtual servers. The
two-node setup works great and as expected. (On failure, the drbd mirrors
switch master/slave roles, XenU's migrate automatically, etc). The goal is
to manually spread master/slave combinations of the XenU's over the three
physical nodes.

The third node is already added to the heartbeat config, and in standby mode.
We have constraints in place (full log and config will follow) that work
with the +INF, 'zero' and -INF values, respectively as Master location,
Slave location and 'Never' location constraints.

When we take the third node online, where the current XenU's according to
the constraints are not allowed, the resources somehow all are moved to
the third node, where no xen or drbd is present yet. It seems some of the
constraints are completely ignored. We have tried this, among other
things, with the symmetric_cluster value True and False, but no luck.

Furthermore, the log shows that the resources become 'too active', and after
that they become unmanaged.

When a new node joins the cluster, we check to see if it's running any
of the cluster resources.
These checks occur regardless of any location constraints (precisely
so that we can enforce them for you).

What can happen however, is that these checks may fail.
Sometimes they fail because the service was unexpectedly found to be
active on the node.
Sometimes it's because the resource agent (or the software it tries to
talk to) isn't installed.
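
The probe outcomes described above can be sketched as a small lookup. This is a minimal illustration only, assuming the standard OCF resource agent exit codes (0 = success/running, 1 = generic error, 5 = not installed, 7 = not running). The function name interpret_probe is hypothetical, and the sketch says nothing about how this heartbeat version reacts to each code.

```python
# Standard OCF resource agent exit codes (values from the OCF RA convention).
OCF_SUCCESS = 0        # for a monitor/probe: the resource is running
OCF_ERR_GENERIC = 1    # unspecified failure
OCF_ERR_INSTALLED = 5  # required software is missing on this node
OCF_NOT_RUNNING = 7    # the resource is cleanly stopped


def interpret_probe(rc_code: int) -> str:
    """Hypothetical helper: describe what a probe result claims about a node."""
    if rc_code == OCF_SUCCESS:
        return "active"
    if rc_code == OCF_NOT_RUNNING:
        return "inactive"
    if rc_code == OCF_ERR_INSTALLED:
        return "not installed"
    return "failed"


if __name__ == "__main__":
    for rc in (0, 7, 5, 1):
        print(rc, interpret_probe(rc))
```

On a node without Xen or drbd, one would expect a probe to come back as "inactive" or "not installed" rather than "active".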

In your case, it seems the RA is misbehaving and incorrectly telling
the cluster that the resources are active, e.g.:
             <lrm_rsc_op id="server128_monitor_0" operation="monitor"
crm-debug-origin="build_active_RAs"
transition_key="15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
transition_magic="0:0;15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
call_id="6" crm_feature_set="2.0" rc_code="0" op_status="0"
interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>

rc_code="0" being the relevant piece of information

The cluster then thinks that the service is active on more than one
node and tries to recover.
But the RA then compounds the initial problem by failing to stop the service:

             <lrm_rsc_op id="server128_stop_0" operation="stop"
crm-debug-origin="build_active_RAs"
transition_key="25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
transition_magic="0:1;25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
call_id="12" crm_feature_set="2.0" rc_code="1" op_status="0"
interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>

again, rc_code="1" being the part indicating failure.

at which point the cluster can do nothing (since stonith is disabled)
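
To spot such entries in a CIB dump without reading the XML by eye, the rc_code attributes can be extracted with a short script. This is a rough sketch using Python's standard xml.etree.ElementTree. The <ops> wrapper element and the summarise_ops helper are assumptions made so the fragment parses on its own, and the two entries are abbreviated copies of the ones quoted above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the two operations quoted above. The <ops> wrapper is
# hypothetical; in a real dump these elements sit inside the status section.
STATUS_FRAGMENT = """
<ops>
  <lrm_rsc_op id="server128_monitor_0" operation="monitor"
              rc_code="0" op_status="0" interval="0"/>
  <lrm_rsc_op id="server128_stop_0" operation="stop"
              rc_code="1" op_status="0" interval="0"/>
</ops>
"""


def summarise_ops(xml_text):
    """Return (id, operation, rc_code) for every lrm_rsc_op element."""
    root = ET.fromstring(xml_text)
    return [
        (op.get("id"), op.get("operation"), int(op.get("rc_code")))
        for op in root.iter("lrm_rsc_op")
    ]


if __name__ == "__main__":
    for op_id, operation, rc in summarise_ops(STATUS_FRAGMENT):
        print(f"{op_id}: {operation} returned rc_code={rc}")
```

A monitor with interval="0" that returns 0 on a node where the resource was never started, followed by a stop that returns 1, is the pattern described here.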

Some notes to clarify the setup (and make the log more readable):

We run heartbeat version 2.1.3-5~bpo40+1 from debian backports. At the
time of testing, one node was still on 2.1.3-2~bpo40+1.

Physical nodes:
server010 (still to be added)
server011
server012

Virtual servers (the resources):
server128 - server133

All resources have constraints allowing a primary role on server011 and
a secondary role on server012 (or vice versa), and they are not allowed on
server010.
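
The intent of this scoring scheme can be written down as a toy model. It is an illustration of the intended placement only, not of how the cluster's policy engine computes scores. The scores mapping and the allowed_nodes helper are hypothetical, with math.inf standing in for INFINITY.

```python
import math

INF = math.inf

# Hypothetical location scores for one resource, mirroring the scheme above:
# +INF for the Master location, 0 for the Slave location, -INF for 'Never'.
scores = {"server011": INF, "server012": 0, "server010": -INF}


def allowed_nodes(node_scores):
    """Return nodes ordered by preference, dropping any scored -INF."""
    permitted = {n: s for n, s in node_scores.items() if s != -INF}
    return sorted(permitted, key=lambda n: permitted[n], reverse=True)


if __name__ == "__main__":
    print(allowed_nodes(scores))
```

Under this reading server010 should never be selected for any of the resources, which is why the observed move to that node is surprising.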

# Attached files are:

- cleancib.xml
The one we started off with.

- fullcib.xml
The most recent full dump (with counters etc. added by the cluster
software itself).

- syslog.clusterlog.080722.full(.tgz)
A cleaned-up syslog in which, with different values for symmetric_cluster,
the trail can be followed of how all resources became too active and ended
up unmanaged on server010.

- syslog.clusterlog.080722.part(.tgz)
A stripped version of the previous one with only one trail, hopefully
isolating enough information for easier analysis.

It looks like the behaviour deviates from what the docs describe in
relation to the symmetric_cluster directive, or it's just a very ugly typo
somewhere .. :-)

I sincerely hope somebody can pinpoint the weak spot.

Thanks a lot!


Kind regards,

Gerard.



--
~
~
:wq!
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems




--
urls
{'fun': 'www.zonderbroodje.nl', 'tech': 'www.gp-net.nl'}
