On Thu, Jul 24, 2008 at 16:29, Gerard Petersen <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'm trying to add a third node to a two node working cluster with resources
> in the form of mirrored Xen (and underlying drbd) virtual servers. The two
> node setup works great and as expected. (On failure, the drbd mirrors
> switch master/slave roles, XenU's migrate automatically, etc). The goal is
> to manually spread master slave combinations of the XenU's over the three
> physical nodes.
>
> The third node is already added to heartbeat config, and in standby mode.
> We have constraints in place (full log and config will follow), that work
> with the +INF, 'zero' and -INF values, respectively as Master location,
> Slave location and 'Never' location constraints.
>
> When we take the third node online, where the current XenU's according to
> the constraints are not allowed, the resources somehow all are moved to
> the third node, where no xen or drbd is present yet. It seems some of the
> constraints are completely ignored. We have tried this, among other
> things, with the symmetric_cluster value True and False, but no luck.
>
> Furthermore the log shows that the resources become 'too active', and after
> that they become unmanaged.

When a new node joins the cluster, we check to see if it's running any
of the cluster resources.
These checks occur regardless of any location constraints (precisely
so that we can enforce them for you).

What can happen, however, is that these checks fail.
Sometimes they fail because the service was unexpectedly found to be
active on the node.
Sometimes it's because the resource agent (or the software it tries to
talk to) isn't installed.

In your case, it seems the RA is misbehaving and incorrectly telling
the cluster that the resources are active,
e.g.:
    <lrm_rsc_op id="server128_monitor_0" operation="monitor"
        crm-debug-origin="build_active_RAs"
        transition_key="15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
        transition_magic="0:0;15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
        call_id="6" crm_feature_set="2.0" rc_code="0" op_status="0"
        interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>

rc_code="0" is the relevant piece of information: for a monitor
operation, 0 means the resource was found running.
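
To make that concrete, here is a small sketch in Python (not part of the
cluster software) that reads an lrm_rsc_op entry like the one above and
reports how a probe result would be read. It assumes the standard OCF
return codes (0 = success, 5 = not installed, 7 = not running); the
helper name classify_probe and the shortened sample entry are made up
for illustration.

```python
import xml.etree.ElementTree as ET

# Standard OCF return codes that matter for a probe
# (a monitor operation with interval="0").
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5
OCF_NOT_RUNNING = 7


def classify_probe(op_xml):
    """Describe a probe result (hypothetical helper, for illustration only)."""
    op = ET.fromstring(op_xml)
    if op.get("operation") != "monitor" or op.get("interval") != "0":
        return "not a probe"
    rc = int(op.get("rc_code"))
    if rc == OCF_SUCCESS:
        return "active"
    if rc == OCF_NOT_RUNNING:
        return "not running"
    if rc == OCF_ERR_INSTALLED:
        return "not installed"
    return "failed (rc=%d)" % rc


# Shortened copy of the entry above; only the attributes used here are kept.
sample = ('<lrm_rsc_op id="server128_monitor_0" operation="monitor" '
          'rc_code="0" op_status="0" interval="0"/>')
print(classify_probe(sample))
```

On a node with no xen or drbd installed, one would expect the RA to
answer "not running" or "not installed" here rather than "active".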

The cluster then thinks that the service is active on more than one
node and tries to recover.
But the RA then compounds the initial problem by failing to stop the service:

    <lrm_rsc_op id="server128_stop_0" operation="stop"
        crm-debug-origin="build_active_RAs"
        transition_key="25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
        transition_magic="0:1;25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
        call_id="12" crm_feature_set="2.0" rc_code="1" op_status="0"
        interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>

Again, rc_code="1" is the part indicating failure.

At that point the cluster can do nothing (since stonith is disabled).
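
As a rough illustration of that last step, the Python sketch below picks
out failed stop operations from a status fragment and shows the choice
the cluster is left with. It is a simplification under stated
assumptions, not the cluster's actual logic: the helper names are
hypothetical, the sample fragment is cut down to the attributes used,
and the "fence the node" branch reflects the usual behaviour when
stonith is enabled.

```python
import xml.etree.ElementTree as ET


def failed_stops(lrm_xml):
    """Return ids of stop operations with a non-zero rc_code (hypothetical helper)."""
    root = ET.fromstring(lrm_xml)
    return [op.get("id") for op in root.iter("lrm_rsc_op")
            if op.get("operation") == "stop" and op.get("rc_code") != "0"]


def recovery_after_failed_stop(stonith_enabled):
    # Simplified assumption: a failed stop can only be resolved by fencing
    # the node; without stonith the resource is left alone (unmanaged).
    if stonith_enabled:
        return "fence the node"
    return "leave the resource unmanaged"


# Cut-down sample modelled on the two entries quoted in this mail.
sample = """<lrm_resource id="server128">
  <lrm_rsc_op id="server128_monitor_0" operation="monitor" rc_code="0" interval="0"/>
  <lrm_rsc_op id="server128_stop_0" operation="stop" rc_code="1" interval="0"/>
</lrm_resource>"""

for op_id in failed_stops(sample):
    print(op_id, "->", recovery_after_failed_stop(stonith_enabled=False))
```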

>
> Some notes to clarify the setup (and make the log more readable):
>
> We run heartbeat version 2.1.3-5~bpo40+1 from debian backports. At the
> time of testing, one node was still on 2.1.3-2~bpo40+1.
>
> Physical nodes:
> server010 (still to be added)
> server011
> server012
>
> Virtual servers (the resources):
> server128 - server133
>
> All resources have constraints allowing a primary role on server011 and
> secondary role on server012 (or vice versa), and are not allowed on
> server010.
>
> # Attached files are:
>
> - cleancib.xml
> The one we started off with.
>
> - fullcib.xml
> The most recent full dump (with counters etc. added by the cluster
> software itself).
>
> - syslog.clusterlog.080722.full(.tgz)
> A cleaned up syslog wherein, with different values for symmetric_cluster,
> the trail can be followed how all resources became too active, and ended up
> unmanaged on server010
>
> - syslog.clusterlog.080722.part(.tgz)
> A stripped version of the previous one with only one trail, hopefully
> isolating enough information for easier analysis.
>
> It looks like the behaviour deviates from what the docs describe in
> relation to the symmetric_cluster directive, or it's just a very ugly typo
> somewhere .. :-)
>
> I sincerely hope somebody can pinpoint the weak spot.
>
> Thanx a lot!!
>
>
> Kind regards,
>
> Gerard.
>
>
>
> --
> ~
> ~
> :wq!
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>