Re: [Linux-HA] Adding third node turns all resources unmanaged

Andrew Beekhof Mon, 28 Jul 2008 01:04:24 -0700

On Mon, Jul 28, 2008 at 09:34, Gerard Petersen <[EMAIL PROTECTED]> wrote:
> Dear Andrew,
>
> Nice one ... But I'm into python and not into C coding... ;-)


Except it's a bash script :)

>
> Seriously, were my conclusions far off,

No.
Installing Xen and drbd on the third node is probably the simplest option.
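
On the RA fix suggested further down the thread (reporting "not installed" rather than success when the software is missing): since you prefer Python, here is a rough sketch of the idea only, not the actual Xen RA. The `probe` helper and the `xm` tool name are assumptions made for illustration. The numeric values are the standard OCF return codes, where the "not installed" code is 5 and is usually spelled OCF_ERR_INSTALLED.

```python
import shutil

# Standard OCF exit codes.
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5
OCF_NOT_RUNNING = 7


def probe(required_tool, is_running, which=shutil.which):
    """Return the exit code a monitor/probe action should report.

    required_tool: name of a binary the resource needs (hypothetical: "xm").
    is_running: callable returning True if the resource is active here.
    which: lookup function, injectable so the logic can be tested.
    """
    # If the software is absent, the resource cannot be active on this node.
    # Say so explicitly instead of falling through to a bogus "success".
    if which(required_tool) is None:
        return OCF_ERR_INSTALLED
    return OCF_SUCCESS if is_running() else OCF_NOT_RUNNING
```

A real agent would end with something like `sys.exit(probe("xm", domain_is_up))`, where `domain_is_up` stands for whatever check the agent already performs.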

> because I'm a bit at a loss here.
>
> Thanx again.
>
> Regards,
>
> Gerard.
>
> Andrew Beekhof wrote:
>>
>> On Mon, Jul 28, 2008 at 08:56, Gerard Petersen <[EMAIL PROTECTED]> wrote:
>>>
>>> Dear Andrew,
>>>
>>> Thanx for your response.
>>>
>>> I see two options/conclusions on which I would like your feedback:
>>>
>>> - Enable stonith, so that the attempt to start the resources on the third
>>> node is 'naturally' disabled and the resources are moved back to the first
>>> two nodes by the cluster software.
>>>
>>> - Install Xen (and drbd) on the third node, so the cluster software gets a
>>> chance to run some commands and get a proper answer showing that the
>>> resources don't belong there.
>>
>> I think you missed the most preferable option... fix the RA to return
>> OCF_NOT_INSTALLED in such cases and send us a patch :-)
>>
>>>
>>> Kind regards,
>>>
>>> Gerard.
>>>
>>> Andrew Beekhof wrote:
>>>>
>>>> On Thu, Jul 24, 2008 at 16:29, Gerard Petersen <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm trying to add a third node to a working two-node cluster with
>>>>> resources in the form of mirrored Xen (and underlying drbd) virtual
>>>>> servers. The two-node setup works great and as expected. (On failure, the
>>>>> drbd mirrors switch master/slave roles, XenU's migrate automatically,
>>>>> etc.) The goal is to manually spread master/slave combinations of the
>>>>> XenU's over the three physical nodes.
>>>>>
>>>>> The third node has already been added to the heartbeat config and is in
>>>>> standby mode. We have constraints in place (full log and config will
>>>>> follow) that use the +INF, 'zero' and -INF values as, respectively,
>>>>> Master location, Slave location and 'Never' location constraints.
>>>>>
>>>>> When we take the third node online, where according to the constraints
>>>>> the current XenU's are not allowed, all the resources are somehow moved
>>>>> to the third node, where no xen or drbd is present yet. It seems some of
>>>>> the constraints are completely ignored. We have tried this, among other
>>>>> things, with symmetric_cluster set to both True and False, but no luck.
>>>>>
>>>>> Furthermore, the log shows that the resources become 'too active', and
>>>>> after that they become unmanaged.
>>>>>
>>>> When a new node joins the cluster, we check to see if it's running any
>>>> of the cluster resources.
>>>> These checks occur regardless of any location constraints (precisely
>>>> so that we can enforce them for you).
>>>>
>>>> What can happen, however, is that these checks may fail.
>>>> Sometimes they fail because the service was unexpectedly found to be
>>>> active on the node.
>>>> Sometimes it's because the resource agent (or the software it tries to
>>>> talk to) isn't installed.
>>>>
>>>> In your case, it seems the RA is misbehaving and incorrectly telling
>>>> the cluster that the resources are active,
>>>> e.g.
>>>>            <lrm_rsc_op id="server128_monitor_0" operation="monitor"
>>>> crm-debug-origin="build_active_RAs"
>>>> transition_key="15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>>> transition_magic="0:0;15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>>> call_id="6" crm_feature_set="2.0" rc_code="0" op_status="0"
>>>> interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>
>>>>
>>>> rc_code="0" being the relevant piece of information
>>>>
>>>> The cluster then thinks that the service is active on more than one
>>>> node and tries to recover.
>>>> But the RA then compounds the initial problem by failing to stop the
>>>> service:
>>>>
>>>>            <lrm_rsc_op id="server128_stop_0" operation="stop"
>>>> crm-debug-origin="build_active_RAs"
>>>> transition_key="25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>>> transition_magic="0:1;25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>>> call_id="12" crm_feature_set="2.0" rc_code="1" op_status="0"
>>>> interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>
>>>>
>>>> again, rc_code="1" being the part indicating failure.
>>>>
>>>> at which point the cluster can do nothing (since stonith is disabled)
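
To spot entries like these in a full CIB dump, something along the following lines may help. It is only a sketch: it assumes the `lrm_rsc_op` elements look like the two quoted above, and the `summarize_ops` helper and the trimmed sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A few standard OCF return codes.
OCF_CODES = {0: "success", 1: "generic error", 5: "not installed", 7: "not running"}


def summarize_ops(xml_text):
    """Return (id, operation, rc_code, meaning) for each lrm_rsc_op element."""
    root = ET.fromstring(xml_text)
    rows = []
    for op in root.iter("lrm_rsc_op"):
        rc = int(op.get("rc_code"))
        rows.append((op.get("id"), op.get("operation"), rc,
                     OCF_CODES.get(rc, "other")))
    return rows


# Trimmed-down versions of the two operations quoted above.
SAMPLE = """
<lrm_resource>
  <lrm_rsc_op id="server128_monitor_0" operation="monitor" rc_code="0"/>
  <lrm_rsc_op id="server128_stop_0" operation="stop" rc_code="1"/>
</lrm_resource>
"""

for row in summarize_ops(SAMPLE):
    print(row)
```

For a probe on a freshly joined node (a monitor with interval 0), a return code of 0 means "active here", which is what triggers the recovery; 7 would have meant "cleanly stopped".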
>>>>
>>>>
>>>>> Some notes to clarify the setup (and make the log more readable):
>>>>>
>>>>> We run heartbeat version 2.1.3-5~bpo40+1 from debian backports. At the
>>>>> time of testing, one node was still on 2.1.3-2~bpo40+1.
>>>>>
>>>>> Physical nodes:
>>>>> server010 (still to be added)
>>>>> server011
>>>>> server012
>>>>>
>>>>> Virtual servers (the resources):
>>>>> server128 - server133
>>>>>
>>>>> All resources have constraints allowing a primary role on server011 and
>>>>> a secondary role on server012 (or vice versa), and none are allowed on
>>>>> server010.
>>>>>
>>>>> # Attached files are:
>>>>>
>>>>> - cleancib.xml
>>>>> The one we started off with.
>>>>>
>>>>> - fullcib.xml
>>>>> The most recent full dump (with counters etc. added by the cluster
>>>>> software itself).
>>>>>
>>>>> - syslog.clusterlog.080722.full(.tgz)
>>>>> A cleaned-up syslog in which, for different values of
>>>>> symmetric_cluster, the trail can be followed of how all resources
>>>>> became too active and ended up unmanaged on server010.
>>>>>
>>>>> - syslog.clusterlog.080722.part(.tgz)
>>>>> A stripped version of the previous one with only one trail, hopefully
>>>>> isolating enough information for easier analysis.
>>>>>
>>>>> It looks like the behaviour deviates from what the docs describe in
>>>>> relation to the symmetric_cluster directive, or it's just a very ugly
>>>>> typo somewhere .. :-)
>>>>>
>>>>> I sincerely hope somebody can pinpoint the weak spot.
>>>>>
>>>>> Thanx a lot!!
>>>>>
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Gerard.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ~
>>>>> ~
>>>>> :wq!
>>>>> _______________________________________________
>>>>> Linux-HA mailing list
>>>>> [email protected]
>>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>>> See also: http://linux-ha.org/ReportingProblems
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> --
>>> >>> urls
>>> {'fun':  'www.zonderbroodje.nl',  'tech':  'www.gp-net.nl'}
>>>
>>>
>>
>
> --
> >>> urls
> {'fun':  'www.zonderbroodje.nl',  'tech':  'www.gp-net.nl'}
>
>
