Re: [Pacemaker] Service restoration in clone resource group

Sean Lutner Tue, 15 Oct 2013 16:46:11 -0700

On Oct 15, 2013, at 6:21 PM, Andrew Beekhof <and...@beekhof.net> wrote:


> 
> On 10/10/2013, at 12:52 PM, Sean Lutner <s...@rentul.net> wrote:
> 
>> 
>> On Oct 8, 2013, at 9:45 AM, Sean Lutner <s...@rentul.net> wrote:
>> 
>>> 
>>> On Oct 8, 2013, at 9:33 AM, Lars Marowsky-Bree <l...@suse.com> wrote:
>>> 
>>>> On 2013-10-08T09:29:14, Sean Lutner <s...@rentul.net> wrote:
>>>> 
>>>>> The clone was created using the interleave=true option, yes. 
> 
> You might want to trawl the raw xml to make sure pcs did the right thing.
>   cibadmin -Ql | grep interleave
> 
> would tell you.

Thanks, that's very helpful. I'll have a look.
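
For anyone following along, here is a minimal sketch of that check. It wraps the grep from the suggestion above so that a missing attribute is reported instead of printing nothing. The helper name show_interleave and the sample XML (including its ids) are made up for illustration; the real output from pcs may look different.

```shell
# Read CIB XML on stdin and print any interleave nvpair lines,
# or a short notice when none are present.
show_interleave() {
    grep 'name="interleave"' || echo "no interleave attribute found"
}

# On a live cluster node (assumes cibadmin is on PATH):
#   cibadmin -Ql | show_interleave

# Hypothetical sample of a clone definition, for illustration only:
show_interleave <<'EOF'
<clone id="example-clone">
  <meta_attributes id="example-clone-meta">
    <nvpair id="example-clone-interleave" name="interleave" value="true"/>
  </meta_attributes>
</clone>
EOF
```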

> 
>>>> 
>>>> Ok, so pcs hides that (interesting to know).
>>>> 
>>>>> Does this have an effect on what I'm trying to accomplish?
>>>> 
>>>> Yes, if you hadn't set that, it might have been an explanation. My best
>>>> guess right now would be to upgrade first; the PE has gotten quite a few
>>>> fixes since 1.1.8 again.
>>> 
>>> Are you indicating that the behavior I expect to see, which is the resource 
>>> being marked as Started on the now passive node, is what pacemaker should 
>>> be doing and this could be a bug?
>>> 
>>> If it would help, I can provide a full cib configuration and logs while I 
>>> execute the tests I've been running. I won't be able to do that until 
>>> tonight (EST), but I can if it would help.
>>> 
>>> Thanks
>>> Sean
>> 
>> Sorry for following up on my own post but I have a follow-up question about 
>> the failcount for a resource. Does a crm_resource --cleanup erase the 
>> failcount on the resource it's run against?
> 
> Older versions didn't but I don't exactly recall when we started doing that.

In practice that's what I'm observing, so it seems that 1.1.8 does.
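
A rough sketch of how this can be checked: clean up the resource, then look at the fail counts in a one-shot status. The resource name example_rsc and the run/cleanup_and_check helpers are hypothetical. By default the script only prints the commands; set DRY_RUN=0 on a cluster node to actually execute them.

```shell
# Print commands by default; execute them only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clean up one resource, then show a one-shot status with fail counts.
cleanup_and_check() {
    rsc=$1
    run crm_resource --cleanup --resource "$rsc"
    run crm_mon -1 --failcounts
}

# example_rsc is a placeholder, not a resource from my configuration.
cleanup_and_check example_rsc
```

If the fail count for the resource no longer appears in the crm_mon output after the cleanup, the cleanup erased it.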

> 
>> I'm looking at making changes to the failure-timeout and 
>> cluster-recheck-interval which when combined with my values of 
>> resource-stickiness=100 and migration-threshold=1 should allow for the 
>> services on the now failed node to be restarted and be marked as Started in 
>> the cluster without causing an unnecessary failover.
>> 
>> Does this make sense?
> 
> yes

I currently have my failure-timeout and cluster-recheck-interval both set to 
10m, but I'm not seeing the failcount clear. If I trigger a failover by 
stopping the resource/service, the failover works as expected. But if I then 
manually restart the services on the previously failed node, pacemaker never 
marks the resources as Started again.
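
For reference, a back-of-the-envelope sketch of how long the failcount could take to clear with these settings. It assumes (my understanding, not confirmed in this thread) that an expired failure-timeout is only noticed the next time the policy engine runs, which without other cluster events means the next cluster-recheck-interval tick. The function name is made up for illustration.

```shell
# Worst-case minutes before an expired failcount is acted on, assuming the
# expiry is only noticed at the next periodic recheck.
worst_case_clear_minutes() {
    failure_timeout=$1    # minutes
    recheck_interval=$2   # minutes
    echo $(( failure_timeout + recheck_interval ))
}

# Both values set to 10m, as in my configuration:
worst_case_clear_minutes 10 10
```

So under that assumption the count could legitimately stay around for up to 20 minutes after the failure; I have waited longer than that without seeing it clear.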

I think I may be hitting this bug you fixed back in May. The commit for the fix 
is https://github.com/beekhof/pacemaker/commit/d87de1b and the thread 
discussing the issue is 
http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg15979.html.

I think that fits what I'm seeing, because the default on-fail behavior for a 
stop operation is block (when STONITH is not enabled; with STONITH enabled the 
default is fence).

I will be pulling a newer version of pacemaker from git and building an RPM to 
test with.

> 
>> 
>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Lars
>>>> 
>>>> -- 
>>>> Architect Storage/HA
>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix 
>>>> Imendörffer, HRB 21284 (AG Nürnberg)
>>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
