Re: [Linux-HA] sometimes crm_resource -F fails

Serge Dubrouski Wed, 25 Jun 2008 08:30:09 -0700

On Wed, Jun 25, 2008 at 8:56 AM, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On Wed, Jun 25, 2008 at 14:57, Serge Dubrouski <[EMAIL PROTECTED]> wrote:
>> On Wed, Jun 25, 2008 at 6:15 AM, Serge Dubrouski <[EMAIL PROTECTED]> wrote:
>>> On Wed, Jun 25, 2008 at 5:29 AM, Dominik Klein <[EMAIL PROTECTED]> wrote:
>>>> Junko IKEDA wrote:
>>>>>>>>>
>>>>>>>>> Unfortunately, the latest package produced the same results.
>>>>>>>>> pgsql couldn't fail over using crm_resource -F.
>>>>>>>>
>>>>>>>> I think you perhaps misunderstand what -F does... it is intended to
>>>>>>>> tell the cluster that the resource failed.
>>>>>>>> Although it may move as well (depending on how you set up the scores),
>>>>>>>> this is not the primary goal.
>>>>>>>
>>>>>>> pgsql is configured to move to the other node if it fails.
>>>>>>> If crm_resource -F is called, pgsql's fail-count would be increased
>>>>>>> from 0 to 1, so pgsql should move to the appropriate node.
>>>>>>> But pgsql was just stopped, and not moved.
>>>>>>> Other resources were still running.
>>>>>>
>>>>>> Ah ok, sorry, just wanted to make sure the intended functionality was
>>>>>> clear.
>>>>>>
>>>>>> I had a look at the report and analysis.txt highlights the problem quite
>>>>>> well:
>>>>>>
>>>>>> pengine[20727]: 2008/06/23_11:02:40 ERROR: unpack_rsc_op: Hard error:
>>>>>> prmApPostgreSQLDB_fail_60000 failed with rc=2.
>>>>>> pengine[20727]: 2008/06/23_11:02:40 ERROR: unpack_rsc_op:   Preventing
>>>>>> prmApPostgreSQLDB from re-starting anywhere in the cluster
>>>>>>
>>>>>> It looks like the RA (incorrectly) returned 2 (invalid parameter),
>>>>>> instead of 3 (unimplemented function).
>>>>>> rc=2 tells the cluster that the configuration is invalid and not to
>>>>>> bother starting the resource elsewhere.
>>>>>
>>>>> !!! Does that mean there might be a problem in the pgsql RA?
>>>>>
>>>>> Thanks,
>>>>> Junko
>>>>>
>>>>>
>>>>
>>>> http://hg.linux-ha.org/dev/file/42ce605e3da5/resources/OCF/pgsql
>>>>
>>>> Look at the end of the script.
>>>>
>>>> If it is invoked in any other way, it calls usage which exits OCF_ERR_ARGS
>>>> (ie 2). See how it was called. This should be the reason.
>>>>
>>>> I wonder how this could pass ocf-tester. It does not support any of the
>>>> notify operations nor validate-all nor meta-data.
>>>>
>>>> Or am I looking at the wrong file?
>>>
>>> You are looking at the right file, and I submitted a patch for this
>>> problem a couple of weeks ago.
>>>
>> And here is one more patch that fixes the problem. Also I have a
>> couple of questions:
>>
>> 1. What is the 'fail' operation supposed to do?
>
> "fail" :-)


That is too broad an explanation :-)

I just wonder what the best implementation of the fail action in an RA
would be. In this "fixed" version pgsql just reports "NOT_IMPLEMENTED",
the CRM increases fail_count, and if the score still allows the resource
to stay on the current node, nothing else happens. I suspect that one
would expect a resource to be moved off the current node when
"crm_resource -F" is called, but I don't know how to implement that
correctly at the RA level.
Maybe the best way would be for the CRM not just to increase the
failcount but to set it to a value high enough to fail the resource over
to another node? In that case the RA would just stop the resource when
it's called with the "fail" action.

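To make the return-code point concrete, the dispatch at the end of an RA
could look roughly like the sketch below. This is not the actual pgsql
code: ra_dispatch is a made-up name, and the first branch is only a
placeholder for the real start/stop/monitor logic. The numeric values are
the standard OCF return codes discussed earlier in the thread (2 = invalid
parameter, 3 = unimplemented function).

```shell
#!/bin/sh
# OCF return codes (numeric values as defined by the OCF RA API).
OCF_SUCCESS=0
OCF_ERR_ARGS=2           # invalid configuration: cluster will not restart the resource elsewhere
OCF_ERR_UNIMPLEMENTED=3  # the requested action is simply not supported by this RA

# Hypothetical dispatcher: maps an action name to a return code.
ra_dispatch() {
    case "$1" in
        start|stop|monitor|meta-data|validate-all)
            # Placeholder: a real agent would do the actual work here.
            return $OCF_SUCCESS
            ;;
        *)
            # Unknown action such as "fail": report "unimplemented",
            # not "invalid parameter".
            return $OCF_ERR_UNIMPLEMENTED
            ;;
    esac
}
```

With this shape, "ra_dispatch fail" returns 3, so the failure is recorded
without the configuration being treated as invalid.
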
>
>> 2. Why isn't the '-F' option described in the help message for crm_resource?
>
> an oversight i guess
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



-- 
Serge Dubrouski.
