On 04/07/11 23:38, Ulrich Windl wrote:
>>>> Tim Serong<[email protected]> wrote on 04.07.2011 at 15:27 in message
> <[email protected]>:
>> On 04/07/11 23:16, Ulrich Windl wrote:
>>>>>> Tim Serong<[email protected]> wrote on 04.07.2011 at 13:34 in
>>>>>> message
>>> <[email protected]>:
>>>> On 04/07/11 19:48, Ulrich Windl wrote:
>>>>> Hi!
>>>>>
>>>>> This was found in SLES11 SP1 (Version:
>>>> 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60): A resource is being
>>>> displayed as "(unmanaged) FAILED".
>>>>> I used "crm resource manage prm" to set the resource back to managed mode.
>>>> However the resource is still displayed as "unmanaged" by "crm_mon". When
>>>> inspecting the resource with "crm configure", the attribute is there as
>> 'meta
>>>> is-managed="true"'. So I guess the change in the CIB did not make its
>>>> way to
>>>> crm_mon. Don't ask me how or why; I'm asking you.
>>>>
>>>> I'd guess the cluster attempted to stop the resource for some reason,
>>>> but the stop failed, and STONITH is not configured. In this situation,
>>>> the cluster can't manage the resource (it's not safely/cleanly stopped,
>>>> and there's no way to kill the node it was running on to be sure).
>>>
>>> Hi Tim!
>>>
>>> You are correct: When I had STONITH enabled, both nodes were periodically
>> rebooting. That was not fun. I'm trying to find out what's going on. Not as
>> easy as I'd wish...
>>>
>>> I feel CRM is in "insulted mode": It does very little with failed
>> resources. Do I really have to reboot the node to enable resource management?
>>
>> If "stop" fails, there's not much it can do, because in the worst case,
>> there's no safe way to recover from that situation. On that note, you
>> might find http://ourobengr.com/ha useful.
>
> Hi!
>
> As I wrote before, I come from HP ServiceGuard. There, there is an
> intermediate state "starting" between "stopped" and "started", and an
> intermediate state "stopping" between "started" and "stopped". With
> Pacemaker, resources just change from one extreme to the other, and you
> cannot really see (from "crm_mon") which actions are currently running.
There is an option for Pacemaker to record pending ops. Using the CRM
shell:
# crm configure op_defaults record-pending=true
This means a pending start or stop (e.g.: "starting", "stopping") will
be recorded in the CIB while it's happening. This state is
unfortunately not exposed in the default crm_mon view (it will still
show "started" while it's "starting"), but you can see it if you ask
crm_mon to show you the resource op history:
# crm_mon --operations
This can get a bit long (it shows all ops for all resources). For a
resource that's starting, you'll see something like:
Operations:
* Node node-1:
foo:
+ (-1) start: rc=14 (status: unknown)
The "(-1) start" indicates a pending start. You can also try:
# crm_resource -O --resource foo
This will give something like:
Started : foo_start_0 (node=node-1, call=-1, rc=14): pending
Started : foo_monitor_0 (node=node-0, call=29, rc=7): complete
Here we see a probe, followed by a pending start. Later, once it's
running, you'll see something like:
Started : foo_start_0 (node=node-1, call=55, rc=0): complete
Started : foo_monitor_30000 (node=node-1, call=56, rc=0): complete
Started : foo_monitor_0 (node=node-0, call=29, rc=7): complete
That's probe, start, then recurring monitor (it's in order by call
number: 29, 55, 56 in this case).
Pending ops are exposed in a somewhat more friendly fashion in the
Python GUI (resources show as "starting" or "stopping"), and in Hawk
(resources show as "pending").
>> That being said, if *you* are looking at the system and you know the
>> resource is cleanly stopped (even though the cluster failed to stop it
>> for some reason), try "crm resource cleanup prm" and see if it comes
>> good again. Or, restart corosync/openais on that node. But! Check the
>> logs to see why the stop failed in the first place, and fix that :)
>
> That's the big problem: it's extremely hard to find out what actually
> made the resource's action fail.
It's all (or should be all) in syslog, but it can be tricky to find.
You want to look for messages from lrmd, and the resource agent in
question. Here's a (slightly trimmed) example, using a broken CTDB I
happened to have lying around:
> lrmd: [5265]: info: rsc:ctdb:43: start
> crmd: [5268]: info: te_rsc_command: Initiating action 5: probe_complete
> probe_complete on node-0 - no waiting
> crmd: [5268]: info: te_pseudo_action: Pseudo action 4 fired and confirmed
> crmd: [5268]: info: te_rsc_command: Initiating action 36: start ctdb_start_0
> on node-1 (local)
> crmd: [5268]: info: do_lrm_rsc_op: Performing
> key=36:94:0:092b2244-7df1-4a9f-88f2-37aa1a84737e op=ctdb_start_0 )
> crmd: [5268]: info: te_rsc_command: Recording pending op ctdb_start_0 in the
> CIB
> crmd: [5268]: info: create_operation_update: cib_action_update: Updating
> resouce ctdb after pending start op (interval=0)
> CTDB[7870]: [7885]: ERROR: /etc/ctdb/nodes does not exist.
> crmd: [5268]: info: process_lrm_event: LRM operation ctdb_start_0 (call=43,
> rc=2, cib-update=343, confirmed=true) invalid parameter
> attrd: [5266]: info: attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-ctdb (INFINITY)
> crmd: [5268]: WARN: status_from_rc: Action 36 (ctdb_start_0) on node-1 failed
> (target: 0 vs. rc: 2): Error
The above is the relevant part of /var/log/messages for a failed start
of my CTDB resource. Things to note from the above are:
> lrmd: [5265]: info: rsc:ctdb:43: start
That's lrmd trying to start the resource.
> CTDB[7870]: [7885]: ERROR: /etc/ctdb/nodes does not exist.
That's the CTDB RA complaining about a misconfiguration (which is what I
need to fix).
> crmd: [5268]: info: process_lrm_event: LRM operation ctdb_start_0 (call=43,
> rc=2, cib-update=343, confirmed=true) invalid parameter
> crmd: [5268]: WARN: status_from_rc: Action 36 (ctdb_start_0) on node-1 failed
> (target: 0 vs. rc: 2): Error
And that's Pacemaker noticing that the op failed.
To use another example, if you've got an apache resource, you'll have
error messages from the apache RA, but the form of the above should
remain roughly the same. You'll also want to check the logs for the
particular resource in question (e.g.: for apache, /var/log/apache or
/var/log/httpd or whatever it is).
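Since everything lands interleaved in one file, filtering by program name is usually the quickest way in. A rough sketch (the sample lines below are made up to mirror the excerpt above; on a real node you'd grep /var/log/messages directly, substituting your own resource agent's tag for CTDB):

```shell
# Hypothetical sample standing in for /var/log/messages; the program
# names (lrmd, crmd, CTDB) match the excerpt quoted above.
log='lrmd: [5265]: info: rsc:ctdb:43: start
crmd: [5268]: info: te_pseudo_action: Pseudo action 4 fired and confirmed
CTDB[7870]: [7885]: ERROR: /etc/ctdb/nodes does not exist.'

# Keep only lrmd's messages and the resource agent's own output:
printf '%s\n' "$log" | grep -E 'lrmd|CTDB'
```

On a real system that would be something like "grep -E 'lrmd|CTDB' /var/log/messages".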
> Another related question: In HP ServiceGuard you usually have one log file
> per "package" (which is like a resource group of pacemaker). Can something
> similar be configured with pacemaker?
It all goes to syslog - you can maybe configure syslog to put messages
from different daemons or log levels into different files, but it's been
a while since I tried doing that, so I can't offer much advice there.
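If you do want per-daemon files, something along these lines might work with rsyslog (a sketch I haven't tested; the file names are made up, and the program names are taken from the log excerpt above):

```
# /etc/rsyslog.d/cluster.conf -- untested sketch
if $programname == 'lrmd' then /var/log/cluster/lrmd.log
if $programname == 'CTDB' then /var/log/cluster/ctdb.log
```

That still won't give you one log per resource group the way ServiceGuard does, since Pacemaker itself doesn't log that way.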
Regards,
Tim
--
Tim Serong <[email protected]>
Senior Clustering Engineer, OPS Engineering, Novell Inc.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems