>>> Tim Serong <[email protected]> wrote on 04.07.2011 at 15:27 in message
<[email protected]>:
> On 04/07/11 23:16, Ulrich Windl wrote:
>>>>> Tim Serong <[email protected]> wrote on 04.07.2011 at 13:34 in message
>> <[email protected]>:
>>> On 04/07/11 19:48, Ulrich Windl wrote:
>>>> Hi!
>>>>
>>>> This was found in SLES11 SP1 (Version:
>>>> 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60): A resource is being
>>>> displayed as "(unmanaged) FAILED".
>>>> I used "crm resource manage prm" to set the resource back to managed
>>>> mode. However, the resource is still displayed as "unmanaged" by
>>>> "crm_mon". When inspecting the resource with "crm configure", the
>>>> attribute is there as 'meta is-managed="true"'. So I guess the change
>>>> in the CIB did not make its way to crm_mon. Don't ask me how or why;
>>>> I'm asking you.
>>>
>>> I'd guess the cluster attempted to stop the resource for some reason,
>>> but the stop failed, and STONITH is not configured. In this situation,
>>> the cluster can't manage the resource (it's not safely/cleanly stopped,
>>> and there's no way to kill the node it was running on to be sure).
>>
>> Hi Tim!
>>
>> You are correct: When I had STONITH enabled, both nodes were
>> periodically rebooting. That was not fun. I'm trying to find out
>> what's going on. Not as easy as I'd wish...
>>
>> I feel CRM is in "insulted mode": It does very little with failed
>> resources. Do I really have to reboot the node to enable resource
>> management?
>
> If "stop" fails, there's not much it can do, because in the worst case,
> there's no safe way to recover from that situation. On that note, you
> might find http://ourobengr.com/ha useful.

Hi!

As I wrote before, I come from HP ServiceGuard. There, an intermediate
state "starting" exists between "stopped" and "started", and an
intermediate state "stopping" exists between "started" and "stopped".
With Pacemaker, resources just change from one extreme to the other, and
you cannot really see (from "crm_mon") which actions are currently
running.

>
> That being said, if *you* are looking at the system and you know the
> resource is cleanly stopped (even though the cluster failed to stop it
> for some reason), try "crm resource cleanup prm" and see if it comes
> good again. Or, restart corosync/openais on that node. But! Check the
> logs to see why the stop failed in the first place, and fix that :)

That's the big problem: it's extremely hard to find out what actually
made the resource's action fail.

Another related question: In HP ServiceGuard you usually have one log
file per "package" (which is like a resource group in Pacemaker). Can
something similar be configured with Pacemaker?

Regards,
Ulrich

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
