On Wed, Jun 25, 2008 at 11:06, Ivan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I had a serious problem with HA today on SLES10SP2 and HA 2.1.3. I am
> looking for somebody who could help me find out what went wrong with a
> setup which had worked for more than a year (HASI setup).
>
> What happened, briefly:
> - SERVER1 ran YUPVM and PRINTVM; SERVER2 ran NFSVM.
> - I put SERVER2 into standby; NFSVM migrated to SERVER1 fine, and all
>   services stopped fine on SERVER2.
> - ~5 seconds later YUPVM and PRINTVM, previously running perfectly on
>   the active SERVER1, stopped (in fact disappeared), and the cluster
>   stopped most of its services on the active SERVER1 node.
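(As an aside, for anyone reproducing this: the standby toggle described
above is normally driven with the crm_standby wrapper. A minimal sketch;
the exact flag names here are an assumption about the heartbeat 2.1.x
tools, so check `crm_standby --help` on your version.)

```shell
# Put SERVER2 into standby (stops/migrates its resources) and bring it back.
# -U names the target node, -v sets the standby value -- assumed flags,
# verify against your installed heartbeat 2.1.x manpages.
crm_standby -U SERVER2 -v on    # enter standby
crm_standby -U SERVER2 -v off   # return to normal operation
```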
They were stopped because the imagepool resource failed to stop on
SERVER1, and SERVER2 was in standby (thus unable to run resources), so
there was nowhere left for them to run. The reason it was asked to stop
is probably a bug in the old version.

> and reported all 3 VMs being stopped, but the NFSVM I migrated from
> the standby node was still running!!!

According to whom? The cluster lists it as failed (due to the failed
start action). Please use hb_report next time... it gathers all the
info needed to figure out what went wrong.

> The chaos a bit later, which you might find in the logs, was caused by
> atd not running, hence STONITH failed. I know about that by now. While
> I was testing some stuff I didn't want STONITH to kill my box, and it
> got forgotten. BUT the question is: what was wrong with SERVER1, and
> why did it have to stop its resources after the migration?
>
> What has changed?
> - I updated the SP1 SLES systems (dom0 and domUs) to SP2, getting
>   Xen 3.2 + HA 2.1.3.
> - Changed from a real-block-device setup to a file-image-based one,
>   using the blktap driver on top of an ocfs2 image pool.
>
> Other than this (which is quite a bit in itself), the setup I used was
> fine before.
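(For reference, hb_report is run on one cluster node and collects logs,
the CIB, and the stored PE inputs from all nodes into one archive. A
minimal sketch; the destination path is made up for illustration, and
flag details vary between releases, so check `hb_report --help`.)

```shell
# Gather everything from shortly before the incident onward.
# -f gives the start time of the window to collect;
# /tmp/standby-incident is a hypothetical destination name, and
# hb_report packs the result into an archive under that name.
hb_report -f "2008/06/25 12:00" /tmp/standby-incident
```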
> Relevant points:
>
> pengine[13601]: 2008/06/25_12:11:51 info: unpack_nodes: Node SERVER2 is in standby-mode
> pengine[13601]: 2008/06/25_12:11:51 info: determine_online_status: Node SERVER1 is online
> pengine[13601]: 2008/06/25_12:11:51 info: determine_online_status: Node SERVER2 is standby
>
> I don't really understand these:
>
> pengine[13601]: 2008/06/25_12:11:51 WARN: native_color: Resource STONITH-child:1 cannot run anywhere
> pengine[13601]: 2008/06/25_12:11:51 WARN: native_color: Resource pingd-child:1 cannot run anywhere
> pengine[13601]: 2008/06/25_12:11:51 WARN: native_color: Resource evmsd-child:1 cannot run anywhere
> pengine[13601]: 2008/06/25_12:11:51 WARN: native_color: Resource evms-child:1 cannot run anywhere
> pengine[13601]: 2008/06/25_12:11:51 WARN: native_color: Resource imagepool-child:1 cannot run anywhere
> pengine[13601]: 2008/06/25_12:11:51 WARN: native_color: Resource configpool-child:1 cannot run anywhere
>
> What was wrong with the other 2 VMs I cannot really understand, but
> this follows:
>
> pengine[13601]: 2008/06/25_12:11:51 notice: complex_migrate_reload: Migrating NFSVM from SERVER2 to SERVER1
> pengine[13601]: 2008/06/25_12:11:51 WARN: process_pe_message: Transition 29: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-1.bz2
> pengine[13601]: 2008/06/25_12:11:51 info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
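(Side note: the stored PE input named in that warning can be replayed
offline to see exactly why each resource "cannot run anywhere". A sketch
under the assumption that your heartbeat 2.1.x build ships ptest with
`-x`/`-s` options; verify against its manpage.)

```shell
# Check the live configuration for the warnings the PE complained about.
crm_verify -L -V

# Replay the saved transition input; -x reads the PE input file, -s
# prints the allocation scores per resource/node (assumed flags).
bunzip2 -k /var/lib/heartbeat/pengine/pe-warn-1.bz2
ptest -x /var/lib/heartbeat/pengine/pe-warn-1 -s -VV
```

The score dump makes it easy to see whether a resource was kept off a
node by a constraint, a failed action, or the standby attribute.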
> crmd[12993]: 2008/06/25_12:11:51 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
> tengine[13600]: 2008/06/25_12:11:51 info: unpack_graph: Unpacked transition 29: 75 actions in 75 synapses
> tengine[13600]: 2008/06/25_12:11:51 info: te_pseudo_action: Pseudo action 23 fired and confirmed
> tengine[13600]: 2008/06/25_12:11:51 info: te_pseudo_action: Pseudo action 30 fired and confirmed
> tengine[13600]: 2008/06/25_12:11:51 info: te_pseudo_action: Pseudo action 50 fired and confirmed
> tengine[13600]: 2008/06/25_12:11:51 info: te_pseudo_action: Pseudo action 65 fired and confirmed
> tengine[13600]: 2008/06/25_12:11:51 info: te_pseudo_action: Pseudo action 80 fired and confirmed
> tengine[13600]: 2008/06/25_12:11:51 info: send_rsc_command: Initiating action 84: PRINTVM_stop_0 on SERVER1
> tengine[13600]: 2008/06/25_12:11:51 info: send_rsc_command: Initiating action 86: NFSVM_migrate_to_0 on SERVER2
> tengine[13600]: 2008/06/25_12:11:51 info: send_rsc_command: Initiating action 89: YUPVM_stop_0 on SERVER1
>
> I get a few of these that I cannot interpret:
>
> lrmd[7117]: 2008/06/25_12:45:43 notice: on_msg_get_metadata: can not find the class #default.
>
> And:
>
> lrmd[7117]: 2008/06/25_14:30:42 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 200 ms (> 100 ms) before being called (GSource: 0x805dac0)
> lrmd[7117]: 2008/06/25_14:30:42 info: G_SIG_dispatch: started at 1718784691 should have started at 1718784671
>
> Somebody please...
> Thanks in advance,
> Ivan
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
