Andrew Beekhof <[email protected]> writes:

> On 10/05/2013, at 11:37 PM, Ferenc Wagner <[email protected]> wrote:
>
>> An hour ago one node (n02) of our 4-node cluster started to shutdown.
>
> Someone, probably the init script, sent SIGTERM to pacemakerd.
Hi Andrew, thanks for the reply!

Here I actually meant a full system shutdown, not a Pacemaker shutdown.
To quote the previous part of our logs (the first three messages are
usual in normal system operation):

May 10 13:59:15 n02 lrmd: [10857]: debug: rsc:dlm:2 monitor[12] (pid 524)
May 10 13:59:15 n02 lrmd: [10857]: info: operation monitor[12] on dlm:2 for client 10860: pid 524 exited with return code 0
May 10 13:59:18 n02 corosync[10779]: [QUORUM] got quorate request on 0x187de10
May 10 13:59:41 n02 shutdown[657]: shutting down for system halt
May 10 13:59:41 n02 init: Switching to runlevel: 0
May 10 13:59:41 n02 shutdown[674]: shutting down for system halt
May 10 13:59:42 n02 logd: [786]: debug: Stopping ha_logd with pid 6449
May 10 13:59:42 n02 logd: [786]: info: Waiting for pid=6449 to exit
May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: received SIGTERM
May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: waiting for 0 messages to be read by write process
May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: sending SIGTERM to write process
May 10 13:59:42 n02 logd: [6484]: info: logd_term_write_action: received SIGTERM
May 10 13:59:42 n02 logd: [6484]: debug: Writing out 0 messages then quitting
May 10 13:59:42 n02 logd: [6484]: info: Exiting write process
May 10 13:59:42 n02 stunnel: LOG5[32668:139795269060352]: Terminated
May 10 13:59:42 n02 nrpe[6497]: Caught SIGTERM - shutting down...
May 10 13:59:42 n02 nrpe[6497]: Daemon shutdown
May 10 13:59:42 n02 rsyslogd-2177: imuxsock lost 4228 messages from pid 11083 due to rate-limiting
May 10 13:59:42 n02 lvm[11083]: Got new connection on fd 5
May 10 13:59:42 n02 corosync[10779]: [QUORUM] got quorate request on 0x187de10

>> No idea why.

The "shutting down for system halt" message is repeated, which makes me
suspicious of some BMC/ACPI malfunction, but I can find nothing in the
BMC event log, so this will be hard to confirm...

>> But during shutdown, it asked another node (n01) to shut down as
>> well:
>
> No it didn't. It asked n01 to perform an orderly shutdown (resources
> first, then the pacemaker daemons) of n02.

This would make perfect sense, and n01 certainly started to migrate the
resources from n02:

May 10 13:59:42 n01 pengine: [15536]: notice: stage6: Scheduling Node n02 for shutdown
May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Stop dlm:2#011(n02)
[...]
May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Migrate vm-alder#011(Started n02 -> n04)

But then suddenly:

May 10 13:59:45 n01 shutdown[31186]: shutting down for system halt
May 10 13:59:45 n01 init: Switching to runlevel: 0
[...]

If I understand you right, this can't be the result of some Pacemaker
action, which would be reassuring, as such an action would not make any
sense to me. So it's again those pesky goblins to blame, who spent the
lunch break throwing the power switches in our Fujitsu CX400 cluster
box in the server room. :-/

I wonder what should happen in such a situation, anyway. As n01 was not
running any cluster resources but clones, it shut down quickly, while
n02 was stuck trying to migrate some resources (big virtual machines)
away. But I couldn't log onto it anymore, and even the serial console
did not give a login prompt, possibly because it had also mostly shut
down by then. But the cluster was quorate, missing only n01... I'll dig
deeper into the logs, but what should I expect?
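For the digging, my plan is to replay the transitions the DC computed
from the saved pengine input files. A minimal sketch of what I mean;
the path and input number are made up here, the real ones are whatever
the pengine log messages reference:

  # Replay a saved pengine transition to see which actions the
  # cluster intended to run against n02's resources:
  crm_simulate -S -x /var/lib/pengine/pe-input-4242.bz2

  # Add -V (repeatable) for more detail on individual actions.

As far as I can tell, the pengine logs each calculated transition
together with the file it was saved to, so matching the timestamps
should pinpoint the right input.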
>> May 10 13:59:42 n02 pacemakerd: [10851]: info: crm_signal_dispatch: Invoking
>> handler for signal 15: Terminated
>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: pcmk_shutdown_worker:
>> Shuting down Pacemaker
>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: stop_child: Stopping crmd:
>> Sent -15 to process 10860
>> May 10 13:59:42 n02 crmd: [10860]: info: crm_signal_dispatch: Invoking
>> handler for signal 15: Terminated
>> May 10 13:59:42 n02 crmd: [10860]: notice: crm_shutdown: Requesting
>> shutdown, upper limit is 1200000ms
>> May 10 13:59:42 n02 crmd: [10860]: debug: crm_timer_start: Started Shutdown
>> Escalation (I_STOP:1200000ms), src=50
>> May 10 13:59:42 n02 crmd: [10860]: debug: s_crmd_fsa: Processing I_SHUTDOWN:
>> [ state=S_NOT_DC cause=C_SHUTDOWN origin=crm_shutdown ]
>> May 10 13:59:42 n02 crmd: [10860]: debug: do_fsa_action: actions:trace:
>> #011// A_SHUTDOWN_REQ
>> May 10 13:59:42 n02 crmd: [10860]: info: do_shutdown_req: Sending shutdown
>> request to n01
>>
>> Then hell broke loose, and I'm still pondering over the logs,
>
> Look for resource stop actions that failed.

I probably disrupted the last migration operations by resetting n02,
but otherwise the cluster recovered really nicely after I managed to
start all nodes again. So stop actions did not fail, only
migrate_froms.

Another question surfaced, though, prompted by the apparent lack of
monitoring: how are the action timeouts and intervals specified in the
resource agent metadata used by Pacemaker? During configuration I got
some warnings about timeout values I left unspecified being below the
recommended values; is that all? Are no default timeouts taken from
the RA metadata, with some Pacemaker default used instead? Is no
monitor action run at all if I don't specify one explicitly? It looks
like that, but then I wonder what that metadata is used for, exactly.
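For concreteness, this is roughly how I now spell out the operations
explicitly in the crm shell. I'm assuming vm-alder is an
ocf:heartbeat:VirtualDomain resource here, and the config path and
timeout values are only illustrative, not recommendations:

  primitive vm-alder ocf:heartbeat:VirtualDomain \
          params config="/etc/libvirt/qemu/alder.xml" \
          op start timeout="120s" \
          op stop timeout="180s" \
          op monitor interval="30s" timeout="60s" \
          op migrate_to timeout="240s" \
          op migrate_from timeout="240s"

Without the explicit op monitor line, Pacemaker indeed seems to run no
recurring monitor at all; the values in the RA metadata appear to be
advisory only, feeding the configuration tools' warnings, as far as I
can tell.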
--
Thanks,
Feri.