Hi! I think that pacemaker is logging too much all the time, so you hardly can find out if there really is a problem. For example external/sbd is logging a message every time the shared disk is OK, that is every 30s or so. In contrast (just for example) HP ServiceGuard only logs if the status of the lock disk changed. And of course, I wouldn't complain if I hadn't done it better long time ago: I was writing a script to monitor out SAP R/3 system that consists of multiple processes. The logs look like this: SAP-manager@rksapas02[26588](2011-01-14 13:57:03):I:(SM::service_status): Status of SAP R/3 C11 is up (1) SAP-manager@rksapas02[26588](2011-01-14 13:57:03):I:(monitor): Starting monitoring (5 restarts) as service SAP-manager@rksapas02[26588](2011-01-18 15:33:20):W:(sap_status): Lost 1 out of 11 PIDs: 4628 SAP-manager@rksapas02[26588](2011-01-18 15:33:20):I:(sap_status): Found 1 new PIDs: 16270 SAP-manager@rksapas02[26588](2011-01-18 15:48:16):W:(sap_status): Lost 1 out of 11 PIDs: 450 SAP-manager@rksapas02[26588](2011-01-18 15:48:16):I:(sap_status): Found 1 new PIDs: 21307 SAP-manager@rksapas02[26588](2011-01-20 09:04:46):W:(sap_status): Lost 1 out of 11 PIDs: 21307 SAP-manager@rksapas02[26588](2011-01-20 09:04:46):I:(sap_status): Found 1 new PIDs: 9404
Note that this is not syslog, but just a plain text file. You can see WHO is logging (SAP-manager@rksapas02[26588]), WHEN (2011-01-18 15:48:16), and the category (W=warning, I=info, etc.), and finally the message. Note that even the system is monitored every 30 seconds, there are only messges when something changed, that is about four messages a day. Here is an example when the system actually went down: SAP-manager@rksapas02[26588](2011-04-01 15:02:11):W:(sap_status): Lost 9 out of 11 PIDs: 770 4508 4509 4510 7618 7637 12012 20237 25092 SAP-manager@rksapas02[26588](2011-04-01 15:02:11):I:(SM::status_change): SAP R/3 C11 is stopping SAP-manager@rksapas02[26588](2011-04-01 15:02:11):W:(SM::monitor_service): SAP R/3 C11 not up (status is 0) SAP-manager@rksapas02[26588](2011-04-01 15:02:11):I:(SM::monitor_service): Restarting SAP R/3 C11 (4 restarts remaining) Note the fine difference between "being up", "going down" and "being down" (I'm using three-value logic inside) Seeing pacemaker logs, I feel the programmers just left their personal debugging messages in there which nobody really understands. An example: May 17 17:08:44 rksaph01 crmd: [9463]: notice: run_graph: Transition 41626 (Comp lete=2, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pengine/pe-warn-134649.bz2): Stopped May 17 17:08:44 rksaph01 crmd: [9463]: info: te_graph_trigger: Transition 41626 is now complete May 17 17:08:44 rksaph01 crmd: [9463]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ] May 17 17:08:44 rksaph01 crmd: [9463]: info: do_state_transition: All 1 cluster nodes are eligible to run resources. May 17 17:08:44 rksaph01 crmd: [9463]: info: do_pe_invoke: Query 41656: Requesting the current CIB: S_POLICY_ENGINE May 17 17:08:44 rksaph01 crmd: [9463]: info: do_pe_invoke_callback: Invoking the PE: query=41656, ref=pe_calc-dc-1305644924-41640, seq=52, quorate=1 May 17 17:08:44 rksaph01 pengine: [9462]: notice: clone_print: Clone Set: cln_stonith_sbd [prm_stonith_sbd] May 17 17:08:44 rksaph01 pengine: [9462]: notice: short_print: Started: [ rksaph01 ] May 17 17:08:44 rksaph01 pengine: [9462]: notice: short_print: Stopped: [ prm_stonith_sbd:1 ] Regards, Ulrich >>> Dimitri Maziuk <[email protected]> schrieb am 17.05.2011 um 16:17 in Nachricht <[email protected]>: > On 5/17/2011 8:25 AM, Sascha Hagedorn wrote: > > Hi everyone, > ... > > - Pulled the HA network cable > > - Put it back after a couple of seconds > > > > Result: > > > > - Node 2 is being restarted > > - Load average on Node 1 increases until the system becomes > unreachable > > - A massive amount of log messages are being produced by OCFS2 on > the surviving node (see below) > > - The DRBD partition is not accessible > > - Node 1 cannot be rebooted only a hard reset brings it back to > life > > FWIW I've seen it on heartbeat-2.1.4 centos 5, too. Something sends it > into a tight loop spewing tons of log messages and using all CPU cycles > it can get. > > RHEL-5 kernels have a tendency to time out sockets under high CPU load, > so that effectively kills the entire cluster. > > Dima > > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems > _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
