Hi!

I think that pacemaker is logging too much all the time, so you hardly can find 
out if there really is a problem. For example external/sbd is logging a message 
every time the shared disk is OK, that is every 30s or so.
In contrast (just for example) HP ServiceGuard only logs if the status of the 
lock disk changed.
And of course, I wouldn't complain if I hadn't done it better long time ago:
I was writing a script to monitor out SAP R/3 system that consists of multiple 
processes. The logs look like this:
SAP-manager@rksapas02[26588](2011-01-14 13:57:03):I:(SM::service_status): 
Status of SAP R/3 C11 is up (1)
SAP-manager@rksapas02[26588](2011-01-14 13:57:03):I:(monitor): Starting 
monitoring (5 restarts) as service
SAP-manager@rksapas02[26588](2011-01-18 15:33:20):W:(sap_status): Lost 1 out of 
11 PIDs: 4628
SAP-manager@rksapas02[26588](2011-01-18 15:33:20):I:(sap_status): Found 1 new 
PIDs: 16270
SAP-manager@rksapas02[26588](2011-01-18 15:48:16):W:(sap_status): Lost 1 out of 
11 PIDs: 450
SAP-manager@rksapas02[26588](2011-01-18 15:48:16):I:(sap_status): Found 1 new 
PIDs: 21307
SAP-manager@rksapas02[26588](2011-01-20 09:04:46):W:(sap_status): Lost 1 out of 
11 PIDs: 21307
SAP-manager@rksapas02[26588](2011-01-20 09:04:46):I:(sap_status): Found 1 new 
PIDs: 9404

Note that this is not syslog, but just a plain text file. You can see WHO is 
logging (SAP-manager@rksapas02[26588]), WHEN (2011-01-18 15:48:16), and the 
category (W=warning, I=info, etc.), and finally the message. Note that even the 
system is monitored every 30 seconds, there are only messges when something 
changed, that is about four messages a day.

Here is an example when the system actually went down:
SAP-manager@rksapas02[26588](2011-04-01 15:02:11):W:(sap_status): Lost 9 out of 
11 PIDs: 770 4508 4509 4510 7618 7637 12012 20237 25092
SAP-manager@rksapas02[26588](2011-04-01 15:02:11):I:(SM::status_change): SAP 
R/3  C11 is stopping
SAP-manager@rksapas02[26588](2011-04-01 15:02:11):W:(SM::monitor_service): SAP 
R/3 C11 not up (status is 0)
SAP-manager@rksapas02[26588](2011-04-01 15:02:11):I:(SM::monitor_service): 
Restarting SAP R/3 C11 (4 restarts remaining)

Note the fine difference between "being up", "going down" and "being down" (I'm 
using three-value logic inside)

Seeing pacemaker logs, I feel the programmers just left their personal 
debugging messages in there which nobody really understands. An example:

May 17 17:08:44 rksaph01 crmd: [9463]: notice: run_graph: Transition 41626 (Comp
lete=2, Pending=0, Fired=0, Skipped=2, Incomplete=0, 
Source=/var/lib/pengine/pe-warn-134649.bz2): Stopped
May 17 17:08:44 rksaph01 crmd: [9463]: info: te_graph_trigger: Transition 41626 
is now complete
May 17 17:08:44 rksaph01 crmd: [9463]: info: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=notify_crmd ]
May 17 17:08:44 rksaph01 crmd: [9463]: info: do_state_transition: All 1 cluster 
nodes are eligible to run resources.
May 17 17:08:44 rksaph01 crmd: [9463]: info: do_pe_invoke: Query 41656: 
Requesting the current CIB: S_POLICY_ENGINE
May 17 17:08:44 rksaph01 crmd: [9463]: info: do_pe_invoke_callback: Invoking 
the PE: query=41656, ref=pe_calc-dc-1305644924-41640, seq=52, quorate=1
May 17 17:08:44 rksaph01 pengine: [9462]: notice: clone_print:  Clone Set: 
cln_stonith_sbd [prm_stonith_sbd]
May 17 17:08:44 rksaph01 pengine: [9462]: notice: short_print:      Started: [ 
rksaph01 ]
May 17 17:08:44 rksaph01 pengine: [9462]: notice: short_print:      Stopped: [ 
prm_stonith_sbd:1 ]

Regards,
Ulrich

>>> Dimitri Maziuk <[email protected]> schrieb am 17.05.2011 um 16:17 in
Nachricht <[email protected]>:
> On 5/17/2011 8:25 AM, Sascha Hagedorn wrote:
> > Hi everyone,
> ...
> > -          Pulled the HA network cable
> > -          Put it back after a couple of seconds
> >
> > Result:
> >
> > -          Node 2 is being restarted
> > -          Load average on Node 1 increases until the system becomes 
> unreachable
> > -          A massive amount of log messages are being produced by OCFS2 on 
> the surviving node (see below)
> > -          The DRBD partition is not accessible
> > -          Node 1 cannot be rebooted only a hard reset brings it back to 
> life
> 
> FWIW I've seen it on heartbeat-2.1.4 centos 5, too. Something sends it 
> into a tight loop spewing tons of log messages and using all CPU cycles 
> it can get.
> 
> RHEL-5 kernels have a tendency to time out sockets under high CPU load, 
> so that effectively kills the entire cluster.
> 
> Dima
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected] 
> http://lists.linux-ha.org/mailman/listinfo/linux-ha 
> See also: http://linux-ha.org/ReportingProblems 
> 

 
 
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to