By default the monitor connects to template1; this can be changed with the "pgdb" parameter. All the monitor function does is run "select now();" against the monitored database. If the monitor function times out, it means that this query did not finish within the configured timeout.
The exact command the agent runs is:

$OCF_RESKEY_psql -p $OCF_RESKEY_pgport -U $OCF_RESKEY_pgdba $OCF_RESKEY_pgdb -c 'select now();'

Can you check whether your system was overloaded at the time it happened? It's possible that a heavy cron job, or simply a heavy user request against your database, was running then.

On Thu, Mar 4, 2010 at 10:42 AM, Brian Witt <[email protected]> wrote:
> Hi,
>
> Thanks to Andreas for helping me get our systems upgraded to
> heartbeat 3.0.2. But the underlying reason for upgrading is still
> there. We're running CentOS 5.4 x86_64, kernel-2.6.18-164.11.1.el5,
> heartbeat-3.0.2-2.el5, pacemaker-1.0.7-4.el5, and
> postgresql-server-8.1.18-2.el5_4.1.
>
> Every now and then, we'll get this error in the logs:
>
> Mar 3 23:14:59 repl-pri lrmd: [3942]: WARN: postgresql_4:monitor
> process (PID 23116) timed out (try 1). Killing with signal SIGTERM
> (15).
>
> And then it restarts the postgresql instance. Well, it tries, and the
> real issue is that sometimes it fails, leaving the cluster without a
> running postgresql. At this time I think postgresql was working fine;
> I certainly can't see any reason to think it wasn't responding for the
> 60 seconds needed for the timeout to happen.
>
> 60 seconds before this happened, there is this in the logs:
>
> Mar 3 23:13:59 repl-pri postgres[23115]: [3-1] LOG: autovacuum:
> processing database "template1"
>
> Autovacuum is running, but I can't tell any difference between this one
> and all the other "processing template1" entries. Which database does
> the monitor connect to? Is it template1?
>
> This used to happen with the virtual-ip_3 monitor also, until I increased
> its timeout from 5 seconds to 20 seconds.
>
> Here's more of the logs:
>
> Mar 3 23:14:59 repl-pri lrmd: [3942]: WARN: operation monitor[23] on
> ocf::pgsql::postgresql_4 for client 3945, its parameters:
> CRM_meta_interval=[120000] CRM_meta_timeout=[60000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23116] timed out
> Mar 3 23:14:59 repl-pri crmd: [3945]: ERROR: process_lrm_event: LRM
> operation postgresql_4_monitor_120000 (23) Timed Out (timeout=60000ms)
> Mar 3 23:14:59 repl-pri crmd: [3945]: info: process_graph_event:
> Detected action postgresql_4_monitor_120000 from a different transition:
> 0 vs. 767
> Mar 3 23:14:59 repl-pri crmd: [3945]: info: abort_transition_graph:
> process_graph_event:462 - Triggered transition abort (complete=1,
> tag=lrm_rsc_op, id=postgresql_4_monitor_120000,
> magic=2:-2;38:0:0:445f9445-0541-41c4-9521-9381fc902012, cib=0.33.35) :
> Old event
> Mar 3 23:14:59 repl-pri crmd: [3945]: WARN: update_failcount:
> Updating failcount for postgresql_4 on
> repl-pri.bart.gov after failed monitor: rc=-2 (update=value++,
> time=1267686899)
> Mar 3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar 3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Mar 3 23:14:59 repl-pri crmd: [3945]: info: do_pe_invoke: Query 844:
> Requesting the current CIB: S_POLICY_ENGINE
> Mar 3 23:14:59 repl-pri attrd: [3944]: info: attrd_local_callback:
> Expanded fail-count-postgresql_4=value++ to 1
> Mar 3 23:14:59 repl-pri attrd: [3944]: info: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-postgresql_4 (1)
> Mar 3 23:15:00 repl-pri attrd: [3944]: info: attrd_perform_update: Sent
> update 61: fail-count-postgresql_4=1
> Mar 3 23:15:00 repl-pri attrd: [3944]: info: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-postgresql_4
> (1267686899)
> ...
>
> This goes on, but eventually it restarts postgresql (and the mail thing
> to send a notification). What's going on?
>
> Here's the configuration from crm configure show:
>
> node $id="1fc6f012-13de-4846-87c8-dbdcec4977bb" repl-pri.bart.gov \
>     attributes standby="off"
> node $id="8ca9762b-9a67-49d6-84c6-19bc4a6034a6" repl-bak.bart.gov
> primitive Filesystem_2 ocf:heartbeat:Filesystem \
>     op monitor interval="120s" timeout="60s" \
>     params device="/dev/drbd0" directory="/var/lib/pgsql" \
>         fstype="ext3" options="noatime"
> primitive MailTo_5 ocf:heartbeat:MailTo \
>     op monitor interval="120s" timeout="60s" \
>     params email="<emailaddress>"
> primitive drbd_1 ocf:heartbeat:drbd \
>     params drbd_resource="repdata" \
>     op monitor interval="59s" role="Master" timeout="20s" \
>     op monitor interval="60s" role="Slave" timeout="20s"
> primitive postgresql_4 ocf:heartbeat:pgsql \
>     op monitor interval="120s" timeout="60s"
> primitive virtual-ip_3 ocf:heartbeat:IPaddr2 \
>     op monitor interval="60s" timeout="20s" \
>     params ip="<ip address>"
> group postgresql_group Filesystem_2 virtual-ip_3 postgresql_4 MailTo_5 \
>     meta target-role="started"
> ms ms-drbd0 drbd_1 \
>     meta clone-max="2" clone-node-max="1" master-max="1" \
>         master-node-max="1" notify="yes" globally-unique="false" \
>         target-role="started"
> location rsc_location_postgresql_group ms-drbd0 \
>     rule $id="preferred_location_postgresql_group" $role="master" \
>         100: #uname eq repl-pri.bart.gov
> colocation postgresql_group_on_drbd0 inf: postgresql_group ms-drbd0:Master
> order drbd0_before_postgresql_group : ms-drbd0:promote postgresql_group:start
> property $id="cib-bootstrap-options" \
>     symmetric-cluster="true" \
>     no-quorum-policy="ignore" \
>     default-resource-stickiness="INFINITY" \
>     default-resource-failure-stickiness="0" \
>     stonith-enabled="false" \
>     stonith-action="reboot" \
>     startup-fencing="true" \
>     stop-orphan-resources="true" \
>     stop-orphan-actions="true" \
>     remove-after-stop="false" \
>     short-resource-names="true" \
>     transition-idle-timeout="5min" \
>     default-action-timeout="20s" \
>     is-managed-default="true" \
>     cluster-delay="60s" \
>     pe-error-series-max="-1" \
>     pe-warn-series-max="-1" \
>     pe-input-series-max="-1" \
>     dc-version="1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782" \
>     last-lrm-refresh="1266996274" \
>     cluster-infrastructure="Heartbeat" \
>     expected-quorum-votes="2"
>
> Here's the output from crm_mon:
>
> ============
> Last updated: Thu Mar 4 09:27:50 2010
> Stack: Heartbeat
> Current DC: repl-pri.bart.gov (1fc6f012-13de-4846-87c8-dbdcec4977bb) -
> partition with quorum
> Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ repl-pri.bart.gov repl-bak.bart.gov ]
>
> Master/Slave Set: ms-drbd0
>     Masters: [ repl-pri.bart.gov ]
>     Slaves: [ repl-bak.bart.gov ]
> Resource Group: postgresql_group
>     Filesystem_2 (ocf::heartbeat:Filesystem): Started repl-pri.bart.gov
>     virtual-ip_3 (ocf::heartbeat:IPaddr2): Started repl-pri.bart.gov
>     postgresql_4 (ocf::heartbeat:pgsql): Started repl-pri.bart.gov
>     MailTo_5 (ocf::heartbeat:MailTo): Started repl-pri.bart.gov
>
> Should I increase the timeouts or something? Any insight would be much
> appreciated!
>
> Thanks,
> Brian
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

--
Serge Dubrouski.
