By default it monitors template1; that can be changed with the "pgdb"
parameter. All the monitor function does is run "select now();" against
the monitored database. If the monitor function times out, it means the
following command didn't finish in time:


$OCF_RESKEY_psql -p $OCF_RESKEY_pgport -U $OCF_RESKEY_pgdba \
$OCF_RESKEY_pgdb -c 'select now();'

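You can also reproduce the check by hand and see how long it really
takes. A rough sketch (the port, user, and database below are the RA
defaults; substitute your actual pgport/pgdba/pgdb values):

```shell
# Time the same query the RA runs, capped at the 60s monitor timeout.
# Port, user, and database here are assumptions -- adjust to your setup.
time timeout 60 psql -p 5432 -U postgres template1 -c 'select now();'
echo "exit status: $?"   # coreutils timeout returns 124 on a timeout
```

If that ever prints 124 (or takes anywhere near 60 seconds) while the
cluster is otherwise idle, the problem is on the database side rather
than in the resource agent.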
Can you check whether your system was overloaded at the time this
happened? It's possible there was a heavy cron job or simply a heavy
user query running against your database.
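If the machine does turn out to be busy around those times, raising the
monitor timeout (as you already did for virtual-ip_3) is a reasonable
workaround; you can change it with "crm configure edit". For example
(120s here is just an illustrative value, not a recommendation):

```
primitive postgresql_4 ocf:heartbeat:pgsql \
        op monitor interval="120s" timeout="120s"
```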

On Thu, Mar 4, 2010 at 10:42 AM, Brian Witt <[email protected]> wrote:
> Hi,
>
> Thanks to Andreas for helping me to get our systems upgraded to
> heartbeat 3.0.2.  But the underlying reason for upgrading is still
> there.  We're running CentOS 5.4 x86_64, kernel-2.6.18-164.11.1.el5,
> heartbeat-3.0.2-2.el5, pacemaker-1.0.7-4.el5, and
> postgresql-server-8.1.18-2.el5_4.1.
>
> Every now and then, we'll get this error in the logs:
>
> Mar  3 23:14:59 repl-pri lrmd: [3942]: WARN: postgresql_4:monitor
> process (PID 23116) timed out (try 1).  Killing with signal SIGTERM
> (15).
>
> And then it restarts the postgresql instance.  Well, it tries, and the
> real issue is that sometimes it fails, leaving the cluster without a
> running postgresql.  At this time I think postgresql was working fine;
> I certainly can't see any reason to think it wasn't responding for 60
> seconds for the timeout to happen.
>
> 60 seconds before this happened, there is this in the logs:
> Mar  3 23:13:59 repl-pri postgres[23115]: [3-1] LOG:  autovacuum:
> processing database "template1"
>
> Autovacuum is running, but I can't tell any difference between this one
> and all the other "processing template1" entries.  Which database does
> the monitor connect to?  Is it template1?
>
> This used to happen with the virtual-ip_3 monitor also until I increased
> its timeout from 5 seconds to 20 seconds.
>
> Here's more of the logs:
>
> Mar  3 23:14:59 repl-pri lrmd: [3942]: WARN: operation monitor[23] on
> ocf::pgsql::postgresql_4 for client 3945, its parameters:
> CRM_meta_interval=[120000] CRM_meta_timeout=[60000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23116] timed
> out
> Mar  3 23:14:59 repl-pri crmd: [3945]: ERROR: process_lrm_event: LRM
> operation postgresql_4_monitor_120000 (23) Timed Out (timeout=60000ms)
> Mar  3 23:14:59 repl-pri crmd: [3945]: info: process_graph_event:
> Detected action postgresql_4_monitor_120000 from a different transition:
> 0 vs. 767
> Mar  3 23:14:59 repl-pri crmd: [3945]: info: abort_transition_graph:
> process_graph_event:462 - Triggered transition abort (complete=1,
> tag=lrm_rsc_op, id=postgresql_4_monitor_120000,
> magic=2:-2;38:0:0:445f9445-0541-41c4-9521-9381fc902012, cib=0.33.35) :
> Old event
> Mar  3 23:14:59 repl-pri crmd:
> [3945]: WARN: update_failcount: Updating failcount for postgresql_4 on
> repl-pri.bart.gov after failed monitor: rc=-2 (update=value++,
> time=1267686899)
> Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_pe_invoke: Query 844:
> Requesting the current CIB: S_POLICY_ENGINE
> Mar  3 23:14:59 repl-pri attrd: [3944]: info: attrd_local_callback:
> Expanded fail-count-postgresql_4=value++ to 1
> Mar  3 23:14:59 repl-pri attrd: [3944]: info: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-postgresql_4 (1)
> Mar  3 23:15:00 repl-pri attrd: [3944]: info: attrd_perform_update: Sent
> update 61: fail-count-postgresql_4=1
> Mar  3 23:15:00 repl-pri attrd: [3944]: info: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-postgresql_4
> (1267686899)
> ...
>
> This goes on but eventually it restarts postgresql (and the mail thing
> to send a notification).  What's going on?
>
>
> Here's the configuration from crm configure show:
>
>
> node $id="1fc6f012-13de-4846-87c8-dbdcec4977bb" repl-pri.bart.gov \
>        attributes standby="off"
> node $id="8ca9762b-9a67-49d6-84c6-19bc4a6034a6" repl-bak.bart.gov
> primitive Filesystem_2 ocf:heartbeat:Filesystem \
>        op monitor interval="120s" timeout="60s" \
>        params device="/dev/drbd0" directory="/var/lib/pgsql"
> fstype="ext3" options="noatime"
> primitive MailTo_5 ocf:heartbeat:MailTo \
>        op monitor interval="120s" timeout="60s" \
>        params email="<emailaddress>"
> primitive drbd_1 ocf:heartbeat:drbd \
>        params drbd_resource="repdata" \
>        op monitor interval="59s" role="Master" timeout="20s" \
>        op monitor interval="60s" role="Slave" timeout="20s"
> primitive postgresql_4 ocf:heartbeat:pgsql \
>        op monitor interval="120s" timeout="60s"
> primitive virtual-ip_3 ocf:heartbeat:IPaddr2 \
>        op monitor interval="60s" timeout="20s" \
>        params ip="<ip address>"
> group postgresql_group Filesystem_2 virtual-ip_3 postgresql_4 MailTo_5 \
>        meta target-role="started"
> ms ms-drbd0 drbd_1 \
>        meta clone-max="2" clone-node-max="1" master-max="1"
> master-node-max="1" notify="yes" globally-unique="false"
> target-role="started"
> location rsc_location_postgresql_group ms-drbd0 \
>        rule $id="preferred_location_postgresql_group" $role="master"
> 100: #uname eq repl-pri.bart.gov
> colocation postgresql_group_on_drbd0 inf: postgresql_group
> ms-drbd0:Master
> order drbd0_before_postgresql_group : ms-drbd0:promote
> postgresql_group:start
> property $id="cib-bootstrap-options" \
>        symmetric-cluster="true" \
>        no-quorum-policy="ignore" \
>        default-resource-stickiness="INFINITY" \
>        default-resource-failure-stickiness="0" \
>        stonith-enabled="false" \
>        stonith-action="reboot" \
>        startup-fencing="true" \
>        stop-orphan-resources="true" \
>        stop-orphan-actions="true" \
>        remove-after-stop="false" \
>        short-resource-names="true" \
>        transition-idle-timeout="5min" \
>        default-action-timeout="20s" \
>        is-managed-default="true" \
>        cluster-delay="60s" \
>        pe-error-series-max="-1" \
>        pe-warn-series-max="-1" \
>        pe-input-series-max="-1" \
>        dc-version="1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782" \
>        last-lrm-refresh="1266996274" \
>        cluster-infrastructure="Heartbeat" \
>        expected-quorum-votes="2"
>
> Here's the output from crm_mon
>
> ============
> Last updated: Thu Mar  4 09:27:50 2010
> Stack: Heartbeat
> Current DC: repl-pri.bart.gov (1fc6f012-13de-4846-87c8-dbdcec4977bb) -
> partition with quorum
> Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ repl-pri.bart.gov repl-bak.bart.gov ]
>
>  Master/Slave Set: ms-drbd0
>     Masters: [ repl-pri.bart.gov ]
>     Slaves: [ repl-bak.bart.gov ]
>  Resource Group: postgresql_group
>     Filesystem_2       (ocf::heartbeat:Filesystem):    Started 
> repl-pri.bart.gov
>     virtual-ip_3       (ocf::heartbeat:IPaddr2):       Started 
> repl-pri.bart.gov
>     postgresql_4       (ocf::heartbeat:pgsql): Started repl-pri.bart.gov
>     MailTo_5   (ocf::heartbeat:MailTo):        Started repl-pri.bart.gov
>
> Should I increase the timeouts or something?  Any insight would be much
> appreciated!
>
> Thanks,
> Brian
>
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>



-- 
Serge Dubrouski.
