Hi,

Thanks to Andreas for helping me to get our systems upgraded to
heartbeat 3.0.2.  But the underlying reason for upgrading is still
there.  We're running CentOS 5.4 x86_64, kernel-2.6.18-164.11.1.el5,
heartbeat-3.0.2-2.el5, pacemaker-1.0.7-4.el5, and
postgresql-server-8.1.18-2.el5_4.1.

Every now and then, we'll get this error in the logs:

Mar  3 23:14:59 repl-pri lrmd: [3942]: WARN: postgresql_4:monitor
process (PID 23116) timed out (try 1).  Killing with signal SIGTERM
(15).

And then it tries to restart the postgresql instance.  The real issue
is that the restart sometimes fails, leaving the cluster without a
running postgresql.  As far as I can tell, postgresql was working fine
at the time; I can't see any reason it would have stopped responding
for the full 60 seconds needed to trigger the timeout.
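(As an aside, the SIGTERM in that WARN line is just lrmd killing the
overrun monitor op.  In miniature, using GNU coreutils `timeout` --
which stock CentOS 5 may not ship, so this is illustrative only:)

```shell
# An op that overruns its timeout gets SIGTERM, much like lrmd does to
# the monitor.  GNU timeout reports exit status 124 for a timed-out command.
timeout -s TERM 1 sleep 5
echo "exit=$?"
```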

60 seconds before this happened, there is this in the logs:
Mar  3 23:13:59 repl-pri postgres[23115]: [3-1] LOG:  autovacuum:
processing database "template1"

Autovacuum is running, but I can't tell any difference between this one
and all the other "processing template1" entries.  Which database does
the monitor connect to?  Is it template1?
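If I'm reading /usr/lib/ocf/resource.d/heartbeat/pgsql right, the
database defaults to template1 and the monitor boils down to a trivial
query -- but please correct me if that's wrong.  A rough manual
equivalent I could time by hand while autovacuum runs (assuming the
stock RA defaults of user postgres and database template1):

```shell
# Approximation of what the pgsql RA's monitor does (assumed defaults:
# user postgres, database template1).  Timing this during an autovacuum
# run might show whether connections stall.
time su - postgres -c 'psql -Atc "select now();" template1'
```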

This used to happen with the virtual-ip_3 monitor also until I increased
its timeout from 5 seconds to 20 seconds.
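If it comes down to timeouts again, I'd raise postgresql_4's the same
way -- something like this with the crm shell (a sketch; I haven't
double-checked the exact crmsh behavior on this version):

```shell
# Open the primitive in $EDITOR and change the op line to, e.g.:
#   op monitor interval="120s" timeout="120s"
crm configure edit postgresql_4
# then sanity-check the resulting configuration:
crm configure verify
```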

Here's more of the logs:

Mar  3 23:14:59 repl-pri lrmd: [3942]: WARN: operation monitor[23] on
ocf::pgsql::postgresql_4 for client 3945, its parameters:
CRM_meta_interval=[120000] CRM_meta_timeout=[60000]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23116] timed out
Mar  3 23:14:59 repl-pri crmd: [3945]: ERROR: process_lrm_event: LRM
operation postgresql_4_monitor_120000 (23) Timed Out (timeout=60000ms)
Mar  3 23:14:59 repl-pri crmd: [3945]: info: process_graph_event:
Detected action postgresql_4_monitor_120000 from a different transition:
0 vs. 767
Mar  3 23:14:59 repl-pri crmd: [3945]: info: abort_transition_graph:
process_graph_event:462 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=postgresql_4_monitor_120000,
magic=2:-2;38:0:0:445f9445-0541-41c4-9521-9381fc902012, cib=0.33.35) :
Old event
Mar  3 23:14:59 repl-pri crmd:
[3945]: WARN: update_failcount: Updating failcount for postgresql_4 on
repl-pri.bart.gov after failed monitor: rc=-2 (update=value++,
time=1267686899)
Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: All 2
cluster nodes are eligible to run resources.
Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_pe_invoke: Query 844:
Requesting the current CIB: S_POLICY_ENGINE
Mar  3 23:14:59 repl-pri attrd: [3944]: info: attrd_local_callback:
Expanded fail-count-postgresql_4=value++ to 1
Mar  3 23:14:59 repl-pri attrd: [3944]: info: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-postgresql_4 (1)
Mar  3 23:15:00 repl-pri attrd: [3944]: info: attrd_perform_update: Sent
update 61: fail-count-postgresql_4=1
Mar  3 23:15:00 repl-pri attrd: [3944]: info: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-postgresql_4
(1267686899)
...

This goes on, but eventually it restarts postgresql (and the MailTo
resource sends out a notification).  What's going on?
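In the meantime I've been recovering by hand -- commands as I
understand crmsh, so treat this as a sketch rather than gospel:

```shell
# Check how many monitor failures pacemaker has recorded on this node:
crm resource failcount postgresql_4 show repl-pri.bart.gov
# Clear the failed-op history so the resource can be rescheduled cleanly:
crm resource cleanup postgresql_4
```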


Here's the configuration from crm configure show:


node $id="1fc6f012-13de-4846-87c8-dbdcec4977bb" repl-pri.bart.gov \
        attributes standby="off"
node $id="8ca9762b-9a67-49d6-84c6-19bc4a6034a6" repl-bak.bart.gov
primitive Filesystem_2 ocf:heartbeat:Filesystem \
        op monitor interval="120s" timeout="60s" \
        params device="/dev/drbd0" directory="/var/lib/pgsql" fstype="ext3" options="noatime"
primitive MailTo_5 ocf:heartbeat:MailTo \
        op monitor interval="120s" timeout="60s" \
        params email="<emailaddress>"
primitive drbd_1 ocf:heartbeat:drbd \
        params drbd_resource="repdata" \
        op monitor interval="59s" role="Master" timeout="20s" \
        op monitor interval="60s" role="Slave" timeout="20s"
primitive postgresql_4 ocf:heartbeat:pgsql \
        op monitor interval="120s" timeout="60s"
primitive virtual-ip_3 ocf:heartbeat:IPaddr2 \
        op monitor interval="60s" timeout="20s" \
        params ip="<ip address>"
group postgresql_group Filesystem_2 virtual-ip_3 postgresql_4 MailTo_5 \
        meta target-role="started"
ms ms-drbd0 drbd_1 \
        meta clone-max="2" clone-node-max="1" master-max="1" master-node-max="1" notify="yes" globally-unique="false" target-role="started"
location rsc_location_postgresql_group ms-drbd0 \
        rule $id="preferred_location_postgresql_group" $role="master" 100: #uname eq repl-pri.bart.gov
colocation postgresql_group_on_drbd0 inf: postgresql_group ms-drbd0:Master
order drbd0_before_postgresql_group : ms-drbd0:promote postgresql_group:start
property $id="cib-bootstrap-options" \
        symmetric-cluster="true" \
        no-quorum-policy="ignore" \
        default-resource-stickiness="INFINITY" \
        default-resource-failure-stickiness="0" \
        stonith-enabled="false" \
        stonith-action="reboot" \
        startup-fencing="true" \
        stop-orphan-resources="true" \
        stop-orphan-actions="true" \
        remove-after-stop="false" \
        short-resource-names="true" \
        transition-idle-timeout="5min" \
        default-action-timeout="20s" \
        is-managed-default="true" \
        cluster-delay="60s" \
        pe-error-series-max="-1" \
        pe-warn-series-max="-1" \
        pe-input-series-max="-1" \
        dc-version="1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782" \
        last-lrm-refresh="1266996274" \
        cluster-infrastructure="Heartbeat" \
        expected-quorum-votes="2"

Here's the output from crm_mon:

============
Last updated: Thu Mar  4 09:27:50 2010
Stack: Heartbeat
Current DC: repl-pri.bart.gov (1fc6f012-13de-4846-87c8-dbdcec4977bb) -
partition with quorum
Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ repl-pri.bart.gov repl-bak.bart.gov ]

 Master/Slave Set: ms-drbd0
     Masters: [ repl-pri.bart.gov ]
     Slaves: [ repl-bak.bart.gov ]
 Resource Group: postgresql_group
     Filesystem_2       (ocf::heartbeat:Filesystem):    Started 
repl-pri.bart.gov
     virtual-ip_3       (ocf::heartbeat:IPaddr2):       Started 
repl-pri.bart.gov
     postgresql_4       (ocf::heartbeat:pgsql): Started repl-pri.bart.gov
     MailTo_5   (ocf::heartbeat:MailTo):        Started repl-pri.bart.gov

Should I increase the timeouts or something?  Any insight would be much
appreciated!

Thanks,
Brian



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
