Well, there is one cronjob that ran about 10 minutes before this which
is fairly read-intensive (dumping data out to a file).  I'm not sure if
it was still running at the time this happened, but I'll be watching
more closely now.

But is it possible that PostgreSQL was so busy for 60 seconds that it
would not allow a simple connect and "select now()"?  This is a fairly
powerful machine: dual quad-core Xeon X5460s at 3.16GHz with ~50GB RAM
and 15k RPM disks in a RAID 10.  Or maybe I need to tune PostgreSQL
better?
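One way to narrow this down is to run the same probe the resource agent runs, by hand, under a time limit during the cron window. A minimal sketch follows; the psql invocation in the comment is illustrative (port, user, and database are assumptions standing in for the $OCF_RESKEY_* values from the resource config), and `sleep` stands in for a stalled query here:

```shell
# coreutils timeout(1) exits with status 124 when it has to kill the
# command, which distinguishes a hung check from one that merely failed.
# Against the database it would look something like (values assumed):
#   timeout 60 psql -p 5432 -U postgres template1 -c 'select now();'
# Demonstrated with sleep standing in for a query that never returns:
timeout 1 sleep 5
echo "exit=$?"
```

If the manual run also stalls while the dump cronjob is active, the bottleneck is I/O contention rather than the resource agent; if it always returns instantly, raising the monitor timeout alone may be the answer.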

Thanks,
Brian

On Thu, 2010-03-04 at 11:09 -0700, Serge Dubrouski wrote:
> By default it monitors template1. It can be adjusted with the "pgdb"
> parameter. All the monitor function does is try to run "select
> now();" against the monitored database. If the monitor function times
> out, that means the following command didn't finish in time:
> 
> 
> $OCF_RESKEY_psql -p $OCF_RESKEY_pgport -U $OCF_RESKEY_pgdba
> $OCF_RESKEY_pgdb -c 'select now();'
> 
> Can you check that your system wasn't overloaded at the time when it
> happened? It's possible that there was a heavy cronjob or just a heavy
> user request against your database.
> 
> On Thu, Mar 4, 2010 at 10:42 AM, Brian Witt <[email protected]> wrote:
> > Hi,
> >
> > Thanks to Andreas for helping me to get our systems upgraded to
> > heartbeat 3.0.2.  But the underlying reason for upgrading is still
> > there.  We're running CentOS 5.4 x86_64, kernel-2.6.18-164.11.1.el5,
> > heartbeat-3.0.2-2.el5, pacemaker-1.0.7-4.el5, and
> > postgresql-server-8.1.18-2.el5_4.1.
> >
> > Every now and then, we'll get this error in the logs:
> >
> > Mar  3 23:14:59 repl-pri lrmd: [3942]: WARN: postgresql_4:monitor
> > process (PID 23116) timed out (try 1).  Killing with signal SIGTERM
> > (15).
> >
> > And then it restarts the postgresql instance.  Well, it tries; the
> > real issue is that sometimes it fails, leaving the cluster without a
> > running postgresql.  I think postgresql was working fine at the time;
> > I can't see any reason why it would have been unresponsive for the
> > full 60 seconds needed to trigger the timeout.
> >
> > 60 seconds before this happened, there is this in the logs:
> > Mar  3 23:13:59 repl-pri postgres[23115]: [3-1] LOG:  autovacuum:
> > processing database "template1"
> >
> > Autovacuum is running, but I can't tell any difference between this one
> > and all the other "processing template1" entries.  Which database does
> > the monitor connect to?  Is it template1?
> >
> > This used to happen with the virtual-ip_3 monitor also until I increased
> > its timeout from 5 seconds to 20 seconds.
> >
> > Here's more of the logs:
> >
> > Mar  3 23:14:59 repl-pri lrmd: [3942]: WARN: operation monitor[23] on
> > ocf::pgsql::postgresql_4 for client 3945, its parameters:
> > CRM_meta_interval=[120000] CRM_meta_timeout=[60000]
> > crm_feature_set=[3.0.1] CRM_meta_name=[monitor] : pid [23116] timed out
> > Mar  3 23:14:59 repl-pri crmd: [3945]: ERROR: process_lrm_event: LRM
> > operation postgresql_4_monitor_120000 (23) Timed Out (timeout=60000ms)
> > Mar  3 23:14:59 repl-pri crmd: [3945]: info: process_graph_event:
> > Detected action postgresql_4_monitor_120000 from a different transition:
> > 0 vs. 767
> > Mar  3 23:14:59 repl-pri crmd: [3945]: info: abort_transition_graph:
> > process_graph_event:462 - Triggered transition abort (complete=1,
> > tag=lrm_rsc_op, id=postgresql_4_monitor_120000,
> > magic=2:-2;38:0:0:445f9445-0541-41c4-9521-9381fc902012, cib=0.33.35) :
> > Old event
> > Mar  3 23:14:59 repl-pri crmd: [3945]: WARN: update_failcount:
> > Updating failcount for postgresql_4 on repl-pri.bart.gov after failed
> > monitor: rc=-2 (update=value++, time=1267686899)
> > Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: State
> > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> > cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> > Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_state_transition: All 2
> > cluster nodes are eligible to run resources.
> > Mar  3 23:14:59 repl-pri crmd: [3945]: info: do_pe_invoke: Query 844:
> > Requesting the current CIB: S_POLICY_ENGINE
> > Mar  3 23:14:59 repl-pri attrd: [3944]: info: attrd_local_callback:
> > Expanded fail-count-postgresql_4=value++ to 1
> > Mar  3 23:14:59 repl-pri attrd: [3944]: info: attrd_trigger_update:
> > Sending flush op to all hosts for: fail-count-postgresql_4 (1)
> > Mar  3 23:15:00 repl-pri attrd: [3944]: info: attrd_perform_update: Sent
> > update 61: fail-count-postgresql_4=1
> > Mar  3 23:15:00 repl-pri attrd: [3944]: info: attrd_trigger_update:
> > Sending flush op to all hosts for: last-failure-postgresql_4
> > (1267686899)
> > ...
> >
> > This goes on, but eventually it restarts postgresql (and the MailTo
> > resource sends a notification).  What's going on?
> >
> >
> > Here's the configuration from crm configure show:
> >
> >
> > node $id="1fc6f012-13de-4846-87c8-dbdcec4977bb" repl-pri.bart.gov \
> >        attributes standby="off"
> > node $id="8ca9762b-9a67-49d6-84c6-19bc4a6034a6" repl-bak.bart.gov
> > primitive Filesystem_2 ocf:heartbeat:Filesystem \
> >        op monitor interval="120s" timeout="60s" \
> >        params device="/dev/drbd0" directory="/var/lib/pgsql"
> > fstype="ext3" options="noatime"
> > primitive MailTo_5 ocf:heartbeat:MailTo \
> >        op monitor interval="120s" timeout="60s" \
> >        params email="<emailaddress>"
> > primitive drbd_1 ocf:heartbeat:drbd \
> >        params drbd_resource="repdata" \
> >        op monitor interval="59s" role="Master" timeout="20s" \
> >        op monitor interval="60s" role="Slave" timeout="20s"
> > primitive postgresql_4 ocf:heartbeat:pgsql \
> >        op monitor interval="120s" timeout="60s"
> > primitive virtual-ip_3 ocf:heartbeat:IPaddr2 \
> >        op monitor interval="60s" timeout="20s" \
> >        params ip="<ip address>"
> > group postgresql_group Filesystem_2 virtual-ip_3 postgresql_4 MailTo_5 \
> >        meta target-role="started"
> > ms ms-drbd0 drbd_1 \
> >        meta clone-max="2" clone-node-max="1" master-max="1"
> > master-node-max="1" notify="yes" globally-unique="false"
> > target-role="started"
> > location rsc_location_postgresql_group ms-drbd0 \
> >        rule $id="preferred_location_postgresql_group" $role="master"
> > 100: #uname eq repl-pri.bart.gov
> > colocation postgresql_group_on_drbd0 inf: postgresql_group
> > ms-drbd0:Master
> > order drbd0_before_postgresql_group : ms-drbd0:promote
> > postgresql_group:start
> > property $id="cib-bootstrap-options" \
> >        symmetric-cluster="true" \
> >        no-quorum-policy="ignore" \
> >        default-resource-stickiness="INFINITY" \
> >        default-resource-failure-stickiness="0" \
> >        stonith-enabled="false" \
> >        stonith-action="reboot" \
> >        startup-fencing="true" \
> >        stop-orphan-resources="true" \
> >        stop-orphan-actions="true" \
> >        remove-after-stop="false" \
> >        short-resource-names="true" \
> >        transition-idle-timeout="5min" \
> >        default-action-timeout="20s" \
> >        is-managed-default="true" \
> >        cluster-delay="60s" \
> >        pe-error-series-max="-1" \
> >        pe-warn-series-max="-1" \
> >        pe-input-series-max="-1" \
> >        dc-version="1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782" \
> >        last-lrm-refresh="1266996274" \
> >        cluster-infrastructure="Heartbeat" \
> >        expected-quorum-votes="2"
> >
> > Here's the output from crm_mon
> >
> > ============
> > Last updated: Thu Mar  4 09:27:50 2010
> > Stack: Heartbeat
> > Current DC: repl-pri.bart.gov (1fc6f012-13de-4846-87c8-dbdcec4977bb) -
> > partition with quorum
> > Version: 1.0.7-d3fa20fc76c7947d6de66db7e52526dc6bd7d782
> > 2 Nodes configured, 2 expected votes
> > 2 Resources configured.
> > ============
> >
> > Online: [ repl-pri.bart.gov repl-bak.bart.gov ]
> >
> >  Master/Slave Set: ms-drbd0
> >     Masters: [ repl-pri.bart.gov ]
> >     Slaves: [ repl-bak.bart.gov ]
> >  Resource Group: postgresql_group
> >     Filesystem_2       (ocf::heartbeat:Filesystem):    Started 
> > repl-pri.bart.gov
> >     virtual-ip_3       (ocf::heartbeat:IPaddr2):       Started 
> > repl-pri.bart.gov
> >     postgresql_4       (ocf::heartbeat:pgsql): Started repl-pri.bart.gov
> >     MailTo_5   (ocf::heartbeat:MailTo):        Started repl-pri.bart.gov
> >
> > Should I increase the timeouts or something?  Any insight would be much
> > appreciated!
> >
> > Thanks,
> > Brian
> >
> >
> >
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
> 
> 
> 
> -- 
> Serge Dubrouski.

-- 
Brian Witt
Senior Engineer
Fare Collection Engineering
San Francisco Bay Area Rapid Transit District
Phone 510.464.6568
Fax 510.464.6899
