On 2/23/07, Keisuke MORI <[EMAIL PROTECTED]> wrote:
Hi,
We have found several problems with the pgsql RA through our testing.
It fails to fail over in some scenarios.
I'm proposing a patch to fix them.
Problem description:
1) The first 'monitor' may fail even if the postmaster was
successfully launched.
This is because 'start' of the pgsql RA may return before the
postmaster is ready to answer a psql query issued by
'monitor', since it only checks for the existence of the postmaster
process. The postmaster can take a few minutes to become ready,
particularly when it needs to recover the database
after a crash. Even when no recovery is necessary, we observed
that the first 'monitor' sometimes fails in our test cases.
2) The postmaster fails to start up when a 'postmaster.pid' file
was left over from a previous crash.
3) 'stop' does not execute the fast mode shutdown effectively,
because it executes the immediate mode shutdown at the very
next moment. The fast mode shutdown can take a few minutes
to complete while it flushes the database log.
This isn't a critical problem, but it may make the failover
take longer to complete (according to our
database team). It is preferable to wait for the fast
mode shutdown to complete for as long as possible.
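For problem 2, a leftover 'postmaster.pid' can be told apart from a live
one by checking whether the PID on its first line still names a running
process. A minimal sketch, not taken from the attached patch; the helper
name is_stale_pidfile is hypothetical:

```shell
# is_stale_pidfile PIDFILE
# Returns 0 if PIDFILE exists but does not name a live process,
# 1 if there is no file or the process still exists.
# Caveats: kill -0 also fails when the caller may not signal the
# process, so run this as the database owner or root; a reused PID
# belonging to an unrelated process would be reported as "not stale".
is_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1        # no file, nothing to clean up
    pid=$(head -n 1 "$pidfile")          # first line holds the PID
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;         # empty or non-numeric: stale
    esac
    if kill -0 "$pid" 2>/dev/null; then
        return 1                         # a process with that PID exists
    fi
    return 0
}
```

A cleanup step could then remove the file only when is_stale_pidfile
succeeds, rather than removing it unconditionally.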
Proposals to fix:
1) In 'start', wait until the postmaster is ready to answer by
performing the same check that 'monitor' does.
The maximum time to wait for startup can be
customized with an additional parameter 'start_wait'.
2) Add cleanup code for 'postmaster.pid' on stop and before starting.
3) In 'stop', wait until the postmaster completes the fast
mode shutdown.
The maximum time to wait for shutdown can be
customized with an additional parameter 'stop_wait'.
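The bounded wait in proposals 1) and 3) could look roughly like the
following. This is only a sketch of the idea, not the code in the
attached patch; wait_until_ready is a hypothetical helper and the
one-second polling interval is an assumption:

```shell
# wait_until_ready MAX_WAIT COMMAND [ARGS...]
# Runs COMMAND once per second until it succeeds.
# Returns 0 on success, 1 if MAX_WAIT seconds pass without success.
wait_until_ready() {
    max_wait=$1
    shift
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$max_wait" ]; then
            return 1                     # gave up waiting
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

In the RA, the command would be the same check that 'monitor' runs and
MAX_WAIT would come from 'start_wait' (or 'stop_wait', with the check
inverted).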
The attached patch is for the latest -dev.
I'd be more inclined to go with something like the patch below.
The function of start_wait and stop_wait is just as easily achieved by
setting the action's timeout. It's also harder to mess up (e.g. by
setting start_wait to longer than the start action's timeout).
diff -r 959f2c429fc3 resources/OCF/pgsql.in
--- a/resources/OCF/pgsql.in Fri Feb 23 10:59:12 2007 +0100
+++ b/resources/OCF/pgsql.in Fri Feb 23 12:18:53 2007 +0100
@@ -197,15 +197,12 @@ pgsql_start() {
return $OCF_ERR_GENERIC
fi
- if ! pgsql_status
- then
- sleep 5
- if ! pgsql_status
- then
- echo "ERROR: PostgreSQL is not running!"
- return $OCF_ERR_GENERIC
- fi
- fi
+
+ active=$OCF_NOT_RUNNING
+ while [ $active != 0 ]; do
+ pgsql_monitor
+ active=$?
+ done
return $OCF_SUCCESS
}
@@ -227,6 +224,13 @@ pgsql_stop() {
runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
fi
+ active=$OCF_SUCCESS
+ while [ $active != $OCF_NOT_RUNNING ]; do
+ pgsql_monitor
+ active=$?
+ done
+
+ rm -f $PIDFILE
return $OCF_SUCCESS
}
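The loops in the patch above call pgsql_monitor back to back with no
pause. A possible variant sleeps between checks and still leaves the
overall deadline to the action's timeout. This is a sketch only;
poll_until_rc is a hypothetical helper, not part of pgsql.in:

```shell
# poll_until_rc WANTED_RC INTERVAL COMMAND [ARGS...]
# Runs COMMAND repeatedly, sleeping INTERVAL seconds between attempts,
# until it exits with WANTED_RC. There is deliberately no internal
# time limit: the caller (here, the action timeout) enforces one.
poll_until_rc() {
    wanted=$1
    interval=$2
    shift 2
    while :; do
        rc=0
        "$@" || rc=$?
        [ "$rc" -eq "$wanted" ] && return 0
        sleep "$interval"
    done
}
```

In pgsql_start this might be used as
"poll_until_rc $OCF_SUCCESS 1 pgsql_monitor", and in pgsql_stop as
"poll_until_rc $OCF_NOT_RUNNING 1 pgsql_monitor".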
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/